ml4qc.surveymlclassifier module

Module for using machine learning classification techniques on survey data.

class ml4qc.surveymlclassifier.SurveyMLClassifier(x_train_df: DataFrame, y_train_df: DataFrame, x_predict_df: Optional[DataFrame] = None, test_size: Optional[Union[float, int]] = None, cv_when_training: bool = False, control_features: Optional[list[str]] = None, n_jobs: int = -2, random_state: Optional[Union[int, RandomState]] = None, categorical_features: Optional[list[str]] = None, verbose: Optional[bool] = None, calibration_method: Optional[str] = None, threshold: str = 'default', threshold_value: Optional[float] = None)

Bases: SurveyML

Class for using machine learning classification techniques on survey data.

__init__(x_train_df: DataFrame, y_train_df: DataFrame, x_predict_df: Optional[DataFrame] = None, test_size: Optional[Union[float, int]] = None, cv_when_training: bool = False, control_features: Optional[list[str]] = None, n_jobs: int = -2, random_state: Optional[Union[int, RandomState]] = None, categorical_features: Optional[list[str]] = None, verbose: Optional[bool] = None, calibration_method: Optional[str] = None, threshold: str = 'default', threshold_value: Optional[float] = None)

Initialize survey data for classification using machine learning techniques.

Parameters

x_train_df (pd.DataFrame) – Features for training dataset
y_train_df (pd.DataFrame) – Target(s) for training dataset
x_predict_df (pd.DataFrame) – Prediction dataset (required unless test_size used to take test set from training set)
test_size (Union[float, int]) – Float in the range (0, 1) for the proportion of the training dataset to use for testing; int for the number of training rows to use for testing; otherwise None to specify the prediction set manually in x_predict_df
cv_when_training (bool) – True to cross-validate when training models
control_features (list[str]) – List of features that should be used as controls (to transform all other features into their residuals, once variation from these features is controlled out via OLS)
n_jobs (int) – Number of parallel jobs to run during cross-validation (-1 for as many jobs as CPUs, -2 to leave one CPU free, etc.)
random_state (Union[int, np.random.RandomState]) – Fixed random state for reproducible results, otherwise None for random execution
categorical_features (list[str]) – List of feature names to force to categorical type (“other”), regardless of how they auto-detect (e.g., for categorical features that might auto-classify as numeric), otherwise None
verbose (bool) – True to report verbose results with print() calls
calibration_method (str) – ‘isotonic’ or ‘sigmoid’ to perform probability calibration, otherwise None
threshold (str) – Threshold to use for decision boundary: ‘default’ to let classifiers default to 0.5; ‘optimal_f’ to automatically choose the threshold that maximizes the F score in the training set (F1 by default, or F-beta if a beta value is given in threshold_value); ‘optimal_j’ to automatically choose the threshold that maximizes Youden’s J statistic in the training set; ‘fixed’ to use the fixed threshold given in threshold_value; ‘target’ to choose the threshold so that a proportion threshold_value of cases (e.g., 0.1 for 10%) is classified as positive
threshold_value (float) – Value to use if threshold is ‘fixed’ or ‘target’ (should be a value between 0 and 1); or, optionally, if threshold is ‘optimal_f’, the beta value to use for the F score
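
The ‘optimal_j’ and ‘target’ threshold options can be illustrated with a minimal standard-library sketch. This is not the library's implementation: the helper names youden_j_threshold and target_share_threshold and the sample data are hypothetical, and both classes are assumed to be present in y_true.

```python
def youden_j_threshold(y_true, proba):
    """Return the candidate threshold that maximizes J = TPR - FPR."""
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    best_threshold, best_j = 0.5, float("-inf")
    # Every distinct predicted probability is a candidate cut-off.
    for candidate in sorted(set(proba)):
        # Inclusive rule: probabilities at the threshold count as positive.
        tp = sum(1 for y, p in zip(y_true, proba) if y == 1 and p >= candidate)
        fp = sum(1 for y, p in zip(y_true, proba) if y == 0 and p >= candidate)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_threshold, best_j = candidate, j
    return best_threshold


def target_share_threshold(proba, share):
    """Return a threshold that flags roughly `share` of the rows as positive."""
    k = max(1, round(share * len(proba)))
    # The k-th highest probability becomes the (inclusive) cut-off.
    return sorted(proba, reverse=True)[k - 1]


# Hypothetical training labels and predicted positive-class probabilities.
y_true = [0, 0, 0, 1, 1, 0, 1, 1]
proba = [0.05, 0.10, 0.30, 0.40, 0.45, 0.60, 0.80, 0.95]

j_threshold = youden_j_threshold(y_true, proba)       # analogous to 'optimal_j'
target_threshold = target_share_threshold(proba, 0.25)  # analogous to 'target'
```

With tied probabilities, the ‘target’ style cut-off can flag slightly more than the requested share, which is why the docstring says "roughly".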

Note: Currently, only binary classification problems are supported.
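
The residual transformation described for control_features can be sketched for the simplest case of a single control and a single feature. This is an illustration under stated assumptions, not the library's code: the residualize helper and the data are hypothetical, and the OLS fit is assumed to include an intercept.

```python
from statistics import mean


def residualize(feature, control):
    """Return residuals of `feature` after a one-control OLS fit with intercept."""
    x_bar, y_bar = mean(control), mean(feature)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(control, feature))
        / sum((x - x_bar) ** 2 for x in control)
    )
    intercept = y_bar - slope * x_bar
    # What remains is the variation in the feature not explained by the control.
    return [y - (intercept + slope * x) for x, y in zip(control, feature)]


# Hypothetical data: interview duration partly driven by enumerator experience.
experience = [1.0, 2.0, 3.0, 4.0, 5.0]
duration = [32.0, 35.0, 41.0, 44.0, 48.0]

duration_resid = residualize(duration, experience)
```

By construction, the residuals sum to zero and are uncorrelated with the control, so a classifier trained on them cannot pick up variation that the control already explains.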

static build_nn_model(features: int, hidden_layers: int = 1, initial_units: int = 1, activation: str = 'relu', l2_regularization: bool = True, l2_factor: float = 0.001, include_dropout: bool = True, dropout_rate: float = 0.1, output_bias: Optional[float] = None) → Model

Build neural network model with fixed structure (each hidden layer with half as many units as the last).

Parameters

features (int) – Number of features for input layer
hidden_layers (int) – Number of hidden layers to include
initial_units (int) – Number of units in initial hidden layer (each additional hidden layer will have half as many as the last)
activation (str) – Activation function to use in hidden layers (e.g., ‘relu’ or ‘sigmoid’)
l2_regularization (bool) – True to include L2 regularization
l2_factor (float) – L2 regularization factor to use, if including L2 regularization
include_dropout (bool) – True to include dropout layers (starting with the input layer)
dropout_rate (float) – Dropout rate to use, if including dropout layers
output_bias (float) – Output bias to initialize with, if any

Returns

Model ready for fitting

Return type

tf.keras.models.Model
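
The halving structure can be previewed without building a model. The sketch below only computes hidden-layer sizes; the helper name hidden_layer_units is hypothetical, and the rounding rule (integer halving with a floor of one unit) is an assumption, since the rounding behavior is not documented here.

```python
def hidden_layer_units(initial_units, hidden_layers):
    """List the unit count of each hidden layer, halving at every step."""
    units = []
    current = initial_units
    for _ in range(hidden_layers):
        units.append(current)
        # Assumption: integer halving with a floor of one unit per layer.
        current = max(1, current // 2)
    return units


layer_sizes = hidden_layer_units(initial_units=64, hidden_layers=4)
```

Under these assumptions, initial_units=64 with hidden_layers=4 gives layers of 64, 32, 16, and 8 units.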

cv_for_best_hyperparameters(classifier, search_params: dict, model_scoring: str = 'f1', n_iter: int = 100) → dict

Run cross-validation process to search for best hyperparameters.

Parameters

classifier (Any) – Classifier to use for prediction (must be sklearn estimator)
search_params (dict) – Dictionary of search parameters
model_scoring (str) – Score to use for model evaluation (e.g., ‘f1’ or ‘neg_brier_score’)
n_iter (int) – Number of random CV iterations to attempt, during the search

Returns

Best parameters found in search

Return type

dict
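
The role of n_iter in a randomized search can be sketched as follows. This shows the general idea only; whether the method delegates to a particular scikit-learn search class is not stated here. The random_search helper, the search space, and the toy scoring function are all hypothetical, and a real search would score each candidate with a cross-validated metric such as ‘f1’.

```python
import random


def random_search(search_params, score_fn, n_iter=100, seed=0):
    """Sample n_iter parameter combinations and return the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one value per hyperparameter from its list of candidates.
        candidate = {name: rng.choice(values) for name, values in search_params.items()}
        score = score_fn(candidate)
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params


# Hypothetical search space for a tree-based classifier.
search_params = {"max_depth": [2, 4, 8, 16], "min_samples_leaf": [1, 5, 10]}


def toy_score(params):
    # Stand-in for a cross-validated score; higher is better.
    return -abs(params["max_depth"] - 8) - abs(params["min_samples_leaf"] - 5)


best = random_search(search_params, toy_score, n_iter=100, seed=0)
```

A larger n_iter samples more of the search space at a proportionally higher cost; the returned dictionary has one entry per key in search_params.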

report_feature_importance(importance_array: ndarray)

Report feature importance.

Parameters

importance_array (np.ndarray) – The appropriate importance array, depending on the classifier (note: use the fitted_estimator attribute to access the fitted model, in the case of calibration)
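
The importance array is typically something like feature_importances_ from a tree-based scikit-learn model or coef_[0] from a linear one. How the method formats its report is not documented here; the sketch below, with a hypothetical rank_features helper and made-up feature names, only shows how such an array pairs up with feature names.

```python
def rank_features(feature_names, importances, top_n=None):
    """Pair each feature with its importance, largest absolute value first."""
    ranked = sorted(
        zip(feature_names, importances),
        key=lambda pair: abs(pair[1]),
        reverse=True,
    )
    # top_n=None keeps the full ranking.
    return ranked[:top_n]


# Hypothetical feature names and importance values (e.g., linear coefficients).
names = ["duration", "hour_of_day", "n_skips"]
importances = [0.2, -0.5, 0.1]

ranking = rank_features(names, importances)
```

Sorting by absolute value matters for signed arrays such as coefficients, where a large negative weight is as informative as a large positive one.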

report_prediction_results(): Report out on prediction results (after run_prediction_model()).

run_prediction_model(classifier, supports_cv: bool = True)

Execute a classification model.

Parameters

classifier (Any) – Classifier to use for prediction (must be sklearn estimator)
supports_cv (bool) – False if the classifier doesn’t support cross-validation (with scores including ‘accuracy’, ‘precision’, ‘f1’, ‘roc_auc’, ‘neg_log_loss’, and ‘neg_brier_score’)

Returns

Predicted classifications for the prediction set

Return type

Any

In addition to the predictions that are returned, the following results can be found in member variables:

result_y_train_predicted - Predicted classifications for the training set
result_y_predict_predicted - Predicted classifications for the prediction set
result_y_predict_predicted_proba - Predicted probabilities for the prediction set
result_train_accuracy - Accuracy predicting within training set
result_train_precision - Precision predicting within training set
result_train_f1 - F1 score for predictions within training set
result_train_roc_auc - ROC AUC score for predictions within training set
result_predict_accuracy - Accuracy predicting within prediction set
result_predict_precision - Precision predicting within prediction set
result_predict_f1 - F1 score for predictions within prediction set
result_predict_roc_auc - ROC AUC score for predictions within prediction set
result_cv_scores - Cross-validation scores
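
For reference, the accuracy, precision, and F1 values stored in these member variables follow the standard binary definitions. The sketch below recomputes them from true and predicted labels; the binary_scores helper and the sample labels are hypothetical, and this is not the library's own scoring code.

```python
def binary_scores(y_true, y_pred):
    """Compute accuracy, precision, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against division by zero when nothing is predicted or labeled positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "f1": f1}


# Hypothetical true labels and predictions for eight survey submissions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

scores = binary_scores(y_true, y_pred)
```

Comparing the training-set and prediction-set versions of each score is a quick check for overfitting: a large gap suggests the model does not generalize.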

class ml4qc.surveymlclassifier.ThresholdClassifier(classifier, threshold: float = 0.5)

Bases: BaseEstimator, ClassifierMixin

Wrapper for classifiers, to support custom decision threshold (which is inclusive: predicted probabilities at the threshold predict as positive).

Note: Currently, only binary classification problems are supported.

__init__(classifier, threshold: float = 0.5)

Initialize classifier wrapper.

Parameters

classifier (Any) – Classifier to wrap
threshold (float) – Probability threshold to use for binary classification

decision_function(X): Return decision scores (in this case, probabilities shifted down by the threshold so that negative values are predicted as 0 and positive values are predicted as 1).

fit(X, y): Fit classifier to data.

predict(X): Return binary predictions.

predict_proba(X): Return prediction probabilities.
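
The relationship between predict_proba, decision_function, and predict can be sketched with a minimal stand-in. This is an illustration of the inclusive-threshold behavior described above, not the class's source: ThresholdSketch and FixedProbaModel are hypothetical, and plain lists are used instead of arrays to keep the example dependency-free.

```python
class FixedProbaModel:
    """Hypothetical stand-in for a fitted binary classifier."""

    def predict_proba(self, X):
        # Treat each row's single value as P(class 1), purely for illustration.
        return [[1.0 - row[0], row[0]] for row in X]


class ThresholdSketch:
    """Minimal illustration of an inclusive decision threshold."""

    def __init__(self, classifier, threshold=0.5):
        self.classifier = classifier
        self.threshold = threshold

    def predict_proba(self, X):
        return self.classifier.predict_proba(X)

    def decision_function(self, X):
        # Positive-class probability shifted down by the threshold.
        return [row[1] - self.threshold for row in self.predict_proba(X)]

    def predict(self, X):
        # Inclusive: a score of exactly zero is classified as positive.
        return [1 if score >= 0 else 0 for score in self.decision_function(X)]


X = [[0.10], [0.30], [0.55], [0.90]]
labels_default = ThresholdSketch(FixedProbaModel()).predict(X)
labels_lowered = ThresholdSketch(FixedProbaModel(), threshold=0.3).predict(X)
```

Lowering the threshold from 0.5 to 0.3 moves the second row, whose probability sits exactly at the threshold, into the positive class.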

set_score_request(*, sample_weight: Union[bool, None, str] = '$UNCHANGED$') → ThresholdClassifier

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). See the scikit-learn User Guide for how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns

self (object) – The updated object.