ml4qc.surveymlclassifier module

Module for using machine learning classification techniques on survey data.

class ml4qc.surveymlclassifier.SurveyMLClassifier(x_train_df: DataFrame, y_train_df: DataFrame, x_predict_df: Optional[DataFrame] = None, test_size: Optional[Union[float, int]] = None, cv_when_training: bool = False, control_features: Optional[list[str]] = None, n_jobs: int = -2, random_state: Optional[Union[int, RandomState]] = None, categorical_features: Optional[list[str]] = None, verbose: Optional[bool] = None, calibration_method: Optional[str] = None, threshold: str = 'default', threshold_value: Optional[float] = None)

Bases: SurveyML

Class for using machine learning classification techniques on survey data.

__init__(x_train_df: DataFrame, y_train_df: DataFrame, x_predict_df: Optional[DataFrame] = None, test_size: Optional[Union[float, int]] = None, cv_when_training: bool = False, control_features: Optional[list[str]] = None, n_jobs: int = -2, random_state: Optional[Union[int, RandomState]] = None, categorical_features: Optional[list[str]] = None, verbose: Optional[bool] = None, calibration_method: Optional[str] = None, threshold: str = 'default', threshold_value: Optional[float] = None)

Initialize survey data for classification using machine learning techniques.

Parameters
  • x_train_df (pd.DataFrame) – Features for training dataset

  • y_train_df (pd.DataFrame) – Target(s) for training dataset

  • x_predict_df (pd.DataFrame) – Prediction dataset (required unless test_size is used to hold out a test set from the training set)

  • test_size (Union[float, int]) – Float in (0, 1) for the proportion of the training dataset to use for testing; int for the number of training rows to use for testing; otherwise None to specify the prediction set manually in x_predict_df

  • cv_when_training (bool) – True to cross-validate when training models

  • control_features (list[str]) – List of features that should be used as controls (to transform all other features into their residuals, once variation from these features is controlled out via OLS)

  • n_jobs (int) – Number of parallel jobs to run during cross-validation (-1 for as many jobs as CPUs, -2 to leave one CPU free, etc.)

  • random_state (Union[int, np.random.RandomState]) – Fixed random state for reproducible results, otherwise None for random execution

  • categorical_features (list[str]) – List of feature names to force to categorical type (“other”), regardless of how they auto-detect (e.g., for categorical features that might auto-classify as numeric), otherwise None

  • verbose (bool) – True to report verbose results with print() calls

  • calibration_method (str) – ‘isotonic’ or ‘sigmoid’ to perform probability calibration, otherwise None

  • threshold (str) – Threshold to use for the decision boundary: ‘default’ to let classifiers default to 0.5; ‘optimal_f’ to automatically choose the threshold that maximizes the F score in the training set (F1 by default; pass a beta value in threshold_value to use a different F-beta score); ‘optimal_j’ to automatically choose the threshold that maximizes Youden’s J statistic in the training set; ‘fixed’ to use the fixed threshold specified in threshold_value; ‘target’ to choose the threshold so that threshold_value*100% of cases are classified as positive

  • threshold_value (float) – Value to use if threshold is ‘fixed’ or ‘target’ (should be value between 0 and 1); or, optionally, for optimal_f, the beta value to use for the F score
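The threshold options above can be illustrated with a small standalone sketch. This is not the ml4qc implementation: the helper names (confusion_counts, f_beta, youden_j, best_threshold, target_threshold) and the toy data are hypothetical, and the candidate thresholds are assumed to be the distinct predicted probabilities in the training set.

```python
from functools import partial


def confusion_counts(y_true, proba, threshold):
    """Count TP, FP, TN, FN; probabilities at the threshold count as positive."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, proba):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def f_beta(y_true, proba, threshold, beta=1.0):
    """F-beta score at a threshold (beta=1.0 gives the F1 score)."""
    tp, fp, _, fn = confusion_counts(y_true, proba, threshold)
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    return (1 + beta**2) * tp / denom if denom else 0.0


def youden_j(y_true, proba, threshold):
    """Youden's J statistic: sensitivity + specificity - 1."""
    tp, fp, tn, fn = confusion_counts(y_true, proba, threshold)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity + specificity - 1


def best_threshold(y_true, proba, score):
    """Assumption: try each distinct probability as a candidate threshold."""
    return max(sorted(set(proba)), key=lambda t: score(y_true, proba, t))


def target_threshold(proba, share):
    """Threshold that classifies roughly `share` of the rows as positive."""
    ranked = sorted(proba, reverse=True)
    k = max(1, round(share * len(ranked)))
    return ranked[k - 1]


# Hypothetical training labels and predicted probabilities
y_train = [0, 0, 1, 0, 1, 1, 0, 1, 0, 0]
p_train = [0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.30, 0.90, 0.15, 0.25]

t_f1 = best_threshold(y_train, p_train, f_beta)                       # like 'optimal_f'
t_f_half = best_threshold(y_train, p_train, partial(f_beta, beta=0.5))  # 'optimal_f' with beta
t_j = best_threshold(y_train, p_train, youden_j)                      # like 'optimal_j'
t_target = target_threshold(p_train, 0.3)                             # like 'target'
```

On this toy data the F1 and Youden's J criteria pick the same threshold, while a beta below 1 weights precision more heavily and moves the threshold higher.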

Note: Currently, only binary classification problems are supported.
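The control_features option transforms every other feature into its residual after an OLS fit on the controls. A minimal sketch of that idea for a single control follows. It is an illustration only, not the library's code: residualize and the two toy columns are hypothetical, and a real implementation with several controls would fit a multivariate regression.

```python
from statistics import mean


def residualize(feature, control):
    """Return the residuals of `feature` after a one-variable OLS fit on `control`.

    Assumes `control` is not constant (otherwise the slope is undefined).
    """
    c_mean, f_mean = mean(control), mean(feature)
    var_c = sum((c - c_mean) ** 2 for c in control)
    cov = sum((c - c_mean) * (f - f_mean) for c, f in zip(control, feature))
    slope = cov / var_c
    intercept = f_mean - slope * c_mean
    # What remains is the variation in the feature not explained by the control
    return [f - (intercept + slope * c) for c, f in zip(control, feature)]


# Hypothetical columns: a control and a feature that partly depends on it
enumerator_experience = [1.0, 2.0, 3.0, 4.0, 5.0]
interview_minutes = [31.0, 29.0, 26.0, 24.0, 20.0]

residuals = residualize(interview_minutes, enumerator_experience)
```

By construction the residuals have zero mean and are uncorrelated with the control, so a downstream classifier sees only the variation the control does not explain.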

static build_nn_model(features: int, hidden_layers: int = 1, initial_units: int = 1, activation: str = 'relu', l2_regularization: bool = True, l2_factor: float = 0.001, include_dropout: bool = True, dropout_rate: float = 0.1, output_bias: Optional[float] = None) → Model

Build neural network model with fixed structure (each hidden layer with half as many units as the last).

Parameters
  • features (int) – Number of features for input layer

  • hidden_layers (int) – Number of hidden layers to include

  • initial_units (int) – Number of units in initial hidden layer (each additional hidden layer will have half as many as the last)

  • activation (str) – Activation function to use in hidden layers (e.g., ‘relu’ or ‘sigmoid’)

  • l2_regularization (bool) – True to include L2 regularization

  • l2_factor (float) – L2 regularization factor to use, if including L2 regularization

  • include_dropout (bool) – True to include dropout layers (starting with the input layer)

  • dropout_rate (float) – Dropout rate to use, if including dropout layers

  • output_bias (float) – Output bias to initialize with, if any

Returns

Model ready for fitting

Return type

tf.keras.models.Model
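To make the fixed structure concrete, the sketch below computes the number of units per hidden layer from initial_units and hidden_layers. The helper hidden_layer_units is hypothetical, and the rounding rule (integer halving, never below one unit) is an assumption; the source does not say how odd unit counts are handled.

```python
def hidden_layer_units(initial_units, hidden_layers):
    """List the unit count of each hidden layer, halving from one layer to the next."""
    units = []
    current = initial_units
    for _ in range(hidden_layers):
        units.append(current)
        # Assumption: floor division, with a minimum of one unit per layer
        current = max(1, current // 2)
    return units


layer_sizes = hidden_layer_units(initial_units=32, hidden_layers=3)
```

With initial_units=32 and hidden_layers=3, this gives layers of 32, 16, and 8 units.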

cv_for_best_hyperparameters(classifier, search_params: dict, model_scoring: str = 'f1', n_iter: int = 100) → dict

Run cross-validation process to search for best hyperparameters.

Parameters
  • classifier (Any) – Classifier to use for prediction (must be sklearn estimator)

  • search_params (dict) – Dictionary of search parameters

  • model_scoring (str) – Score to use for model evaluation (e.g., ‘f1’ or ‘neg_brier_score’)

  • n_iter (int) – Number of random CV iterations to attempt during the search

Returns

Best parameters found in search

Return type

dict
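The n_iter parameter suggests a randomized search: sample parameter combinations, score each one, and keep the best. The sketch below shows that loop in plain Python. It is not the ml4qc implementation: random_search and toy_score are hypothetical, search_params is assumed to map parameter names to lists of candidate values, and the toy scoring function stands in for a real cross-validated score such as 'f1'.

```python
import random


def random_search(score_fn, search_params, n_iter=100, seed=0):
    """Sample n_iter random parameter combinations and return the best-scoring one."""
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        candidate = {name: rng.choice(values) for name, values in search_params.items()}
        score = score_fn(candidate)
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params


def toy_score(params):
    """Stand-in for a cross-validated score; peaks at max_depth=6, min_samples_leaf=5."""
    return -abs(params["max_depth"] - 6) - abs(params["min_samples_leaf"] - 5) / 10


# Hypothetical search space
search_params = {"max_depth": [2, 4, 6, 8], "min_samples_leaf": [1, 5, 10]}

best = random_search(toy_score, search_params, n_iter=200)
```

Because sampling is random, a larger n_iter raises the chance of covering the best combination at the cost of more model fits.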

report_feature_importance(importance_array: ndarray)

Report feature importance.

Parameters

importance_array (np.ndarray) – The appropriate importance array, depending on the classifier (note: use the fitted_estimator attribute to access the fitted model, in the case of calibration)

report_prediction_results()

Report prediction results (call after run_prediction_model()).

run_prediction_model(classifier, supports_cv: bool = True)

Execute a classification model.

Parameters
  • classifier (Any) – Classifier to use for prediction (must be sklearn estimator)

  • supports_cv (bool) – False if the classifier doesn’t support cross-validation (with scores including ‘accuracy’, ‘precision’, ‘f1’, ‘roc_auc’, ‘neg_log_loss’, and ‘neg_brier_score’)

Returns

Predicted classifications for the prediction set

Return type

Any

In addition to the predictions that are returned, the following results can be found in member variables:

  • result_y_train_predicted - Predicted classifications for the training set

  • result_y_predict_predicted - Predicted classifications for the prediction set

  • result_y_predict_predicted_proba - Predicted probabilities for the prediction set

  • result_train_accuracy - Accuracy predicting within training set

  • result_train_precision - Precision predicting within training set

  • result_train_f1 - F1 score for predictions within training set

  • result_train_roc_auc - ROC AUC score for predictions within training set

  • result_predict_accuracy - Accuracy predicting within prediction set

  • result_predict_precision - Precision predicting within prediction set

  • result_predict_f1 - F1 score for predictions within prediction set

  • result_predict_roc_auc - ROC AUC score for predictions within prediction set

  • result_cv_scores - Cross-validation scores

class ml4qc.surveymlclassifier.ThresholdClassifier(classifier, threshold: float = 0.5)

Bases: BaseEstimator, ClassifierMixin

Wrapper for classifiers, to support custom decision threshold (which is inclusive: predicted probabilities at the threshold predict as positive).

Note: Currently, only binary classification problems are supported.

__init__(classifier, threshold: float = 0.5)

Initialize classifier wrapper.

Parameters
  • classifier (Any) – Classifier to wrap

  • threshold (float) – Probability threshold to use for binary classification

decision_function(X)

Return decision scores (in this case, probabilities shifted down by the threshold, so that negative values are predicted as 0 and zero or positive values are predicted as 1).

fit(X, y)

Fit classifier to data.

predict(X)

Return binary predictions.

predict_proba(X)

Return prediction probabilities.
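The wrapper's behavior can be sketched without any dependencies. This is an illustration of the documented behavior (inclusive threshold, decision scores equal to probability minus threshold), not the actual class: ThresholdSketch and FixedProbabilityModel are hypothetical, and the real class also inherits from the scikit-learn BaseEstimator and ClassifierMixin bases.

```python
class FixedProbabilityModel:
    """Hypothetical stand-in classifier: treats the single feature as P(class 1)."""

    def fit(self, X, y):
        return self

    def predict_proba(self, X):
        return [[1 - row[0], row[0]] for row in X]


class ThresholdSketch:
    """Wrap a binary classifier and apply a custom, inclusive decision threshold."""

    def __init__(self, classifier, threshold=0.5):
        self.classifier = classifier
        self.threshold = threshold

    def fit(self, X, y):
        self.classifier.fit(X, y)
        return self

    def predict_proba(self, X):
        # Probabilities pass through unchanged; only the decision rule differs
        return self.classifier.predict_proba(X)

    def decision_function(self, X):
        # Positive-class probability shifted down by the threshold
        return [row[1] - self.threshold for row in self.predict_proba(X)]

    def predict(self, X):
        # Inclusive: a probability exactly at the threshold predicts as positive
        return [1 if row[1] >= self.threshold else 0 for row in self.predict_proba(X)]


X = [[0.2], [0.3], [0.5], [0.8]]
y = [0, 0, 1, 1]

default_model = ThresholdSketch(FixedProbabilityModel()).fit(X, y)
low_model = ThresholdSketch(FixedProbabilityModel(), threshold=0.3).fit(X, y)

default_predictions = default_model.predict(X)
low_predictions = low_model.predict(X)
low_scores = low_model.decision_function(X)
```

Lowering the threshold from 0.5 to 0.3 flips the second row to positive, because its probability sits exactly at the new threshold.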