ml4qc.surveymlclassifier module

Module for using machine learning classification techniques on survey data.

class ml4qc.surveymlclassifier.SurveyMLClassifier(x_train_df: DataFrame, y_train_df: DataFrame, x_predict_df: Optional[DataFrame] = None, test_size: Optional[Union[float, int]] = None, cv_when_training: bool = False, control_features: Optional[list[str]] = None, n_jobs: int = -2, random_state: Optional[Union[int, RandomState]] = None, categorical_features: Optional[list[str]] = None, verbose: Optional[bool] = None, calibration_method: Optional[str] = None, threshold: str = 'default', threshold_value: Optional[float] = None)

Bases: SurveyML

Class for using machine learning classification techniques on survey data.

__init__(x_train_df: DataFrame, y_train_df: DataFrame, x_predict_df: Optional[DataFrame] = None, test_size: Optional[Union[float, int]] = None, cv_when_training: bool = False, control_features: Optional[list[str]] = None, n_jobs: int = -2, random_state: Optional[Union[int, RandomState]] = None, categorical_features: Optional[list[str]] = None, verbose: Optional[bool] = None, calibration_method: Optional[str] = None, threshold: str = 'default', threshold_value: Optional[float] = None)

Initialize survey data for classification using machine learning techniques.

Parameters
  • x_train_df (pd.DataFrame) – Features for training dataset

  • y_train_df (pd.DataFrame) – Target(s) for training dataset

  • x_predict_df (pd.DataFrame) – Prediction dataset (required unless test_size used to take test set from training set)

  • test_size (Union[float, int]) – Float in the open interval (0, 1) for the proportion of the training dataset to use for testing; int for the number of training rows to use for testing; otherwise None to specify the prediction set manually in x_predict_df

  • cv_when_training (bool) – True to cross-validate when training models

  • control_features (list[str]) – List of features that should be used as controls (all other features are replaced by their residuals from an OLS regression on these controls, so that variation explained by the controls is removed)

  • n_jobs (int) – Number of parallel jobs to run during cross-validation (-1 for as many jobs as CPUs, -2 to leave one CPU free, etc.)

  • random_state (Union[int, np.random.RandomState]) – Fixed random state for reproducible results, otherwise None for random execution

  • categorical_features (list[str]) – List of feature names to force to categorical type (“other”), regardless of how they auto-detect (e.g., for categorical features that might auto-classify as numeric), otherwise None

  • verbose (bool) – True to report verbose results with print() calls

  • calibration_method (str) – ‘isotonic’ or ‘sigmoid’ to perform probability calibration, otherwise None

  • threshold (str) – Threshold to use for decision boundary: ‘default’ to let classifiers default to 0.5; ‘optimal_f’ to automatically choose the threshold that maximizes the F score in the training set (F1 by default, or F-beta if a beta value is given in threshold_value); ‘optimal_j’ to automatically choose the threshold that maximizes Youden’s J statistic in the training set; ‘fixed’ to use the fixed threshold given in threshold_value; ‘target’ to choose the threshold so that threshold_value*100% of cases are classified as positive

  • threshold_value (float) – Value to use if threshold is ‘fixed’ or ‘target’ (should be a value between 0 and 1); or, optionally, if threshold is ‘optimal_f’, the beta value to use for the F score
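The control_features transformation can be illustrated for the simplest case of one feature and one control. This is a minimal pure-Python sketch under stated assumptions, not ml4qc's implementation: residualize is a hypothetical helper that fits a one-variable OLS regression (with intercept) and returns the residuals, and the data values are made up for illustration.

```python
def residualize(feature, control):
    """Return residuals of feature after regressing it on control via OLS."""
    n = len(feature)
    mean_x = sum(control) / n
    mean_y = sum(feature) / n
    # Closed-form OLS slope and intercept for a single regressor.
    sxx = sum((x - mean_x) ** 2 for x in control)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(control, feature))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual = observed value minus the value predicted from the control.
    return [y - (intercept + slope * x) for x, y in zip(control, feature)]


# Hypothetical data: interview duration partly explained by enumerator experience.
experience = [1.0, 2.0, 3.0, 4.0, 5.0]
duration = [52.0, 49.0, 47.0, 44.0, 43.0]

duration_resid = residualize(duration, experience)
```

By construction, the residuals sum to zero and are uncorrelated with the control, so whatever signal remains in duration_resid is not explained by experience.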

Note: Currently, only binary classification problems are supported.
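The ‘optimal_j’ and ‘target’ threshold options can be illustrated with a minimal pure-Python sketch. The helper names youden_j_threshold and target_threshold are hypothetical and not part of ml4qc; the library's actual implementation may differ, for example in how it picks candidate thresholds or breaks ties. The sketch assumes both classes are present and that a probability equal to the threshold counts as positive.

```python
def youden_j_threshold(y_true, proba):
    """Return the candidate threshold that maximizes Youden's J = TPR - FPR.

    Assumption: candidates are the distinct predicted probabilities, and
    a probability >= threshold counts as a positive prediction.
    """
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    best_threshold, best_j = 0.5, float("-inf")
    for candidate in sorted(set(proba)):
        tp = sum(1 for y, p in zip(y_true, proba) if y == 1 and p >= candidate)
        fp = sum(1 for y, p in zip(y_true, proba) if y == 0 and p >= candidate)
        j = tp / positives - fp / negatives
        if j > best_j:  # strict comparison keeps the lowest tied threshold
            best_threshold, best_j = candidate, j
    return best_threshold


def target_threshold(proba, target_share):
    """Return a threshold that classifies roughly target_share of rows as positive."""
    ranked = sorted(proba, reverse=True)
    n_positive = max(1, round(target_share * len(ranked)))
    return ranked[n_positive - 1]


# Made-up labels and predicted probabilities for illustration.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
proba = [0.05, 0.10, 0.30, 0.35, 0.40, 0.70, 0.80, 0.90]

j_threshold = youden_j_threshold(y_true, proba)
share_threshold = target_threshold(proba, 0.25)
flagged = [p >= share_threshold for p in proba]
```

With these values, the J-maximizing threshold is 0.35, and a target share of 0.25 yields a threshold of 0.8, which flags two of the eight rows.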

static build_nn_model(features: int, hidden_layers: int = 1, initial_units: int = 1, activation: str = 'relu', l2_regularization: bool = True, l2_factor: float = 0.001, include_dropout: bool = True, dropout_rate: float = 0.1, output_bias: Optional[float] = None) Model

Build neural network model with fixed structure (each hidden layer with half as many units as the last).

Parameters
  • features (int) – Number of features for input layer

  • hidden_layers (int) – Number of hidden layers to include

  • initial_units (int) – Number of units in initial hidden layer (each additional hidden layer will have half as many as the last)

  • activation (str) – Activation function to use in hidden layers (e.g., ‘relu’ or ‘sigmoid’)

  • l2_regularization (bool) – True to include L2 regularization

  • l2_factor (float) – L2 regularization factor to use, if including L2 regularization

  • include_dropout (bool) – True to include dropout layers (starting with the input layer)

  • dropout_rate (float) – Dropout rate to use, if including dropout layers

  • output_bias (float) – Output bias to initialize with, if any

Returns

Model ready for fitting

Return type

tf.keras.models.Model
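To make the fixed structure concrete, the following sketch lists the hidden-layer widths implied by initial_units and hidden_layers. It does not build a Keras model, and hidden_layer_units is a hypothetical helper. The rounding rule (integer division with a floor of one unit) is an assumption; the library may round odd unit counts differently.

```python
def hidden_layer_units(initial_units, hidden_layers):
    """List the unit count of each hidden layer, halving at every step.

    Assumption: halving uses integer division with a minimum of one unit.
    """
    units = []
    current = initial_units
    for _ in range(hidden_layers):
        units.append(current)
        current = max(1, current // 2)
    return units


layer_sizes = hidden_layer_units(initial_units=64, hidden_layers=3)
```

Here layer_sizes is [64, 32, 16], so the default initial_units=1 gives a very small network unless a larger value is passed.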

cv_for_best_hyperparameters(classifier, search_params: dict, model_scoring: str = 'f1', n_iter: int = 100) dict

Run cross-validation process to search for best hyperparameters.

Parameters
  • classifier (Any) – Classifier to use for prediction (must be sklearn estimator)

  • search_params (dict) – Dictionary of search parameters

  • model_scoring (str) – Score to use for model evaluation (e.g., ‘f1’ or ‘neg_brier_score’)

  • n_iter (int) – Number of random CV iterations to attempt, during the search

Returns

Best parameters found in search

Return type

dict
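The n_iter parameter suggests a randomized search, in which each iteration evaluates one randomly drawn parameter combination. The sketch below shows only that sampling step in pure Python. It rests on assumptions: sample_candidates is a hypothetical helper, the search space is expressed as lists of candidate values, and the parameter names are illustrative. The format that search_params actually requires is not stated here, and the real method would also cross-validate each candidate and rank it by model_scoring.

```python
import random


def sample_candidates(search_params, n_iter, seed=None):
    """Draw n_iter random parameter combinations from lists of candidate values."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in search_params.items()}
        for _ in range(n_iter)
    ]


# Hypothetical search space; the keys mirror common logistic regression
# parameters but are only illustrative here.
search_params = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "class_weight": [None, "balanced"],
}

candidates = sample_candidates(search_params, n_iter=5, seed=42)
```

Passing a seed makes the draw reproducible, which plays the same role as random_state in the class constructor.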

report_feature_importance(importance_array: ndarray)

Report feature importance.

Parameters

importance_array (np.ndarray) – The appropriate importance array, depending on the classifier (note: use the fitted_estimator attribute to access the fitted model, in the case of calibration)

report_prediction_results()

Report prediction results (call after run_prediction_model()).

run_prediction_model(classifier, supports_cv: bool = True)

Execute a classification model.

Parameters
  • classifier (Any) – Classifier to use for prediction (must be sklearn estimator)

  • supports_cv (bool) – False if the classifier doesn’t support cross-validation (with scores including ‘accuracy’, ‘precision’, ‘f1’, ‘roc_auc’, ‘neg_log_loss’, and ‘neg_brier_score’)

Returns

Predicted classifications for the prediction set

Return type

Any

In addition to the predictions that are returned, the following results can be found in member variables:

  • result_y_train_predicted - Predicted classifications for the training set

  • result_y_predict_predicted - Predicted classifications for the prediction set

  • result_y_predict_predicted_proba - Predicted probabilities for the prediction set

  • result_train_accuracy - Accuracy predicting within training set

  • result_train_precision - Precision predicting within training set

  • result_train_f1 - F1 score for predictions within training set

  • result_train_roc_auc - ROC AUC score for predictions within training set

  • result_predict_accuracy - Accuracy predicting within prediction set

  • result_predict_precision - Precision predicting within prediction set

  • result_predict_f1 - F1 score for predictions within prediction set

  • result_predict_roc_auc - ROC AUC score for predictions within prediction set

  • result_cv_scores - Cross-validation scores
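As a reminder of what the accuracy, precision, and F1 attributes above measure, here is a minimal pure-Python sketch for binary labels. binary_metrics is a hypothetical helper, not part of ml4qc, and the library may compute these values differently (for example, through scikit-learn's metric functions).

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, and F1 for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when nothing is predicted or labeled positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "f1": f1}


# Made-up labels and predictions for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

metrics = binary_metrics(y_true, y_pred)
```

In this example there are three true positives, one false positive, one false negative, and three true negatives, so all three metrics equal 0.75.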

class ml4qc.surveymlclassifier.ThresholdClassifier(classifier, threshold: float = 0.5)

Bases: BaseEstimator, ClassifierMixin

Wrapper for classifiers, to support custom decision threshold (which is inclusive: predicted probabilities at the threshold predict as positive).

Note: Currently, only binary classification problems are supported.

__init__(classifier, threshold: float = 0.5)

Initialize classifier wrapper.

Parameters
  • classifier (Any) – Classifier to wrap

  • threshold (float) – Probability threshold to use for binary classification

decision_function(X)

Return decision scores (in this case, probabilities shifted down by the threshold, so that negative values are predicted as 0 and zero or positive values are predicted as 1).

fit(X, y)

Fit classifier to data.

predict(X)

Return binary predictions.

predict_proba(X)

Return prediction probabilities.
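How predict(), predict_proba(), and decision_function() relate under an inclusive threshold can be sketched with a small stand-in class. This is an illustration under stated assumptions, not the ml4qc implementation: SketchThresholdClassifier and FixedProbabilityModel are hypothetical names, and the sketch omits fit() and the scikit-learn base classes.

```python
class SketchThresholdClassifier:
    """Illustrative stand-in for a threshold wrapper (not the ml4qc class)."""

    def __init__(self, classifier, threshold=0.5):
        self.classifier = classifier
        self.threshold = threshold

    def predict_proba(self, X):
        # Delegate to the wrapped classifier; rows are [P(class 0), P(class 1)].
        return self.classifier.predict_proba(X)

    def decision_function(self, X):
        # Positive-class probability shifted down by the threshold.
        return [row[1] - self.threshold for row in self.predict_proba(X)]

    def predict(self, X):
        # Inclusive threshold: a probability equal to the threshold predicts 1.
        return [1 if row[1] >= self.threshold else 0 for row in self.predict_proba(X)]


class FixedProbabilityModel:
    """Hypothetical wrapped model that treats each input value as P(class 1)."""

    def predict_proba(self, X):
        return [[1.0 - p, p] for p in X]


wrapped = SketchThresholdClassifier(FixedProbabilityModel(), threshold=0.25)
predictions = wrapped.predict([0.125, 0.25, 0.75])
scores = wrapped.decision_function([0.125, 0.25, 0.75])
```

The middle input sits exactly at the threshold: its decision score is 0.0 and it is predicted as positive, which is the inclusive behavior described for the class.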

set_score_request(*, sample_weight: Union[bool, None, str] = '$UNCHANGED$') ThresholdClassifier

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). See the scikit-learn User Guide for how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in scikit-learn version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns

self (object) – The updated object.