ml4qc.surveymlclassifier module
Module for using machine learning classification techniques on survey data.
- class ml4qc.surveymlclassifier.SurveyMLClassifier(x_train_df: DataFrame, y_train_df: DataFrame, x_predict_df: Optional[DataFrame] = None, test_size: Optional[Union[float, int]] = None, cv_when_training: bool = False, control_features: Optional[list[str]] = None, n_jobs: int = -2, random_state: Optional[Union[int, RandomState]] = None, categorical_features: Optional[list[str]] = None, verbose: Optional[bool] = None, calibration_method: Optional[str] = None, threshold: str = 'default', threshold_value: Optional[float] = None)
Bases: SurveyML
Class for using machine learning classification techniques on survey data.
- __init__(x_train_df: DataFrame, y_train_df: DataFrame, x_predict_df: Optional[DataFrame] = None, test_size: Optional[Union[float, int]] = None, cv_when_training: bool = False, control_features: Optional[list[str]] = None, n_jobs: int = -2, random_state: Optional[Union[int, RandomState]] = None, categorical_features: Optional[list[str]] = None, verbose: Optional[bool] = None, calibration_method: Optional[str] = None, threshold: str = 'default', threshold_value: Optional[float] = None)
Initialize survey data for classification using machine learning techniques.
- Parameters
x_train_df (pd.DataFrame) – Features for training dataset
y_train_df (pd.DataFrame) – Target(s) for training dataset
x_predict_df (pd.DataFrame) – Prediction dataset (required unless test_size is used to take a test set from the training set)
test_size (Union[float, int]) – Float (0, 1) for proportion of training dataset to use for testing; int for number of training rows to use for testing; otherwise None to specify prediction set manually in x_predict_df
cv_when_training (bool) – True to cross-validate when training models
control_features (list[str]) – List of features that should be used as controls (to transform all other features into their residuals, once variation from these features is controlled out via OLS)
n_jobs (int) – Number of parallel jobs to run during cross-validation (-1 for as many jobs as CPUs, -2 to leave one CPU free, etc.)
random_state (Union[int, np.random.RandomState]) – Fixed random state for reproducible results, otherwise None for random execution
categorical_features (list[str]) – List of feature names to force to categorical type (“other”), regardless of how they auto-detect (e.g., for categorical features that might auto-classify as numeric), otherwise None
verbose (bool) – True to report verbose results with print() calls
calibration_method (str) – ‘isotonic’ or ‘sigmoid’ to perform probability calibration, otherwise None
threshold (str) – Threshold to use for decision boundary: ‘default’ to let classifiers default to 0.5; ‘optimal_f’ to automatically choose the threshold that maximizes the F score in the training set (defaults to F1 score, but you can specify a beta value to use in threshold_value); ‘optimal_j’ to automatically choose the threshold that maximizes Youden’s J statistic in the training set; ‘fixed’ to use the fixed threshold specified in threshold_value; ‘target’ to set the threshold so that threshold_value*100% of cases are classified as positive
threshold_value (float) – Value to use if threshold is ‘fixed’ or ‘target’ (should be value between 0 and 1); or, optionally, for optimal_f, the beta value to use for the F score
Note: Currently, only binary classification problems are supported.
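To make the ‘optimal_j’ and ‘target’ threshold options more concrete, the following is a minimal sketch of how such thresholds could be derived from training-set probabilities. It is not ml4qc's implementation: the function names youden_j_threshold and target_share_threshold are hypothetical, and the sketch assumes the inclusive rule that a probability at or above the threshold predicts positive.

```python
def youden_j_threshold(y_true, proba):
    """Return the observed probability that maximizes J = TPR - FPR.

    Assumption: predictions are positive when proba >= threshold (inclusive).
    """
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present to compute Youden's J")

    best_threshold, best_j = None, float("-inf")
    # Only observed probabilities can change the predictions, so they are the candidates.
    for candidate in sorted(set(proba)):
        tp = sum(1 for y, p in zip(y_true, proba) if y == 1 and p >= candidate)
        fp = sum(1 for y, p in zip(y_true, proba) if y == 0 and p >= candidate)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_threshold, best_j = candidate, j
    return best_threshold


def target_share_threshold(proba, share):
    """Return a threshold that classifies roughly share*100% of rows as positive."""
    if not 0 < share < 1:
        raise ValueError("share must be strictly between 0 and 1")
    ranked = sorted(proba, reverse=True)
    n_positive = max(1, round(share * len(ranked)))
    # The n-th highest probability becomes the (inclusive) cut-off.
    return ranked[n_positive - 1]


# Illustrative data only (not from any real survey).
y_train = [0, 0, 1, 1, 0, 1]
p_train = [0.1, 0.3, 0.6, 0.8, 0.2, 0.7]
j_threshold = youden_j_threshold(y_train, p_train)            # 0.6
share_threshold = target_share_threshold(p_train, share=0.5)  # 0.6
```

With tied probabilities, the ‘target’ sketch can classify slightly more than the requested share as positive, because every row at the cut-off value is included.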
- static build_nn_model(features: int, hidden_layers: int = 1, initial_units: int = 1, activation: str = 'relu', l2_regularization: bool = True, l2_factor: float = 0.001, include_dropout: bool = True, dropout_rate: float = 0.1, output_bias: Optional[float] = None) → Model
Build neural network model with fixed structure (each hidden layer with half as many units as the last).
- Parameters
features (int) – Number of features for input layer
hidden_layers (int) – Number of hidden layers to include
initial_units (int) – Number of units in initial hidden layer (each additional hidden layer will have half as many as the last)
activation (str) – Activation function to use in hidden layers (e.g., ‘relu’ or ‘sigmoid’)
l2_regularization (bool) – True to include L2 regularization
l2_factor (float) – L2 regularization factor to use, if including L2 regularization
include_dropout (bool) – True to include dropout layers (starting with the input layer)
dropout_rate (float) – Dropout rate to use, if including dropout layers
output_bias (float) – Output bias to initialize with, if any
- Returns
Model ready for fitting
- Return type
tf.keras.models.Model
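The halving rule for hidden layers can be illustrated with a small helper that lists the unit count per layer. This is a sketch only: hidden_layer_units is a hypothetical name, and the rounding behavior (integer division, never fewer than one unit) is an assumption that may differ from what build_nn_model() actually does.

```python
def hidden_layer_units(initial_units, hidden_layers):
    """List the unit count of each hidden layer, halving at every step.

    Assumption: halving uses integer division and never drops below one unit.
    """
    units = []
    current = initial_units
    for _ in range(hidden_layers):
        units.append(current)
        current = max(1, current // 2)
    return units


layer_sizes = hidden_layer_units(initial_units=64, hidden_layers=3)  # [64, 32, 16]
```

Under these assumptions, choosing initial_units as a power of two keeps every layer size exact.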
- cv_for_best_hyperparameters(classifier, search_params: dict, model_scoring: str = 'f1', n_iter: int = 100) → dict
Run cross-validation process to search for best hyperparameters.
- Parameters
classifier (Any) – Classifier to use for prediction (must be sklearn estimator)
search_params (dict) – Dictionary of search parameters
model_scoring (str) – Score to use for model evaluation (e.g., ‘f1’ or ‘neg_brier_score’)
n_iter (int) – Number of random CV iterations to attempt, during the search
- Returns
Best parameters found in search
- Return type
dict
- report_feature_importance(importance_array: ndarray)
Report feature importance.
- Parameters
importance_array (np.ndarray) – The appropriate importance array, depending on the classifier (note: use the fitted_estimator attribute to access the fitted model, in the case of calibration)
- report_prediction_results()
Report prediction results (call after run_prediction_model()).
- run_prediction_model(classifier, supports_cv: bool = True)
Execute a classification model.
- Parameters
classifier (Any) – Classifier to use for prediction (must be sklearn estimator)
supports_cv (bool) – False if the classifier doesn’t support cross-validation (with scores including ‘accuracy’, ‘precision’, ‘f1’, ‘roc_auc’, ‘neg_log_loss’, and ‘neg_brier_score’)
- Returns
Predicted classifications for the prediction set
- Return type
Any
In addition to the predictions that are returned, the following results can be found in member variables:
result_y_train_predicted - Predicted classifications for the training set
result_y_predict_predicted - Predicted classifications for the prediction set
result_y_predict_predicted_proba - Predicted probabilities for the prediction set
result_train_accuracy - Accuracy predicting within training set
result_train_precision - Precision predicting within training set
result_train_f1 - F1 score for predictions within training set
result_train_roc_auc - ROC AUC score for predictions within training set
result_predict_accuracy - Accuracy predicting within prediction set
result_predict_precision - Precision predicting within prediction set
result_predict_f1 - F1 score for predictions within prediction set
result_predict_roc_auc - ROC AUC score for predictions within prediction set
result_cv_scores - Cross-validation scores
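As a reminder of what the accuracy, precision, and F1 results measure, the sketch below computes them from true and predicted binary labels. It is an illustration under stated assumptions, not the library's code: binary_metrics is a hypothetical helper, labels are assumed to be 0 and 1, and ROC AUC is omitted because it requires predicted probabilities rather than labels.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, and F1 for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # Precision: share of predicted positives that are truly positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # F1: harmonic mean of precision and recall, written in count form.
        "f1": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0,
    }


# Illustrative labels only.
metrics = binary_metrics(y_true=[1, 1, 0, 0], y_pred=[1, 0, 0, 0])
```

Comparing the training-set and prediction-set versions of these scores is one way to spot overfitting.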
- class ml4qc.surveymlclassifier.ThresholdClassifier(classifier, threshold: float = 0.5)
Bases: BaseEstimator, ClassifierMixin
Wrapper for classifiers, to support custom decision threshold (which is inclusive: predicted probabilities at the threshold predict as positive).
Note: Currently, only binary classification problems are supported.
- __init__(classifier, threshold: float = 0.5)
Initialize classifier wrapper.
- Parameters
classifier (Any) – Classifier to wrap
threshold (float) – Probability threshold to use for binary classification
- decision_function(X)
Return decision scores (in this case, probabilities shifted down by the threshold so that negative values are predicted as 0 and positive values are predicted as 1).
- fit(X, y)
Fit classifier to data.
- predict(X)
Return binary predictions.
- predict_proba(X)
Return prediction probabilities.
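The relationship between the threshold, decision_function(), and predict() can be sketched with a small stand-in class. This is not the ThresholdClassifier source: ThresholdSketch and its positive_proba callable are hypothetical, and the sketch only mirrors the documented behavior (scores are probabilities shifted down by the threshold, and a probability exactly at the threshold predicts positive).

```python
class ThresholdSketch:
    """Illustrative stand-in for a threshold-wrapping classifier."""

    def __init__(self, positive_proba, threshold=0.5):
        # positive_proba: callable mapping rows to a list of P(class == 1).
        self.positive_proba = positive_proba
        self.threshold = threshold

    def decision_function(self, X):
        # Shift probabilities down by the threshold; zero marks the boundary.
        return [p - self.threshold for p in self.positive_proba(X)]

    def predict(self, X):
        # Inclusive rule: a probability equal to the threshold predicts 1.
        return [1 if p >= self.threshold else 0 for p in self.positive_proba(X)]


# For illustration, each "row" is already its own positive-class probability.
sketch = ThresholdSketch(positive_proba=lambda rows: list(rows), threshold=0.3)
labels = sketch.predict([0.2, 0.3, 0.9])           # [0, 1, 1]
scores = sketch.decision_function([0.2, 0.3, 0.9])
```

A score of exactly zero corresponds to a probability at the threshold, which the inclusive rule classifies as positive.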