ml4qc.surveymlclassifier module
Module for using machine learning classification techniques on survey data.
- class ml4qc.surveymlclassifier.SurveyMLClassifier(x_train_df: DataFrame, y_train_df: DataFrame, x_predict_df: Optional[DataFrame] = None, test_size: Optional[Union[float, int]] = None, cv_when_training: bool = False, control_features: Optional[list[str]] = None, n_jobs: int = -2, random_state: Optional[Union[int, RandomState]] = None, categorical_features: Optional[list[str]] = None, verbose: Optional[bool] = None, calibration_method: Optional[str] = None, threshold: str = 'default', threshold_value: Optional[float] = None)
Bases: SurveyML
Class for using machine learning classification techniques on survey data.
- __init__(x_train_df: DataFrame, y_train_df: DataFrame, x_predict_df: Optional[DataFrame] = None, test_size: Optional[Union[float, int]] = None, cv_when_training: bool = False, control_features: Optional[list[str]] = None, n_jobs: int = -2, random_state: Optional[Union[int, RandomState]] = None, categorical_features: Optional[list[str]] = None, verbose: Optional[bool] = None, calibration_method: Optional[str] = None, threshold: str = 'default', threshold_value: Optional[float] = None)
Initialize survey data for classification using machine learning techniques.
- Parameters
x_train_df (pd.DataFrame) – Features for training dataset
y_train_df (pd.DataFrame) – Target(s) for training dataset
x_predict_df (pd.DataFrame) – Prediction dataset (required unless test_size used to take test set from training set)
test_size (Union[float, int]) – Float (0, 1) for proportion of training dataset to use for testing; int for number of training rows to use for testing; otherwise None to specify prediction set manually in x_predict_df
cv_when_training (bool) – True to cross-validate when training models
control_features (list[str]) – List of features that should be used as controls (to transform all other features into their residuals, once variation from these features is controlled out via OLS)
n_jobs (int) – Number of parallel jobs to run during cross-validation (-1 for as many jobs as CPUs, -2 to leave one CPU free, etc.)
random_state (Union[int, np.random.RandomState]) – Fixed random state for reproducible results, otherwise None for random execution
categorical_features (list[str]) – List of feature names to force to categorical type (“other”), regardless of how they auto-detect (e.g., for categorical features that might auto-classify as numeric), otherwise None
verbose (bool) – True to report verbose results with print() calls
calibration_method (str) – ‘isotonic’ or ‘sigmoid’ to perform probability calibration, otherwise None
threshold (str) – Threshold to use for decision boundary: ‘default’ to let classifiers default to 0.5; ‘optimal_f’ to automatically choose based on what maximizes the F score in the training set (defaults to F1 score, but you can specify a beta value to use in threshold_value); ‘optimal_j’ to automatically choose based on what maximizes Youden’s J statistic in the training set; ‘fixed’ to set to a fixed threshold as specified in threshold_value; ‘target’ to set in order to classify threshold_value*100% as positive
threshold_value (float) – Value to use if threshold is ‘fixed’ or ‘target’ (should be value between 0 and 1); or, optionally, for optimal_f, the beta value to use for the F score
Note: Currently, only binary classification problems are supported.
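The threshold options above can be illustrated with a small, dependency-free sketch. It is not ml4qc's implementation: the function names youden_j_threshold and target_threshold are hypothetical, and the choice of candidate thresholds (the distinct predicted probabilities) is an assumption. It only shows what ‘optimal_j’ and ‘target’ mean for a binary problem, using an inclusive comparison at the threshold.

```python
from typing import Sequence


def youden_j_threshold(y_true: Sequence[int], y_proba: Sequence[float]) -> float:
    """Return the candidate threshold that maximizes Youden's J = TPR - FPR."""
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("Both classes must be present to compute Youden's J")

    best_threshold, best_j = 0.5, float("-inf")
    # Assumption: candidate thresholds are the distinct predicted probabilities
    for candidate in sorted(set(y_proba)):
        # Inclusive rule: probabilities at the threshold count as positive
        tp = sum(1 for y, p in zip(y_true, y_proba) if p >= candidate and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_proba) if p >= candidate and y == 0)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_threshold, best_j = candidate, j
    return best_threshold


def target_threshold(y_proba: Sequence[float], target_share: float) -> float:
    """Return a threshold that classifies roughly target_share of rows as positive."""
    ranked = sorted(y_proba, reverse=True)
    # Always flag at least one row; tied probabilities can push the share higher
    n_positive = max(1, round(target_share * len(ranked)))
    return ranked[n_positive - 1]


y_true = [0, 0, 0, 1, 0, 1, 1]
y_proba = [0.1, 0.2, 0.3, 0.35, 0.4, 0.8, 0.9]
print(youden_j_threshold(y_true, y_proba))  # 0.35
print(target_threshold(y_proba, 0.3))       # 0.8
```

In this toy data, a threshold of 0.35 captures all three positives at the cost of one false positive, which gives the highest J; a target share of 0.3 flags the two highest-probability rows.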
- static build_nn_model(features: int, hidden_layers: int = 1, initial_units: int = 1, activation: str = 'relu', l2_regularization: bool = True, l2_factor: float = 0.001, include_dropout: bool = True, dropout_rate: float = 0.1, output_bias: Optional[float] = None) → Model
Build neural network model with fixed structure (each hidden layer with half as many units as the last).
- Parameters
features (int) – Number of features for input layer
hidden_layers (int) – Number of hidden layers to include
initial_units (int) – Number of units in initial hidden layer (each additional hidden layer will have half as many as the last)
activation (str) – Activation function to use in hidden layers (e.g., ‘relu’ or ‘sigmoid’)
l2_regularization (bool) – True to include L2 regularization
l2_factor (float) – L2 regularization factor to use, if including L2 regularization
include_dropout (bool) – True to include dropout layers (starting with the input layer)
dropout_rate (float) – Dropout rate to use, if including dropout layers
output_bias (float) – Output bias to initialize with, if any
- Returns
Model ready for fitting
- Return type
tf.keras.models.Model
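As a rough illustration of the "half as many units as the last" structure, the sketch below computes the unit count for each hidden layer. The helper hidden_layer_units is hypothetical, and the rounding rule (integer division, never fewer than one unit) is an assumption; the library may round differently.

```python
def hidden_layer_units(hidden_layers: int, initial_units: int) -> list[int]:
    """List the number of units in each hidden layer, halving at every step."""
    units = []
    current = initial_units
    for _ in range(hidden_layers):
        units.append(current)
        # Assumption: halve with integer division and keep at least one unit
        current = max(1, current // 2)
    return units


print(hidden_layer_units(3, 64))  # [64, 32, 16]
print(hidden_layer_units(4, 4))   # [4, 2, 1, 1]
```

Under these assumptions, choosing initial_units as a power of two at least 2 ** (hidden_layers - 1) keeps every layer strictly smaller than the one before it.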
- cv_for_best_hyperparameters(classifier, search_params: dict, model_scoring: str = 'f1', n_iter: int = 100) → dict
Run cross-validation process to search for best hyperparameters.
- Parameters
classifier (Any) – Classifier to use for prediction (must be sklearn estimator)
search_params (dict) – Dictionary of search parameters
model_scoring (str) – Score to use for model evaluation (e.g., ‘f1’ or ‘neg_brier_score’)
n_iter (int) – Number of random CV iterations to attempt during the search
- Returns
Best parameters found in search
- Return type
dict
- report_feature_importance(importance_array: ndarray)
Report feature importance.
- Parameters
importance_array (np.ndarray) – The appropriate importance array, depending on the classifier (note: use the fitted_estimator attribute to access the fitted model, in the case of calibration)
- report_prediction_results()
Report prediction results (call after run_prediction_model()).
- run_prediction_model(classifier, supports_cv: bool = True)
Execute a classification model.
- Parameters
classifier (Any) – Classifier to use for prediction (must be sklearn estimator)
supports_cv (bool) – False if the classifier doesn’t support cross-validation (with scores including ‘accuracy’, ‘precision’, ‘f1’, ‘roc_auc’, ‘neg_log_loss’, and ‘neg_brier_score’)
- Returns
Predicted classifications for the prediction set
- Return type
Any
In addition to the predictions that are returned, the following results can be found in member variables:
result_y_train_predicted - Predicted classifications for the training set
result_y_predict_predicted - Predicted classifications for the prediction set
result_y_predict_predicted_proba - Predicted probabilities for the prediction set
result_train_accuracy - Accuracy predicting within training set
result_train_precision - Precision predicting within training set
result_train_f1 - F1 score for predictions within training set
result_train_roc_auc - ROC AUC score for predictions within training set
result_predict_accuracy - Accuracy predicting within prediction set
result_predict_precision - Precision predicting within prediction set
result_predict_f1 - F1 score for predictions within prediction set
result_predict_roc_auc - ROC AUC score for predictions within prediction set
result_cv_scores - Cross-validation scores
- class ml4qc.surveymlclassifier.ThresholdClassifier(classifier, threshold: float = 0.5)
Bases: BaseEstimator, ClassifierMixin
Wrapper for classifiers, to support custom decision threshold (which is inclusive: predicted probabilities at the threshold predict as positive).
Note: Currently, only binary classification problems are supported.
- __init__(classifier, threshold: float = 0.5)
Initialize classifier wrapper.
- Parameters
classifier (Any) – Classifier to wrap
threshold (float) – Probability threshold to use for binary classification
- decision_function(X)
Return decision scores (in this case, probabilities shifted down by the threshold so that negative values are predicted as 0 and positive values are predicted as 1).
- fit(X, y)
Fit classifier to data.
- predict(X)
Return binary predictions.
- predict_proba(X)
Return prediction probabilities.
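The relationship between decision_function(), predict(), and the inclusive threshold can be sketched without any dependencies. This is not the wrapper's code: shifted_scores and predict_from_scores are hypothetical helpers that take a plain list of positive-class probabilities, and they only mirror the behaviour described above.

```python
def shifted_scores(probabilities, threshold=0.5):
    """Shift probabilities down by the threshold, as decision_function() is described."""
    return [p - threshold for p in probabilities]


def predict_from_scores(scores):
    """Map shifted scores to binary labels."""
    # Inclusive threshold: a score of exactly 0 (probability == threshold) is positive
    return [1 if s >= 0 else 0 for s in scores]


probabilities = [0.25, 0.5, 0.75]
scores = shifted_scores(probabilities, threshold=0.5)
print(scores)                       # [-0.25, 0.0, 0.25]
print(predict_from_scores(scores))  # [0, 1, 1]
```

The middle value shows the inclusive rule: a probability equal to the threshold yields a score of zero and is still predicted as positive.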
- set_score_request(*, sample_weight: Union[bool, None, str] = '$UNCHANGED$') → ThresholdClassifier
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
- Returns
The updated object.
- Return type
object