ml4qc.surveyml module

Base module for using machine learning techniques on survey data.

class ml4qc.surveyml.OLSControlTransformer(fit_for_each_transform: bool = False)

Bases: BaseEstimator, TransformerMixin

OLS control transformer, for controlling out expected variation and transforming into residuals.

__init__(fit_for_each_transform: bool = False)

Initialize OLS control transformer.

Parameters

fit_for_each_transform (bool) – True to fit model for each call to transform()

fit(X, y=None)

Fit OLS control model by transforming features into their residuals, after controlling for expected variation.

Parameters
  • X (Any) – Features to transform

  • y (Any) – (Unused)

Returns

Estimator instance for transformation (self)

Return type

Any

set_control_features(x_controls_df: DataFrame)

Set control features for OLS control transformer (must call before calling fit() or transform()).

Parameters

x_controls_df (pd.DataFrame) – Pandas DataFrame with control features for each row

transform(X, y=None)

Transform features into their residuals, after controlling for expected variation.

Parameters
  • X (Any) – Features to transform

  • y (Any) – (Unused)

Returns

Transformed features

Return type

Any
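
The residual transformation described above can be illustrated with plain NumPy. This is only a sketch of the general technique (regress each feature on the controls, keep what is left over), not the class's actual implementation; the helper name ols_residuals and the inclusion of an intercept column are assumptions made for illustration.

```python
import numpy as np


def ols_residuals(features: np.ndarray, controls: np.ndarray) -> np.ndarray:
    """Hypothetical helper: the part of each feature not explained by the controls."""
    # Assumption: include an intercept so the controls also absorb each feature's mean.
    design = np.column_stack([np.ones(len(controls)), controls])
    # Least-squares coefficients, one column of coefficients per feature.
    coefs, *_ = np.linalg.lstsq(design, features, rcond=None)
    # Residual = observed value minus the value predicted from the controls.
    return features - design @ coefs


# Synthetic data: three features that depend linearly on two controls, plus noise.
rng = np.random.default_rng(0)
controls = rng.normal(size=(200, 2))
noise = rng.normal(size=(200, 3))
features = controls @ np.array([[1.0, 0.5, -2.0], [0.0, 3.0, 1.0]]) + noise
residuals = ols_residuals(features, controls)
```

By construction, the residuals are uncorrelated with the control features, which is what "controlling out expected variation" amounts to in this sketch.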

class ml4qc.surveyml.SurveyML(x_train_df: DataFrame, y_train_df: DataFrame, x_predict_df: Optional[DataFrame] = None, test_size: Optional[Union[float, int]] = None, cv_when_training: bool = False, control_features: Optional[list[str]] = None, n_jobs: int = -2, random_state: Optional[Union[int, RandomState]] = None, categorical_features: Optional[list[str]] = None, verbose: Optional[bool] = None)

Bases: object

Base class for using machine learning techniques on survey data.

__init__(x_train_df: DataFrame, y_train_df: DataFrame, x_predict_df: Optional[DataFrame] = None, test_size: Optional[Union[float, int]] = None, cv_when_training: bool = False, control_features: Optional[list[str]] = None, n_jobs: int = -2, random_state: Optional[Union[int, RandomState]] = None, categorical_features: Optional[list[str]] = None, verbose: Optional[bool] = None)

Initialize survey data for machine learning.

Parameters
  • x_train_df (pd.DataFrame) – Features for training dataset

  • y_train_df (pd.DataFrame) – Target(s) for training dataset

  • x_predict_df (pd.DataFrame) – Prediction dataset (required unless test_size is used to take a test set from the training set)

  • test_size (Union[float, int]) – Float (0, 1) for proportion of training dataset to use for testing; int for number of training rows to use for testing; otherwise None to specify prediction set manually in x_predict_df

  • cv_when_training (bool) – True to cross-validate when training models

  • control_features (list[str]) – List of features that should be used as controls (to transform all other features into their residuals, once variation from these features is controlled out via OLS)

  • n_jobs (int) – Number of parallel jobs to run during cross-validation (-1 for as many jobs as CPUs, -2 to leave one CPU free)

  • random_state (Union[int, RandomState]) – Fixed random state for reproducible results, otherwise None for random execution

  • categorical_features (list[str]) – List of feature names to force to categorical type (“other”), regardless of how they auto-detect (e.g., for categorical features that might auto-classify as numeric), otherwise None

  • verbose (bool) – True to report verbose results with print() calls

benchmark_by_category(category_df: DataFrame, benchmark_categories: list[str], method: str = 'knn', n_nearest_neighbors: int = 10, reg_strength: float = 0.0001, variance_to_retain: float = 1.0) DataFrame

Benchmark by category (e.g., by enumerator) using the full dataset (training+prediction together). Uses classification method to score observations as more or less like the identified category or categories to benchmark against (e.g., one or more star enumerators).

Parameters
  • category_df (pd.DataFrame) – Category column for benchmarking, in a Pandas DataFrame indexed with the same index as the training and prediction data used to initialize the object

  • benchmark_categories (list[str]) – List of specific categories to benchmark against (e.g., one or more star enumerator IDs, if benchmarking by enumerator)

  • method (str) – Method to use for scoring (‘knn’ for K nearest neighbors, ‘logistic’ for logistic regression)

  • n_nearest_neighbors (int) – If method is ‘knn’, number of nearest neighbors to consider (not including self); the larger this is, the more it skews toward categories with more observations in the dataset

  • reg_strength (float) – If method is ‘logistic’, C value to use for L2 regularization (a larger C means weaker regularization); given that we’re classifying within the training set, a larger value will tend toward a perfect fit

  • variance_to_retain (float) – Proportion of variance to retain: a value between 0 and 1 to use PCA for dimensionality reduction, or 1.0 to use all features

Returns

DataFrame with category-specific scores, sorted highest first

Return type

pd.DataFrame
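
One way to picture the ‘knn’ scoring idea is to score each observation by the share of its nearest neighbors (excluding itself) that belong to a benchmark category, then average those scores by category. The sketch below does this with NumPy only; the function name knn_benchmark_scores, the Euclidean distance, and the per-category mean are assumptions for illustration and may differ from what benchmark_by_category() computes.

```python
import numpy as np


def knn_benchmark_scores(X, categories, benchmark_categories, k=10):
    """Hypothetical helper: share of each row's k nearest neighbors
    (excluding the row itself) that belong to a benchmark category."""
    X = np.asarray(X, dtype=float)
    is_benchmark = np.isin(np.asarray(categories), benchmark_categories)
    # Pairwise Euclidean distances between all rows.
    diffs = X[:, None, :] - X[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    # Each row is its own closest match, so push it out of contention.
    np.fill_diagonal(distances, np.inf)
    neighbors = np.argsort(distances, axis=1)[:, :k]
    return is_benchmark[neighbors].mean(axis=1)


# Toy data: two well-separated groups, one per enumerator.
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
categories = np.array(["a", "a", "a", "b", "b", "b"])
scores = knn_benchmark_scores(X, categories, benchmark_categories=["a"], k=2)

# Average score per category, sorted highest first.
by_category = {c: float(scores[categories == c].mean()) for c in sorted(set(categories))}
ranked = sorted(by_category.items(), key=lambda item: item[1], reverse=True)
```

In this toy case, every neighbor of an "a" row is another "a" row, so category "a" scores 1.0 and category "b" scores 0.0.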

classify_by_category(category_df: DataFrame, method: str = 'knn', n_nearest_neighbors: int = 10, reg_strength: float = 0.0001, variance_to_retain: float = 1.0) DataFrame

Classify by category (e.g., by enumerator) using the full dataset (training+prediction together).

Parameters
  • category_df (pd.DataFrame) – Category column for classification, in a Pandas DataFrame indexed with the same index as the training and prediction data used to initialize the object

  • method (str) – Method to use for classification (‘knn’ for K nearest neighbors, ‘logistic’ for logistic regression)

  • n_nearest_neighbors (int) – If method is ‘knn’, number of nearest neighbors to consider (not including self); the larger this is, the more it skews toward categories with more observations in the dataset

  • reg_strength (float) – If method is ‘logistic’, C value to use for L2 regularization (a larger C means weaker regularization); given that we’re classifying within the training set, a larger value will tend toward a perfect fit

  • variance_to_retain (float) – Proportion of variance to retain: a value between 0 and 1 to use PCA for dimensionality reduction, or 1.0 to use all features

Returns

DataFrame with category predictions for each observation in the dataset

Return type

pd.DataFrame

static columns_by_type(df: DataFrame, force_to_other: Optional[list[str]] = None) dict

Get DataFrame columns by data type.

Parameters
  • df (pd.DataFrame) – DataFrame with columns to categorize

  • force_to_other (list[str]) – List of column names to force to “other” (e.g., for categorical columns that might auto-classify as numeric), otherwise None

Returns

Dictionary with six lists of column names: “numeric”, “numeric_binary”, “numeric_unit_interval”, “numeric_other”, “datetime”, “other”

Return type

dict
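
The six-list return value can be pictured with a small pandas sketch. The classification rules below are assumptions inferred from the key names (for example, that "numeric" lists every numeric column while the three "numeric_*" lists partition them); the real columns_by_type() may draw the lines differently. The function name sketch_columns_by_type and the example columns are hypothetical.

```python
import pandas as pd
from pandas.api import types as pdt


def sketch_columns_by_type(df, force_to_other=None):
    """Hypothetical stand-in that groups column names by detected type."""
    forced = set(force_to_other or [])
    keys = ("numeric", "numeric_binary", "numeric_unit_interval",
            "numeric_other", "datetime", "other")
    result = {key: [] for key in keys}
    for col in df.columns:
        series = df[col]
        if col in forced:
            result["other"].append(col)
        elif pdt.is_datetime64_any_dtype(series):
            result["datetime"].append(col)
        elif pdt.is_numeric_dtype(series):
            result["numeric"].append(col)
            values = series.dropna()
            if values.isin([0, 1]).all():
                result["numeric_binary"].append(col)       # only 0s and 1s
            elif values.between(0, 1).all():
                result["numeric_unit_interval"].append(col)  # all within [0, 1]
            else:
                result["numeric_other"].append(col)
        else:
            result["other"].append(col)
    return result


df = pd.DataFrame({
    "consented": [1, 0, 1],
    "share_complete": [0.2, 0.95, 0.5],
    "duration_sec": [310, 1250, 780],
    "region_code": [1, 2, 3],
    "started": pd.to_datetime(["2023-01-05", "2023-01-06", "2023-01-07"]),
    "enumerator": ["e1", "e2", "e1"],
})
# region_code looks numeric but is really a category, so force it to "other".
columns = sketch_columns_by_type(df, force_to_other=["region_code"])
```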

features_by_type(force_to_other: Optional[list[str]] = None) dict

Get features by data type.

Parameters

force_to_other (list[str]) – List of feature names to force to “other” (e.g., for categorical features that might auto-classify as numeric), otherwise None

Returns

Dictionary with six lists of column names: “numeric”, “numeric_binary”, “numeric_unit_interval”, “numeric_other”, “datetime”, “other”

Return type

dict

identify_clusters(min_clusters: int = 2, max_clusters: int = 10, constrain_cluster_size: bool = False, variance_to_retain: float = 1.0, separate_outliers: bool = True) DataFrame

Identify clusters in the full dataset (training+prediction together).

Parameters
  • min_clusters (int) – Minimum number of clusters

  • max_clusters (int) – Maximum number of clusters

  • constrain_cluster_size (bool) – True to constrain cluster size such that clusters are at least 1/2 the average size and at most 2x the average size

  • variance_to_retain (float) – Proportion of variance to retain: a value between 0 and 1 to use PCA for dimensionality reduction, or 1.0 to use all features

  • separate_outliers (bool) – True to separate outliers into their own cluster (can help to better define other clusters)

Returns

DataFrame with a cluster column that identifies clusters, indexed with the same index as the training and/or prediction data

Return type

pd.DataFrame
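
The size rule behind constrain_cluster_size (each cluster at least 1/2 and at most 2x the average size) can be checked for any set of cluster labels as follows. This sketch only tests the stated bounds; how identify_clusters() enforces them internally is not described here, and the function name cluster_sizes_within_bounds is hypothetical.

```python
from collections import Counter


def cluster_sizes_within_bounds(labels):
    """Hypothetical check: True if every cluster holds between half and
    twice the average number of observations per cluster."""
    sizes = Counter(labels)
    if not sizes:
        return True  # nothing to check on an empty labeling
    average = len(labels) / len(sizes)
    return all(average / 2 <= size <= 2 * average for size in sizes.values())


balanced = [0] * 10 + [1] * 10 + [2] * 10   # average 10, every size is 10
lopsided = [0] * 27 + [1] * 2 + [2] * 1     # average 10, sizes 27, 2, and 1

balanced_ok = cluster_sizes_within_bounds(balanced)
lopsided_ok = cluster_sizes_within_bounds(lopsided)
```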

identify_outliers(contamination: Optional[float] = None) DataFrame

Identify outliers in the full dataset (training+prediction together).

Parameters

contamination (float) – Proportion (0, 1) of the dataset that should be considered outliers, or None for auto

Returns

DataFrame with an is_outlier column that is True for outliers and False otherwise

Return type

pd.DataFrame

static inverse_distance_ignoring_closest(distances)

Distance weighting function that sets the weight for the closest distance to 0.0 and otherwise uses the inverse of each distance. Useful for cases where you’re using nearest-neighbor methods but predicting from the training set (in which the closest match will be yourself).

Parameters

distances (Any) – Array of distances (assumed 1D or 2D)

Returns

Array of weights with 0.0 for the smallest distance in each set, otherwise inverses of distances (same shape as the distances array passed in)

Return type

Any
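
A minimal NumPy sketch of this weighting scheme is shown below. It follows the description above rather than the library's source: the function name inverse_distance_ignoring_closest_sketch is hypothetical, and the handling of zero distances that are not the closest in their row (they get weight 0.0 here to avoid division by zero) is an assumption.

```python
import numpy as np


def inverse_distance_ignoring_closest_sketch(distances):
    """Hypothetical re-implementation: inverse-distance weights with the
    closest neighbor in each row given a weight of 0.0."""
    distances = np.asarray(distances, dtype=float)
    rows = np.atleast_2d(distances)  # treat 1D input as a single row
    weights = np.zeros_like(rows)
    # Invert only positive distances; zero distances keep weight 0.0 (assumption).
    np.divide(1.0, rows, out=weights, where=rows > 0)
    # Zero out the weight of the closest neighbor in each row.
    closest = np.argmin(rows, axis=1)
    weights[np.arange(rows.shape[0]), closest] = 0.0
    # Return the same shape that was passed in.
    return weights.reshape(distances.shape)


weights_2d = inverse_distance_ignoring_closest_sketch([[0.0, 2.0, 4.0],
                                                       [0.5, 0.25, 1.0]])
weights_1d = inverse_distance_ignoring_closest_sketch([0.0, 2.0])
```

A function with this shape contract (an array of distances in, an array of weights of the same shape out) is what scikit-learn's nearest-neighbor estimators accept as a callable weights argument.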

preprocess_for_prediction(pca: Optional[float] = None, custom_pipeline: Optional[Pipeline] = None)

Preprocess data for prediction.

Parameters
  • pca (float) – If not None, float between 0 and 1 for amount of variance to retain via PCA dimensionality reduction

  • custom_pipeline (Pipeline) – Custom preprocessing pipeline, if any (overrides pca parameter)
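
To make the variance-retention idea behind pca (and the variance_to_retain parameters above) concrete, the following NumPy sketch finds how many principal components are needed to reach a target share of total variance. It illustrates the general technique only; the function name components_for_variance is hypothetical, and the library's own preprocessing pipeline may scale or encode features before any PCA step.

```python
import numpy as np


def components_for_variance(X, variance_to_retain):
    """Hypothetical helper: smallest number of principal components whose
    cumulative explained variance reaches the requested proportion."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional to the
    # variance explained by each principal component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained = singular_values ** 2
    cumulative = np.cumsum(explained) / explained.sum()
    # First position where the cumulative share meets the target.
    count = int(np.searchsorted(cumulative, variance_to_retain)) + 1
    return min(count, len(cumulative))


# Three independent features with very different spreads (std 10, 1, 0.1).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([10.0, 1.0, 0.1])

n_for_90 = components_for_variance(X, 0.90)
n_for_995 = components_for_variance(X, 0.995)
```

Because the first feature carries roughly 99% of the total variance in this synthetic data, one component is enough to retain 90%, while retaining 99.5% requires a second.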