ml4qc.surveyml module

Base module for using machine learning techniques on survey data.

class ml4qc.surveyml.OLSControlTransformer(fit_for_each_transform: bool = False)

Bases: BaseEstimator, TransformerMixin

OLS control transformer that controls out expected variation and transforms features into their residuals.

__init__(fit_for_each_transform: bool = False)

Initialize OLS control transformer.

Parameters: fit_for_each_transform (bool) – True to refit the model on each call to transform()

fit(X, y=None)

Fit the OLS control model that is used to transform features into their residuals, after controlling for expected variation.

Parameters

X (Any) – Features to transform
y (Any) – (Unused)

Returns

Estimator instance for transformation (self)

Return type

Any

set_control_features(x_controls_df: DataFrame)

Set control features for the OLS control transformer (must be called before fit() or transform()).

Parameters: x_controls_df (pd.DataFrame) – Pandas DataFrame with control features for each row

transform(X, y=None)

Transform features into their residuals, after controlling for expected variation.

Parameters

X (Any) – Features to transform
y (Any) – (Unused)

Returns

Transformed features

Return type

Any
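
To make "transforming into residuals" concrete, here is a minimal sketch of OLS residualization for one feature and one control, with an intercept. It is not the OLSControlTransformer implementation: the function name `residualize` and the sample data are hypothetical, and the real transformer works on arrays and DataFrames with any number of control features.

```python
from statistics import fmean


def residualize(feature, control):
    """Return residuals of `feature` after regressing it on `control` (with intercept)."""
    mean_x = fmean(control)
    mean_y = fmean(feature)
    # Closed-form OLS slope for a single regressor: cov(x, y) / var(x)
    sxx = sum((x - mean_x) ** 2 for x in control)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(control, feature))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    # Residual = observed value minus the value predicted from the control
    return [y - (intercept + slope * x) for x, y in zip(control, feature)]


# Hypothetical data: interview duration partly explained by household size
household_size = [1, 2, 3, 4, 5]
duration = [12.0, 14.5, 15.5, 19.0, 20.0]
residuals = residualize(duration, household_size)
```

The residuals sum to zero and are uncorrelated with the control, so whatever variation remains is the part the control does not explain.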

class ml4qc.surveyml.SurveyML(x_train_df: DataFrame, y_train_df: DataFrame, x_predict_df: Optional[DataFrame] = None, test_size: Optional[Union[float, int]] = None, cv_when_training: bool = False, control_features: Optional[list[str]] = None, n_jobs: int = -2, random_state: Optional[Union[int, RandomState]] = None, categorical_features: Optional[list[str]] = None, verbose: Optional[bool] = None)

Bases: object

Base class for using machine learning techniques on survey data.

__init__(x_train_df: DataFrame, y_train_df: DataFrame, x_predict_df: Optional[DataFrame] = None, test_size: Optional[Union[float, int]] = None, cv_when_training: bool = False, control_features: Optional[list[str]] = None, n_jobs: int = -2, random_state: Optional[Union[int, RandomState]] = None, categorical_features: Optional[list[str]] = None, verbose: Optional[bool] = None)

Initialize survey data for machine learning.

Parameters

x_train_df (pd.DataFrame) – Features for training dataset
y_train_df (pd.DataFrame) – Target(s) for training dataset
x_predict_df (pd.DataFrame) – Prediction dataset (required unless test_size is used to take a test set from the training set)
test_size (Union[float, int]) – Float in (0, 1) for the proportion of the training dataset to use for testing; int for the number of training rows to use for testing; otherwise None to specify the prediction set manually in x_predict_df
cv_when_training (bool) – True to cross-validate when training models
control_features (list[str]) – List of features that should be used as controls (to transform all other features into their residuals, once variation from these features is controlled out via OLS)
n_jobs (int) – Number of parallel jobs to run during cross-validation (-1 for as many jobs as CPUs, -2 to leave one CPU free)
random_state (Union[int, RandomState]) – Fixed random state for reproducible results, otherwise None for random execution
categorical_features (list[str]) – List of feature names to force to categorical type (“other”), regardless of how they auto-detect (e.g., for categorical features that might auto-classify as numeric), otherwise None
verbose (bool) – True to report verbose results with print() calls
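
The float-versus-int meaning of test_size can be sketched as follows. The helper `resolve_test_rows` is hypothetical and not part of ml4qc; in particular, rounding a fractional row count up is an assumption of this sketch, and the library may round differently.

```python
import math


def resolve_test_rows(n_train_rows, test_size):
    """Return how many training rows would be held out for testing, or None."""
    if test_size is None:
        # No hold-out: the prediction set is expected in x_predict_df
        return None
    if isinstance(test_size, float):
        if not 0.0 < test_size < 1.0:
            raise ValueError("float test_size must be in (0, 1)")
        # Assumption of this sketch: round a fractional row count up
        return math.ceil(test_size * n_train_rows)
    if not 0 < test_size < n_train_rows:
        raise ValueError("int test_size must be between 1 and n_train_rows - 1")
    return test_size


quarter = resolve_test_rows(200, 0.25)   # proportion of the training rows
fixed = resolve_test_rows(200, 40)       # absolute number of training rows
manual = resolve_test_rows(200, None)    # prediction set supplied separately
```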

benchmark_by_category(category_df: DataFrame, benchmark_categories: list[str], method: str = 'knn', n_nearest_neighbors: int = 10, reg_strength: float = 0.0001, variance_to_retain: float = 1.0) → DataFrame

Benchmark by category (e.g., by enumerator) using the full dataset (training+prediction together). Uses a classification method to score observations as more or less like the category or categories being benchmarked against (e.g., one or more star enumerators).

Parameters

category_df (pd.DataFrame) – Category column for benchmarking, in a Pandas DataFrame indexed with the same index as the training and prediction data used to initialize the object
benchmark_categories (list[str]) – List of specific categories to benchmark against (e.g., one or more star enumerator IDs, if benchmarking by enumerator)
method (str) – Method to use for scoring (‘knn’ for K nearest neighbors, ‘logistic’ for logistic regression)
n_nearest_neighbors (int) – If method is ‘knn’, number of nearest neighbors to consider (not including self); the larger this is, the more it skews toward categories with more observations in the dataset
reg_strength (float) – If method is ‘logistic’, C value to use for L2 regularization (a larger C means weaker regularization); given that we’re classifying within the training set, a larger value will tend toward a perfect fit
variance_to_retain (float) – Proportion of variance to retain: a value between 0 and 1 uses PCA for dimensionality reduction, and 1.0 uses all features

Returns

DataFrame with category-specific scores, sorted highest first

Return type

pd.DataFrame

classify_by_category(category_df: DataFrame, method: str = 'knn', n_nearest_neighbors: int = 10, reg_strength: float = 0.0001, variance_to_retain: float = 1.0) → DataFrame

Classify by category (e.g., by enumerator) using the full dataset (training+prediction together).

Parameters

category_df (pd.DataFrame) – Category column for classification, in a Pandas DataFrame indexed with the same index as the training and prediction data used to initialize the object
method (str) – Method to use for classification (‘knn’ for K nearest neighbors, ‘logistic’ for logistic regression)
n_nearest_neighbors (int) – If method is ‘knn’, number of nearest neighbors to consider (not including self); the larger this is, the more it skews toward categories with more observations in the dataset
reg_strength (float) – If method is ‘logistic’, C value to use for L2 regularization (a larger C means weaker regularization); given that we’re classifying within the training set, a larger value will tend toward a perfect fit
variance_to_retain (float) – Proportion of variance to retain: a value between 0 and 1 uses PCA for dimensionality reduction, and 1.0 uses all features

Returns

DataFrame with category predictions for each observation in the dataset

Return type

pd.DataFrame

static columns_by_type(df: DataFrame, force_to_other: Optional[list[str]] = None) → dict

Get DataFrame columns by data type.

Parameters

df (pd.DataFrame) – DataFrame with columns to categorize
force_to_other (list[str]) – List of column names to force to “other” (e.g., for categorical columns that might auto-classify as numeric), otherwise None

Returns

Dictionary with six lists of column names: “numeric”, “numeric_binary”, “numeric_unit_interval”, “numeric_other”, “datetime”, “other”

Return type

dict
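
The shape of the returned dictionary can be illustrated with a plain-Python sketch. The bucketing rules here are assumptions inferred from the key names, not the library's actual detection logic: "numeric" is taken to hold every numeric column, with the three "numeric_*" lists partitioning it into 0/1-only columns, columns whose values all lie in [0, 1], and everything else. The function name and sample columns are hypothetical.

```python
from datetime import datetime

KEYS = ("numeric", "numeric_binary", "numeric_unit_interval",
        "numeric_other", "datetime", "other")


def columns_by_type_sketch(columns, force_to_other=None):
    """Group column names by inferred type; `columns` maps name -> list of values."""
    forced = set(force_to_other or [])
    result = {key: [] for key in KEYS}
    for name, values in columns.items():
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        if name in forced:
            # Forced columns skip auto-detection entirely
            result["other"].append(name)
        elif is_numeric:
            result["numeric"].append(name)
            if set(values) <= {0, 1}:
                result["numeric_binary"].append(name)
            elif all(0 <= v <= 1 for v in values):
                result["numeric_unit_interval"].append(name)
            else:
                result["numeric_other"].append(name)
        elif all(isinstance(v, datetime) for v in values):
            result["datetime"].append(name)
        else:
            result["other"].append(name)
    return result


# Hypothetical survey columns; region_code is numeric but really categorical
columns = {
    "consented": [1, 0, 1],
    "share_skipped": [0.1, 0.4, 0.0],
    "age": [34, 51, 27],
    "region_code": [1, 2, 3],
    "started": [datetime(2023, 1, 5), datetime(2023, 1, 6), datetime(2023, 1, 6)],
    "enumerator": ["a", "b", "a"],
}
by_type = columns_by_type_sketch(columns, force_to_other=["region_code"])
```

Forcing region_code to "other" is the case the force_to_other parameter is described as covering: a categorical code that would otherwise be detected as numeric.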

features_by_type(force_to_other: Optional[list[str]] = None) → dict

Get features by data type.

Parameters: force_to_other (list[str]) – List of feature names to force to “other” (e.g., for categorical features that might auto-classify as numeric), otherwise None
Returns: Dictionary with six lists of column names: “numeric”, “numeric_binary”, “numeric_unit_interval”, “numeric_other”, “datetime”, “other”
Return type: dict

identify_clusters(min_clusters: int = 2, max_clusters: int = 10, constrain_cluster_size: bool = False, variance_to_retain: float = 1.0, separate_outliers: bool = True) → DataFrame

Identify clusters in the full dataset (training+prediction together).

Parameters

min_clusters (int) – Minimum number of clusters
max_clusters (int) – Maximum number of clusters
constrain_cluster_size (bool) – True to constrain cluster size such that clusters are at least 1/2 the average size and at most 2x the average size
variance_to_retain (float) – Proportion of variance to retain: a value between 0 and 1 uses PCA for dimensionality reduction, and 1.0 uses all features
separate_outliers (bool) – True to separate outliers into their own cluster (can help to better define other clusters)

Returns

DataFrame with a cluster column that identifies clusters, indexed with the same index as the training and/or prediction data

Return type

pd.DataFrame
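
The size constraint described for constrain_cluster_size is simple arithmetic, sketched below. The helper names are hypothetical, and the sketch only checks a finished set of cluster sizes against the bounds; it says nothing about how identify_clusters enforces them while clustering.

```python
def cluster_size_bounds(n_observations, n_clusters):
    """Return (minimum, maximum) allowed cluster size: 1/2 and 2x the average."""
    average = n_observations / n_clusters
    return average / 2, average * 2


def sizes_within_bounds(cluster_sizes):
    """Check whether every cluster size falls inside the allowed range."""
    lower, upper = cluster_size_bounds(sum(cluster_sizes), len(cluster_sizes))
    return all(lower <= size <= upper for size in cluster_sizes)


bounds = cluster_size_bounds(300, 3)          # average size is 100
balanced = sizes_within_bounds([120, 100, 80])
lopsided = sizes_within_bounds([250, 40, 10])
```

With 300 observations in 3 clusters the allowed range is 50 to 200 rows per cluster, so the second split fails because two clusters are too small and one is too large.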

identify_outliers(contamination: Optional[float] = None) → DataFrame

Identify outliers in the full dataset (training+prediction together).

Parameters: contamination (float) – Proportion in (0, 1) of the dataset that should be considered outliers, or None for auto
Returns: DataFrame with an is_outlier column that is True for outliers and False otherwise
Return type: pd.DataFrame

static inverse_distance_ignoring_closest(distances)

Distance weighting function that assigns a weight of 0.0 to the closest distance and otherwise weights by inverse distance. Useful for cases where you’re using nearest-neighbor methods but predicting from the training set (in which the closest match will be yourself).

Parameters: distances (Any) – Array of distances (assumed 1D or 2D)
Returns: Array of weights with 0.0 for the smallest distance in each set, otherwise inverses of the distances (same shape as the distances array passed in)
Return type: Any
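
The weighting rule can be sketched in plain Python using nested lists in place of arrays. This is an illustration of the described behavior under stated assumptions, not the library's code: the function name is hypothetical, ties for the smallest distance zero out only the first occurrence, and a zero distance that is not the closest entry is given an infinite weight.

```python
def inverse_distance_ignoring_closest_sketch(distances):
    """Weight 0.0 for the closest entry in each set, 1/distance for the rest."""
    # 2D input: one row of neighbor distances per query point
    if distances and isinstance(distances[0], (list, tuple)):
        return [inverse_distance_ignoring_closest_sketch(row) for row in distances]
    if not distances:
        return []
    # Position of the smallest distance (the observation matching itself)
    closest = min(range(len(distances)), key=distances.__getitem__)
    return [
        0.0 if i == closest else (1.0 / d if d > 0 else float("inf"))
        for i, d in enumerate(distances)
    ]


weights_1d = inverse_distance_ignoring_closest_sketch([0.0, 2.0, 4.0])
weights_2d = inverse_distance_ignoring_closest_sketch([[0.0, 1.0], [0.5, 0.0]])
```

Zeroing the closest neighbor keeps an observation from voting for its own label when predictions are made on the same rows the model was fit on.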

preprocess_for_prediction(pca: Optional[float] = None, custom_pipeline: Optional[Pipeline] = None)

Preprocess data for prediction.

Parameters

pca (float) – If not None, float between 0 and 1 for amount of variance to retain via PCA dimensionality reduction
custom_pipeline (Pipeline) – Custom preprocessing pipeline, if any (overrides pca parameter)
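
What "amount of variance to retain" means for PCA can be sketched as choosing the smallest number of leading components whose explained-variance ratios add up to the requested proportion. The function name and the ratios below are hypothetical, and the sketch assumes the ratios are already sorted from largest to smallest, as PCA produces them; it does not show how preprocess_for_prediction builds its pipeline.

```python
from itertools import accumulate


def components_to_retain(explained_variance_ratios, variance_to_retain):
    """Return the number of leading components needed to reach the target variance."""
    if variance_to_retain >= 1.0:
        # 1.0 means no reduction: keep every component
        return len(explained_variance_ratios)
    running_totals = accumulate(explained_variance_ratios)
    for count, cumulative in enumerate(running_totals, start=1):
        if cumulative >= variance_to_retain:
            return count
    return len(explained_variance_ratios)


# Hypothetical explained-variance ratios for four components
ratios = [0.6, 0.25, 0.1, 0.05]
keep_80 = components_to_retain(ratios, 0.8)
keep_90 = components_to_retain(ratios, 0.9)
keep_all = components_to_retain(ratios, 1.0)
```

Here two components already explain 85% of the variance, so a target of 0.8 keeps two of the four, while a target of 0.9 needs a third.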