ml4qc.surveyml module
Base module for using machine learning techniques on survey data.
- class ml4qc.surveyml.OLSControlTransformer(fit_for_each_transform: bool = False)
Bases: BaseEstimator, TransformerMixin
OLS control transformer, for controlling out expected variation and transforming features into residuals.
- __init__(fit_for_each_transform: bool = False)
Initialize OLS control transformer.
- Parameters
fit_for_each_transform (bool) – True to fit model for each call to transform()
- fit(X, y=None)
Fit the OLS control model that is used to transform features into their residuals, after controlling for expected variation.
- Parameters
X (Any) – Features to transform
y (Any) – (Unused)
- Returns
Estimator instance for transformation (self)
- Return type
Any
- set_control_features(x_controls_df: DataFrame)
Set control features for OLS control transformer (must call before calling fit() or transform()).
- Parameters
x_controls_df (pd.DataFrame) – Pandas DataFrame with control features for each row
- transform(X, y=None)
Transform features into their residuals, after controlling for expected variation.
- Parameters
X (Any) – Features to transform
y (Any) – (Unused)
- Returns
Transformed features
- Return type
Any
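The idea of "controlling out" variation can be illustrated with a minimal sketch. This is not the transformer's implementation: it assumes a single control variable and plain Python lists, and the function name `residualize` is hypothetical. Each feature value is replaced by its residual from an OLS fit of the feature on the control.

```python
def residualize(feature, control):
    """Return residuals of `feature` after an OLS fit on one `control`.

    Hypothetical helper for illustration only; the real transformer
    works on DataFrames and may use several control features.
    """
    n = len(feature)
    mean_x = sum(control) / n
    mean_y = sum(feature) / n
    # Closed-form simple-regression slope and intercept
    sxx = sum((x - mean_x) ** 2 for x in control)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(control, feature))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual = observed value minus the value predicted from the control
    return [y - (intercept + slope * x) for x, y in zip(control, feature)]


control = [1.0, 2.0, 3.0, 4.0]
feature = [1.0, 3.0, 2.0, 6.0]
residuals = residualize(feature, control)
```

By construction the residuals sum to zero and are uncorrelated with the control, so whatever variation remains is not linearly explained by it.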
- class ml4qc.surveyml.SurveyML(x_train_df: DataFrame, y_train_df: DataFrame, x_predict_df: Optional[DataFrame] = None, test_size: Optional[Union[float, int]] = None, cv_when_training: bool = False, control_features: Optional[list[str]] = None, n_jobs: int = -2, random_state: Optional[Union[int, RandomState]] = None, categorical_features: Optional[list[str]] = None, verbose: Optional[bool] = None)
Bases: object
Base class for using machine learning techniques on survey data.
- __init__(x_train_df: DataFrame, y_train_df: DataFrame, x_predict_df: Optional[DataFrame] = None, test_size: Optional[Union[float, int]] = None, cv_when_training: bool = False, control_features: Optional[list[str]] = None, n_jobs: int = -2, random_state: Optional[Union[int, RandomState]] = None, categorical_features: Optional[list[str]] = None, verbose: Optional[bool] = None)
Initialize survey data for machine learning.
- Parameters
x_train_df (pd.DataFrame) – Features for training dataset
y_train_df (pd.DataFrame) – Target(s) for training dataset
x_predict_df (pd.DataFrame) – Prediction dataset (required unless test_size used to take test set from training set)
test_size (Union[float, int]) – Float (0, 1) for proportion of training dataset to use for testing; int for number of training rows to use for testing; otherwise None to specify prediction set manually in x_predict_df
cv_when_training (bool) – True to cross-validate when training models
control_features (list[str]) – List of features that should be used as controls (to transform all other features into their residuals, once variation from these features is controlled out via OLS)
n_jobs (int) – Number of parallel jobs to run during cross-validation (-1 for as many jobs as CPUs, -2 to leave one CPU free)
random_state (Union[int, RandomState]) – Fixed random state for reproducible results, otherwise None for random execution
categorical_features (list[str]) – List of feature names to force to categorical type (“other”), regardless of how they auto-detect (e.g., for categorical features that might auto-classify as numeric), otherwise None
verbose (bool) – True to report verbose results with print() calls
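The `test_size` and `n_jobs` conventions above can be made concrete with a short sketch. Both helper names are hypothetical, and the sketch assumes the class follows the usual scikit-learn and joblib conventions (a proportion is rounded up to whole rows; negative job counts are counted back from the number of CPUs), which this page does not confirm.

```python
import math


def resolve_test_rows(test_size, n_rows):
    """Hypothetical helper: rows held out of the training set for testing."""
    if test_size is None:
        # No split: the prediction set is supplied via x_predict_df
        return 0
    if isinstance(test_size, float):
        if not 0.0 < test_size < 1.0:
            raise ValueError("float test_size must be in (0, 1)")
        # Assumption: proportions round up, as in scikit-learn's train_test_split
        return math.ceil(test_size * n_rows)
    if not 0 < test_size < n_rows:
        raise ValueError("int test_size must be between 1 and n_rows - 1")
    return test_size


def effective_jobs(n_jobs, cpu_count):
    """Hypothetical helper: worker count under the joblib convention."""
    if n_jobs == 0:
        raise ValueError("n_jobs must not be 0")
    if n_jobs < 0:
        # -1 uses every CPU, -2 leaves one free, and so on (never below 1)
        return max(cpu_count + 1 + n_jobs, 1)
    return n_jobs
```

For example, on an 8-CPU machine the default `n_jobs=-2` would resolve to 7 workers under this convention.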
- benchmark_by_category(category_df: DataFrame, benchmark_categories: list[str], method: str = 'knn', n_nearest_neighbors: int = 10, reg_strength: float = 0.0001, variance_to_retain: float = 1.0) → DataFrame
Benchmark by category (e.g., by enumerator) using the full dataset (training+prediction together). Uses classification method to score observations as more or less like the identified category or categories to benchmark against (e.g., one or more star enumerators).
- Parameters
category_df (pd.DataFrame) – Category column for benchmarking, in a Pandas DataFrame indexed with the same index as the training and prediction data used to initialize the object
benchmark_categories (list[str]) – List of specific categories to benchmark against (e.g., one or more star enumerator IDs, if benchmarking by enumerator)
method (str) – Method to use for scoring (‘knn’ for K nearest neighbors, ‘logistic’ for logistic regression)
n_nearest_neighbors (int) – If method is ‘knn’, number of nearest neighbors to consider (not including self); the larger this is, the more it skews toward categories with more observations in the dataset
reg_strength (float) – If method is ‘logistic’, C value for regularization strength to use for L2 regularization; given that we’re classifying within the training set, a larger value will tend toward a perfect fit
variance_to_retain (float) – Percent variance to retain, with value between 0 and 1 to use PCA for dimensionality reduction and 1.0 to use all features
- Returns
DataFrame with category-specific scores, sorted highest first
- Return type
pd.DataFrame
- classify_by_category(category_df: DataFrame, method: str = 'knn', n_nearest_neighbors: int = 10, reg_strength: float = 0.0001, variance_to_retain: float = 1.0) → DataFrame
Classify by category (e.g., by enumerator) using the full dataset (training+prediction together).
- Parameters
category_df (pd.DataFrame) – Category column for classification, in a Pandas DataFrame indexed with the same index as the training and prediction data used to initialize the object
method (str) – Method to use for classification (‘knn’ for K nearest neighbors, ‘logistic’ for logistic regression)
n_nearest_neighbors (int) – If method is ‘knn’, number of nearest neighbors to consider (not including self); the larger this is, the more it skews toward categories with more observations in the dataset
reg_strength (float) – If method is ‘logistic’, C value for regularization strength to use for L2 regularization; given that we’re classifying within the training set, a larger value will tend toward a perfect fit
variance_to_retain (float) – Percent variance to retain, with value between 0 and 1 to use PCA for dimensionality reduction and 1.0 to use all features
- Returns
DataFrame with category predictions for each observation in the dataset
- Return type
pd.DataFrame
- static columns_by_type(df: DataFrame, force_to_other: Optional[list[str]] = None) → dict
Get DataFrame columns by data type.
- Parameters
df (pd.DataFrame) – DataFrame with columns to categorize
force_to_other (list[str]) – List of column names to force to “other” (e.g., for categorical columns that might auto-classify as numeric), otherwise None
- Returns
Dictionary with six lists of column names: “numeric”, “numeric_binary”, “numeric_unit_interval”, “numeric_other”, “datetime”, “other”
- Return type
dict
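The key names in the returned dictionary suggest how numeric columns are subdivided, but this page does not state the rules. The sketch below is only a guess at plausible rules, applied to one column's values held in a plain list; the function name `numeric_subtype` and the thresholds are assumptions, not the library's logic.

```python
def numeric_subtype(values):
    """Hypothetical rule set: label one numeric column by its value range."""
    # Ignore missing values when inspecting the range (assumption)
    present = [v for v in values if v is not None]
    if present and all(v in (0, 1) for v in present):
        return "numeric_binary"
    if present and all(0 <= v <= 1 for v in present):
        return "numeric_unit_interval"
    return "numeric_other"
```

Under these assumed rules, a column coded as 1/2/3 for a categorical answer would land in "numeric_other", which is the kind of case the force_to_other parameter is meant to override.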
- features_by_type(force_to_other: Optional[list[str]] = None) → dict
Get features by data type.
- Parameters
force_to_other (list[str]) – List of feature names to force to “other” (e.g., for categorical features that might auto-classify as numeric), otherwise None
- Returns
Dictionary with six lists of column names: “numeric”, “numeric_binary”, “numeric_unit_interval”, “numeric_other”, “datetime”, “other”
- Return type
dict
- identify_clusters(min_clusters: int = 2, max_clusters: int = 10, constrain_cluster_size: bool = False, variance_to_retain: float = 1.0, separate_outliers: bool = True) → DataFrame
Identify clusters in the full dataset (training+prediction together).
- Parameters
min_clusters (int) – Minimum number of clusters
max_clusters (int) – Maximum number of clusters
constrain_cluster_size (bool) – True to constrain cluster size such that clusters are at least 1/2 the average size and at most 2x the average size
variance_to_retain (float) – Percent variance to retain, with value between 0 and 1 to use PCA for dimensionality reduction and 1.0 to use all features
separate_outliers (bool) – True to separate outliers into their own cluster (can help to better define other clusters)
- Returns
DataFrame with a cluster column that identifies clusters, indexed with the same index as the training and/or prediction data
- Return type
pd.DataFrame
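The size constraint described for constrain_cluster_size can be checked with a few lines of Python. This is a sketch of the stated rule only (each cluster between half and twice the average size); the function name `cluster_sizes_ok` is hypothetical, and how the method itself enforces the constraint is not documented here.

```python
from collections import Counter


def cluster_sizes_ok(labels):
    """Check that every cluster has between 1/2x and 2x the average size."""
    sizes = Counter(labels)
    average = len(labels) / len(sizes)
    return all(average / 2 <= size <= 2 * average for size in sizes.values())


balanced = [0] * 8 + [1] * 4 + [2] * 3   # average size 5, sizes 8, 4, 3
lopsided = [0] * 9 + [1] * 1             # average size 5, sizes 9, 1
```

Here `balanced` satisfies the rule, while `lopsided` fails because its smallest cluster falls below half the average size.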
- identify_outliers(contamination: Optional[float] = None) → DataFrame
Identify outliers in the full dataset (training+prediction together).
- Parameters
contamination (float) – Proportion (0,1) of the dataset that should be considered outliers, or None for auto
- Returns
DataFrame with an is_outlier column that is True for outliers and False otherwise
- Return type
pd.DataFrame
- static inverse_distance_ignoring_closest(distances)
Distance weighting function that assigns a weight of 0.0 to the closest distance and the inverse of the distance to all others. Useful when you’re using nearest-neighbor methods but predicting from the training set (in which case each observation’s closest match is itself).
- Parameters
distances (Any) – Array of distances (assumed 1D or 2D)
- Returns
Array of weights with 0.0 for the smallest distance in each set, otherwise inverses of the distances (same shape as the distances array passed in)
- Return type
Any
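A minimal NumPy sketch of the weighting described above follows. It is not the library's code: the function name `inverse_distance_weights` is hypothetical, and the small floor that guards against division by zero for duplicate points is an added assumption.

```python
import numpy as np


def inverse_distance_weights(distances):
    """Weight 0.0 for the closest neighbor in each row, 1/distance elsewhere."""
    d = np.asarray(distances, dtype=float)
    rows = np.atleast_2d(d)  # treat a 1D input as a single row
    # Assumption: floor tiny distances to avoid dividing by zero
    weights = 1.0 / np.maximum(rows, 1e-12)
    # Zero out the closest match in each row (the observation itself)
    closest = np.argmin(rows, axis=1)
    weights[np.arange(rows.shape[0]), closest] = 0.0
    return weights.reshape(d.shape)


example = inverse_distance_weights([[0.0, 1.0, 2.0], [0.5, 0.25, 4.0]])
```

A callable with this shape of input and output is what scikit-learn's nearest-neighbor estimators accept for their `weights` argument, which is presumably how such a function would be used.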
- preprocess_for_prediction(pca: Optional[float] = None, custom_pipeline: Optional[Pipeline] = None)
Preprocess data for prediction.
- Parameters
pca (float) – If not None, float between 0 and 1 for amount of variance to retain via PCA dimensionality reduction
custom_pipeline (Pipeline) – Custom preprocessing pipeline, if any (overrides pca parameter)
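To show what "amount of variance to retain" means in practice, the sketch below picks the number of principal components from a list of explained-variance ratios. It is a simplified illustration rather than the method's implementation; the function name `components_to_keep` is hypothetical, and scikit-learn's own PCA uses a closely related rule when given a fractional `n_components`.

```python
def components_to_keep(explained_variance_ratio, variance_to_retain):
    """Smallest number of leading components reaching the target variance."""
    cumulative = 0.0
    for k, ratio in enumerate(explained_variance_ratio, start=1):
        cumulative += ratio
        if cumulative >= variance_to_retain:
            return k
    # Target not reached (e.g., 1.0 with rounding error): keep everything
    return len(explained_variance_ratio)


ratios = [0.5, 0.3, 0.15, 0.05]  # made-up ratios, sorted largest first
```

With these made-up ratios, retaining 90% of the variance keeps three of the four components, and a target of 1.0 keeps all of them.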