1 Introduction
Preprocessing is a critical step in building effective machine learning models. It involves transforming, encoding, scaling, imputing, and selecting features to optimize model performance. These steps directly influence generalization, training speed, regularization, and interpretability. The choice of preprocessing techniques, such as scaling, normalization, or encoding, can significantly impact classification metrics, and different algorithms respond differently to the same pipeline.
Key principles for a reproducible preprocessing pipeline include:
- Preprocessing is part of the model. Use the `recipes` and `workflows` packages in R to ensure transformations are trained on the training set and consistently applied to new data, preventing data leakage or mismatched transformations during prediction.
- Document the purpose of each step. Group preprocessing steps by the issues they address (e.g., skewness, scale, missingness, collinearity) and explain why specific model families require them.
This guide provides a comprehensive overview of preprocessing with the `recipes` package, tailored to different model families, along with a practical baseline recipe and checklist.
2 Quick Reference Table
Below is a concise cheat-sheet for common `recipes` preprocessing steps, their purposes, and the model families they benefit.
Step | Purpose | Helps | Notes |
---|---|---|---|
`step_impute_median()` / `step_impute_mode()` | Handle missing data | All models | Apply early in the pipeline so subsequent steps operate on complete data. |
`step_zv()` / `step_nzv()` | Remove zero or near-zero variance predictors | All models | Speeds up training by eliminating uninformative features. |
`step_date()` | Extract features (e.g., year, month, day) from date columns | All models | Captures predictive signal from temporal data. |
`step_holiday()` | Create binary indicators for holidays | All models | Useful for datasets with seasonal or holiday effects. |
`step_interact()` | Create interaction terms (e.g., `age * income`) | Linear/GLM, GAM, MARS models | Captures non-additive relationships. |
`step_ns()` / `step_bs()` | Generate basis spline expansions | Linear/GLM, GAM, MARS models | Models non-linear relationships within linear models. |
`step_YeoJohnson()` / `step_BoxCox()` | Reduce skewness and approximate normality | Linear/GLM, LDA/QDA, neural networks, naive Bayes, PLS | Improves model stability for skewed data. |
`step_other()` / `step_novel()` | Manage rare categories and unseen levels | All models | Prevents errors during prediction with new categories. |
`step_dummy()` | Convert categorical predictors to numeric indicators | `xgboost`, GLMs, SVMs, neural networks, PLS, MaxEnt | Required for models that expect numeric inputs. |
`step_normalize()` | Center and scale predictors to mean 0, SD 1 | Penalized linear models, SVM, KNN, neural networks, LDA/QDA, PLS | Essential for models sensitive to feature scale. |
`step_corr()` / `step_pca()` | Reduce dimensionality and multicollinearity | Linear/GLMs, LDA, QDA, KNN, PLS | Improves model stability and performance. |
`step_discretize()` | Bin continuous variables into categories | Tree-based models, GLMs, naive Bayes | Simplifies relationships for certain algorithms. |
`step_mutate()` | Create or transform features manually | All models | Allows custom feature engineering (e.g., ratios, log transforms). |
3 How Recipe Steps Affect Model Families
Preprocessing requirements vary by model family due to their differing sensitivities to feature scale, distribution, and structure. Below, we outline recommended steps for each family in the correct order of application.
3.1 Tree-Based and Rule-Based Classification
Behavior: Tree-based models (e.g., `ranger`, `xgboost`, `rpart`, `C5_rules`, `bart`, `bag_tree`, `boost_tree`, `rand_forest`, `rule_fit`) rely on splits based on feature ordering and thresholds, making them robust to monotonic transformations and less sensitive to feature scale. However, they benefit from explicit date features and proper handling of categorical variables.
Recommended steps (in order):
- Impute missing values (`step_impute_median()` / `step_impute_mode()`): Ensures complete data; some tree-based engines (e.g., `xgboost`) can handle missing values internally, but imputation is safer for consistency.
- Extract date features (`step_date()` / `step_holiday()`): Creates features like day of the week or holiday indicators, which trees can use for splits.
- Handle rare categories (`step_other()` / `step_novel()`): Collapses infrequent categorical levels and prepares for unseen categories during prediction.
- Encode categorical variables (`step_dummy()`): Converts categories to numeric indicators, required for engines like `xgboost`, `C5.0`, or `lightgbm`, but optional for `ranger`, `rpart`, or `partykit`.
- Scale features (`step_normalize()`): Only necessary for RuleFit (`xrf`) or similar hybrid models that combine trees with linear components.
- Discretize continuous variables (`step_discretize()`): Optional, but can simplify relationships for certain tree-based models like `C5.0` or `rpart`.
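To make this concrete, here is a minimal sketch of a tree-friendly recipe, assuming a hypothetical training frame `train_data` with outcome `class` and a date column `signup_date`:

```r
library(recipes)

# Minimal sketch of a tree-friendly recipe; `train_data`, `class`, and
# `signup_date` are hypothetical stand-ins.
tree_rec <- recipe(class ~ ., data = train_data) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_date(signup_date, features = c("dow", "month", "year")) |>
  step_holiday(signup_date) |>
  step_rm(signup_date) |>                          # drop the raw date once features are extracted
  step_novel(all_nominal_predictors()) |>
  step_other(all_nominal_predictors(), threshold = 0.01) |>
  step_dummy(all_nominal_predictors())             # required for xgboost; optional for ranger/rpart
```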
3.2 Linear and Generalized Linear Classification Models
Behavior: Linear models (e.g., `glm`, `glmnet`, `multinom_reg`, `logistic_reg`), generalized additive models (`gen_additive_mod`), and multivariate adaptive regression splines (`mars`, `bag_mars`) are highly sensitive to feature scale and assume linear or smooth relationships. Feature engineering, such as splines and interactions, is critical for capturing complex patterns.
Recommended steps (in order):
- Impute missing values (`step_impute_median()` / `step_impute_mode()`): Ensures complete data for model fitting.
- Extract date features (`step_date()` / `step_holiday()`): Creates linear predictors from date columns.
- Add splines or interactions (`step_ns()` / `step_bs()` / `step_interact()`): Captures non-linear and non-additive relationships, boosting performance for `glm`, `glmnet`, `gen_additive_mod`, and `mars`.
- Reduce skewness (`step_YeoJohnson()`): Normalizes skewed numeric predictors for better model stability.
- Encode categorical variables (`step_dummy()`): Converts categorical predictors to numeric format for `glmnet`, `LiblineaR`, `keras`, or `spark`.
- Scale features (`step_normalize()`): Ensures comparable scale for penalized models like `glmnet` or `multinom_reg`.
- Address multicollinearity (`step_corr()` / `step_pca()`): Reduces redundancy among correlated predictors.
- Custom transformations (`step_mutate()`): Allows manual feature engineering, such as creating ratios or log transformations, especially useful for `gen_additive_mod` or `mars`.
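A minimal sketch for a penalized model such as `glmnet` follows; `train_data`, `class`, `age`, and `income` are hypothetical stand-ins, and skew reduction is applied here before the spline expansion so the basis columns stay well-behaved:

```r
# Minimal sketch of a glmnet-style recipe with splines and an interaction.
linear_rec <- recipe(class ~ ., data = train_data) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_YeoJohnson(all_numeric_predictors()) |>
  step_interact(~ age:income) |>                   # non-additive effect; must precede the spline on age
  step_ns(age, deg_free = 4) |>                    # natural spline for a smooth non-linear effect
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.9)
```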
3.3 Kernel-Based Models and k-Nearest Neighbors (SVM, KNN)
Behavior: Kernel-based models (`svm_linear`, `svm_poly`, `svm_rbf`) and KNN (`nearest_neighbor`) rely on distance calculations, making them highly sensitive to feature scale and outliers.
Recommended steps (in order):
- Impute missing values (`step_impute_median()` / `step_impute_mode()`): Ensures complete data for distance calculations.
- Extract date features (`step_date()` / `step_holiday()`): Creates numeric features from dates.
- Reduce skewness (`step_YeoJohnson()`): Minimizes the impact of outliers on distance metrics.
- Encode categorical variables (`step_dummy()`): Converts categorical predictors to numeric format for `LiblineaR`, `kernlab`, or `liquidSVM`.
- Scale features (`step_normalize()` / `step_range()`): Ensures all features are on the same scale, critical for distance-based models.
- Address multicollinearity (`step_pca()`): Reduces dimensionality to improve performance.
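A minimal sketch for an SVM or KNN recipe, again assuming a hypothetical `train_data` with outcome `class`:

```r
# Minimal sketch for distance-based models (SVM, KNN).
svm_rec <- recipe(class ~ ., data = train_data) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_YeoJohnson(all_numeric_predictors()) |>     # tame skew/outliers before distance math
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors()) |>      # equal scale is critical for distances
  step_pca(all_numeric_predictors(), num_comp = 10)
```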
3.4 Neural Networks (MLPs)
Behavior: Neural networks (`mlp`, `bag_mlp`) are sensitive to input scale and benefit from normalized, well-distributed features for stable and faster training.
Recommended steps (in order):
- Impute missing values (`step_impute_median()` / `step_impute_mode()`): Ensures complete data.
- Extract date features (`step_date()` / `step_holiday()`): Creates numeric features from dates.
- Reduce skewness (`step_YeoJohnson()`): Mitigates the impact of skewed distributions and outliers.
- Encode categorical variables (`step_dummy()`): Converts categorical features to numeric format for `brulee`, `keras`, or `nnet`.
- Scale features (`step_normalize()` / `step_range()`): Scales inputs to a small range for stable training.
- Custom transformations (`step_mutate()`): Enables custom feature engineering for complex datasets.
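A minimal sketch for an MLP recipe; `step_range()` is shown here as the [0, 1] alternative to `step_normalize()`, and `train_data`/`class` are hypothetical stand-ins:

```r
# Minimal sketch for an MLP: impute, reduce skew, encode, then
# squash all numeric inputs into [0, 1] for stable training.
mlp_rec <- recipe(class ~ ., data = train_data) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_YeoJohnson(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_range(all_numeric_predictors(), min = 0, max = 1)
```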
3.5 Discriminant Analysis Models
Behavior: Discriminant analysis models, including linear discriminant analysis (`discrim_linear`; engines: `MASS`, `mda`, `sda`, `sparsediscrim`), quadratic discriminant analysis (`discrim_quad`; engines: `MASS`, `sparsediscrim`), flexible discriminant analysis (`discrim_flexible`, `earth` engine), and regularized discriminant analysis (`discrim_regularized`, `klaR` engine), assume multivariate normality (especially LDA and QDA) and are sensitive to feature scale and multicollinearity. They benefit from normalized and decorrelated features.
Recommended steps (in order):
- Impute missing values (`step_impute_median()` / `step_impute_mode()`): Ensures complete data for model fitting.
- Extract date features (`step_date()` / `step_holiday()`): Creates numeric predictors from dates.
- Reduce skewness (`step_YeoJohnson()`): Normalizes skewed predictors, critical for LDA and QDA, which assume normality.
- Encode categorical variables (`step_dummy()`): Converts categorical predictors to numeric format, required for most engines.
- Scale features (`step_normalize()`): Ensures comparable scale, essential for `discrim_linear` and `discrim_quad`.
- Address multicollinearity (`step_corr()` / `step_pca()`): Reduces redundancy, especially important for LDA and QDA to avoid singularity issues.
- Custom transformations (`step_mutate()`): Allows manual feature engineering for complex patterns, particularly for `discrim_flexible`.
3.6 Naive Bayes Models
Behavior: Naive Bayes models (`naive_Bayes`; engines: `klaR`, `naivebayes`) assume feature independence and are sensitive to feature distributions (e.g., Gaussian naive Bayes assumes normality). They handle categorical variables natively but benefit from proper encoding and imputation for numeric-only implementations.
Recommended steps (in order):
- Impute missing values (`step_impute_median()` / `step_impute_mode()`): Ensures complete data.
- Extract date features (`step_date()` / `step_holiday()`): Creates numeric or categorical features from dates.
- Reduce skewness (`step_YeoJohnson()`): Normalizes skewed numeric predictors for Gaussian naive Bayes.
- Encode categorical variables (`step_dummy()`): Required for numeric-only implementations (e.g., `klaR` with certain settings).
- Discretize continuous variables (`step_discretize()`): Optional for non-Gaussian naive Bayes to simplify continuous feature distributions.
- Handle rare categories (`step_other()` / `step_novel()`): Manages infrequent or unseen categorical levels.
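A minimal sketch for a naive Bayes recipe with binned numeric features; `train_data` and `class` are hypothetical stand-ins:

```r
# Minimal sketch for naive Bayes: handle categories, then bin
# numeric predictors into quantile-based groups.
nb_rec <- recipe(class ~ ., data = train_data) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_novel(all_nominal_predictors()) |>
  step_other(all_nominal_predictors(), threshold = 0.05) |>
  step_discretize(all_numeric_predictors(), num_breaks = 4)  # roughly quartile bins
```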
3.7 Partial Least Squares (PLS)
Behavior: Partial least squares (`pls`, `mixOmics` engine) is a dimensionality reduction technique that projects features into a lower-dimensional space, making it sensitive to feature scale and multicollinearity. It is often used for classification tasks with correlated predictors.
Recommended steps (in order):
- Impute missing values (`step_impute_median()` / `step_impute_mode()`): Ensures complete data.
- Extract date features (`step_date()` / `step_holiday()`): Creates numeric predictors from dates.
- Reduce skewness (`step_YeoJohnson()`): Normalizes skewed predictors for better stability.
- Encode categorical variables (`step_dummy()`): Converts categorical predictors to numeric format.
- Scale features (`step_normalize()`): Ensures comparable scale, critical for PLS.
- Address multicollinearity (`step_pca()`): Optional, as PLS inherently handles multicollinearity, but can further reduce dimensionality.
3.8 MaxEnt Model
Behavior: The MaxEnt model (`maxent`, `maxnet` engine, `tidysdm` package) is used in species distribution modeling and is robust to some feature transformations but benefits from careful handling of categorical and numeric features.
Recommended steps (in order):
- Impute missing values (`step_impute_median()` / `step_impute_mode()`): Ensures complete data.
- Extract date features (`step_date()` / `step_holiday()`): Creates predictive features from temporal data.
- Reduce skewness (`step_YeoJohnson()`): Optional, to normalize skewed numeric predictors.
- Encode categorical variables (`step_dummy()`): Converts categorical predictors to numeric format for `maxnet`.
- Handle rare categories (`step_other()` / `step_novel()`): Manages infrequent or unseen categorical levels.
- Scale features (`step_normalize()`): Optional, but can improve model stability for numeric predictors.
3.9 Null Model
Behavior: The null model (`null_model`, `parsnip` engine) is a baseline that predicts the majority class or mean, ignoring predictors. It requires minimal preprocessing.
Recommended steps (in order):
- Impute missing values (`step_impute_median()` / `step_impute_mode()`): Ensures compatibility with the workflow, though predictors are not used.
- Extract date features (`step_date()` / `step_holiday()`): Optional, only if the target variable depends on temporal features.
4 A Strategic Baseline Recipe
A robust, general-purpose preprocessing pipeline suitable for most tabular classification problems includes the following steps, applied in this order:
- Impute missing values (`step_impute_median()` / `step_impute_mode()`): Handles missing data.
- Remove zero-variance predictors (`step_zv()`): Eliminates uninformative features.
- Extract date features (`step_date()` / `step_holiday()`): Creates predictive features from dates.
- Add splines or interactions (`step_ns()` / `step_interact()`): Enhances linear models, GAMs, and MARS (optional for trees).
- Reduce skewness (`step_YeoJohnson()`): Normalizes skewed numeric predictors.
- Handle rare categories (`step_other()` / `step_novel()`): Manages categorical predictors.
- Encode categorical variables (`step_dummy()`): Prepares data for numeric-only models.
- Scale features (`step_normalize()`): Ensures consistent feature scales.
- Custom transformations (`step_mutate()`): Allows tailored feature engineering.
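Assembled as code, the baseline looks roughly like the sketch below; `train_data`, `class`, and `signup_date` are hypothetical stand-ins, and the spline/interaction steps are omitted here but can be added for linear models:

```r
# Minimal sketch of the general-purpose baseline recipe.
baseline_rec <- recipe(class ~ ., data = train_data) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_date(signup_date, features = c("dow", "month", "year")) |>
  step_rm(signup_date) |>
  step_YeoJohnson(all_numeric_predictors()) |>
  step_novel(all_nominal_predictors()) |>
  step_other(all_nominal_predictors(), threshold = 0.01) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())
```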
For specific model families, adjust this baseline:

- Tree-based models (`ranger`, `xgboost`, `C5.0`, `bart`): Skip `step_ns()`, `step_interact()`, and `step_normalize()`.
- Linear models, GAMs, MARS (`glmnet`, `gen_additive_mod`, `mars`): Include all steps, especially `step_normalize()` and `step_pca()`.
- SVM/KNN: Prioritize `step_normalize()` and `step_YeoJohnson()`.
- Neural networks (`mlp`): Emphasize `step_normalize()` and `step_YeoJohnson()`.
- Discriminant analysis (`discrim_linear`, `discrim_quad`): Include `step_normalize()`, `step_YeoJohnson()`, and `step_pca()`.
- Naive Bayes: Include `step_YeoJohnson()` and optional `step_discretize()`.
- PLS: Prioritize `step_normalize()` and `step_YeoJohnson()`.
- MaxEnt: Include `step_dummy()` and optional `step_normalize()`.
- Null model: Minimal steps, primarily `step_impute_*()`.
5 Key Takeaways and Practical Checklist
5.1 Key Takeaways
- Treat preprocessing as part of the model. Integrate `recipes` into a `workflow` to ensure consistent transformations across training and prediction.
- Group transforms by purpose. Organize steps by the problems they address (e.g., missingness, skewness, scale) for clarity and efficiency.
- Scale matters for sensitive models. Penalized linear models, SVM, KNN, neural networks, discriminant analysis, and PLS require proper scaling.
- Understand engine requirements. Check documentation for specific needs, such as `xgboost` requiring numeric inputs or `ranger` handling factors natively.
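To make the first takeaway concrete, here is a minimal sketch of bundling a recipe with a model specification; `baseline_rec` is the recipe sketched in Section 4, and `train_data`/`test_data` are hypothetical stand-ins:

```r
library(parsnip)
library(workflows)

# Bundle preprocessing and model so the recipe is trained on the
# training data and applied identically at prediction time.
wf <- workflow() |>
  add_recipe(baseline_rec) |>
  add_model(logistic_reg(penalty = 0.01) |> set_engine("glmnet"))

wf_fit <- fit(wf, data = train_data)
predict(wf_fit, new_data = test_data)
```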
5.2 Practical Checklist
- Inspect the dataset: Check for skewness, outliers, missingness, and categorical complexity.
- For GLMs, GAMs, MARS, SVM, KNN, neural networks, LDA/QDA, or PLS: Include `step_YeoJohnson()` (if skewed) and `step_normalize()`.
- For `xgboost`, `C5.0`, or `lightgbm`: Ensure categorical variables are encoded (`step_dummy()`) and missing data is imputed (`step_impute_*()`).
- For `ranger`, `rpart`, or `partykit`: Handle missing values, but factors can often remain as-is.
- For date-heavy datasets: Use `step_date()` and `step_holiday()` to extract meaningful features.
- For complex datasets: Consider `step_mutate()` for custom transformations or `step_discretize()` for simplified relationships.
- For naive Bayes: Use `step_discretize()` for non-Gaussian implementations and `step_YeoJohnson()` for Gaussian assumptions.
- For MaxEnt: Ensure categorical encoding with `step_dummy()` and handle missing data.
- For PLS or discriminant analysis: Include `step_normalize()` and `step_pca()` for multicollinearity.
6 Additional Considerations
- Cross-validation compatibility: Ensure preprocessing steps are applied within cross-validation folds to avoid data leakage. The `workflows` package integrates seamlessly with `recipes` for this purpose.
- Feature selection: Beyond `step_zv()` and `step_nzv()`, consider supervised, importance-based filter steps from extension packages (e.g., the third-party `colino` package); `recipes` itself does not ship an importance-based selection step.
- Handling imbalanced data: For classification tasks with imbalanced classes, consider `step_upsample()` or `step_downsample()` from the `themis` package to balance the dataset before modeling. See the sketch after this list.
- Outlier handling: Cap extreme values (winsorizing), for example via `step_mutate()` with `pmin()`/`pmax()`, especially for models sensitive to outliers like SVM, neural networks, or discriminant analysis.
- Pipeline validation: After defining a recipe, use `check_*()` functions (e.g., `check_missing()`, `check_range()`) to validate data integrity before model training.
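As a sketch of the class-imbalance point, assuming a hypothetical `train_data` with outcome `class`:

```r
library(recipes)
library(themis)

# Upsample the minority class to parity with the majority class.
# Resampling steps are skipped when the recipe is baked on new data,
# so the balancing applies only during training.
balanced_rec <- recipe(class ~ ., data = train_data) |>
  step_impute_median(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_upsample(class, over_ratio = 1)
```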