1 Introduction
Preprocessing is a critical step in building effective machine learning models. It involves transforming, encoding, scaling, imputing, and selecting features to optimize model performance. These steps directly influence generalization, training speed, regularization, and interpretability. The choice of preprocessing techniques, such as scaling, normalization, or encoding, can significantly impact classification metrics, and different algorithms respond differently to the same pipeline.
Key principles for a reproducible preprocessing pipeline include:
- Preprocessing is part of the model. Use the `recipes` and `workflows` packages in R to ensure transformations are trained on the training set and consistently applied to new data, preventing data leakage or mismatched transformations during prediction.
- Document the purpose of each step. Group preprocessing steps by the issues they address (e.g., skewness, scale, missingness, collinearity) and explain why specific model families require them.
This guide provides a comprehensive overview of preprocessing with the `recipes` package, tailored to different model families, along with a practical baseline recipe and checklist.
2 Quick Reference Table
Below is a concise cheat-sheet for common `recipes` preprocessing steps, their purposes, and the model families they benefit.
Step | Purpose | Helps | Notes |
---|---|---|---|
`step_impute_median()` / `step_impute_mode()` | Handle missing data | All models | Apply early in the pipeline so subsequent steps operate on complete data. |
`step_zv()` / `step_nzv()` | Remove zero or near-zero variance predictors | All models | Speeds up training by eliminating uninformative features. |
`step_date()` | Extract features (e.g., year, month, day) from date columns | All models | Captures predictive signal from temporal data. |
`step_holiday()` | Create binary indicators for holidays | All models | Useful for datasets with seasonal or holiday effects. |
`step_interact()` | Create interaction terms (e.g., `age * income`) | Linear/GLM, GAM, MARS models | Captures non-additive relationships. |
`step_ns()` / `step_bs()` | Generate basis spline expansions | Linear/GLM, GAM, MARS models | Models non-linear relationships within linear models. |
`step_YeoJohnson()` / `step_BoxCox()` | Reduce skewness and approximate normality | Linear/GLM, LDA/QDA, neural networks, naive Bayes, PLS | Improves model stability for skewed data. |
`step_other()` / `step_novel()` | Manage rare categories and unseen levels | All models | Prevents errors during prediction with new categories. |
`step_dummy()` | Convert categorical predictors to numeric indicators | `xgboost`, GLMs, SVMs, neural networks, PLS, MaxEnt | Required for models that expect numeric inputs. |
`step_normalize()` | Center and scale predictors to mean 0, SD 1 | Penalized linear models, SVM, KNN, neural networks, LDA/QDA, PLS | Essential for models sensitive to feature scale. |
`step_corr()` / `step_pca()` | Reduce dimensionality and multicollinearity | Linear/GLMs, LDA, QDA, KNN, PLS | Improves model stability and performance. |
`step_discretize()` | Bin continuous variables into categories | Tree-based models, GLMs, naive Bayes | Simplifies relationships for certain algorithms. |
`step_mutate()` | Create or transform features manually | All models | Allows custom feature engineering (e.g., ratios, log transforms). |
3 How Recipe Steps Affect Model Families
Preprocessing requirements vary by model family due to their differing sensitivities to feature scale, distribution, and structure. Below, we outline recommended steps for each family in the correct order of application.
3.1 Tree-Based and Rule-Based Classification
Behavior: Tree-based models (e.g., `ranger`, `xgboost`, `rpart`, `C5_rules`, `bart`, `bag_tree`, `boost_tree`, `rand_forest`, `rule_fit`) rely on splits based on feature ordering and thresholds, making them robust to monotonic transformations and less sensitive to feature scale. However, they benefit from explicit date features and proper handling of categorical variables.
Recommended steps (in order):
- Impute missing values (`step_impute_median()` / `step_impute_mode()`): Ensures complete data; some tree-based engines (e.g., `xgboost`) can handle missing values internally, but imputation is safer for consistency.
- Extract date features (`step_date()` / `step_holiday()`): Creates features like day of the week or holiday indicators, which trees can use for splits.
- Handle rare categories (`step_other()` / `step_novel()`): Collapses infrequent categorical levels and prepares for unseen categories during prediction.
- Encode categorical variables (`step_dummy()`): Converts categories to numeric indicators, required for engines like `xgboost`, `C5.0`, or `lightgbm`, but optional for `ranger`, `rpart`, or `partykit`.
- Scale features (`step_normalize()`): Only necessary for RuleFit (`xrf`) or similar hybrid models that combine trees with linear components.
- Discretize continuous variables (`step_discretize()`): Optional, but can simplify relationships for certain tree-based models like `C5.0` or `rpart`.
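To make this concrete, here is a minimal sketch of a tree-friendly recipe, assuming a hypothetical training frame `train_data` with outcome `class` and a date column `signup_date`:

```r
library(recipes)

# Minimal sketch of a tree-friendly recipe; `train_data`, `class`, and
# `signup_date` are hypothetical stand-ins.
tree_rec <- recipe(class ~ ., data = train_data) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_date(signup_date, features = c("dow", "month", "year")) |>
  step_holiday(signup_date) |>
  step_rm(signup_date) |>                          # drop the raw date once features are extracted
  step_novel(all_nominal_predictors()) |>
  step_other(all_nominal_predictors(), threshold = 0.01) |>
  step_dummy(all_nominal_predictors())             # required for xgboost; optional for ranger/rpart
```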
3.2 Linear and Generalized Linear Classification Models
Behavior: Linear models (e.g., `glm`, `glmnet`, `multinom_reg`, `logistic_reg`), generalized additive models (`gen_additive_mod`), and multivariate adaptive regression splines (`mars`, `bag_mars`) are highly sensitive to feature scale and assume linear or smooth relationships. Feature engineering, such as splines and interactions, is critical for capturing complex patterns.
Recommended steps (in order):
- Impute missing values (`step_impute_median()` / `step_impute_mode()`): Ensures complete data for model fitting.
- Extract date features (`step_date()` / `step_holiday()`): Creates linear predictors from date columns.
- Add splines or interactions (`step_ns()` / `step_bs()` / `step_interact()`): Captures non-linear and non-additive relationships, boosting performance for `glm`, `glmnet`, `gen_additive_mod`, and `mars`.
- Reduce skewness (`step_YeoJohnson()`): Normalizes skewed numeric predictors for better model stability.
- Encode categorical variables (`step_dummy()`): Converts categorical predictors to numeric format for `glmnet`, `LiblineaR`, `keras`, or `spark`.
- Scale features (`step_normalize()`): Ensures comparable scale for penalized models like `glmnet` or `multinom_reg`.
- Address multicollinearity (`step_corr()` / `step_pca()`): Reduces redundancy among correlated predictors.
- Custom transformations (`step_mutate()`): Allows manual feature engineering, such as creating ratios or log transformations, especially useful for `gen_additive_mod` or `mars`.
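A minimal sketch for a penalized model such as `glmnet` follows; `train_data`, `class`, `age`, and `income` are hypothetical stand-ins, and skew reduction is applied here before the spline expansion so the basis columns stay well-behaved:

```r
# Minimal sketch of a glmnet-style recipe with splines and an interaction.
linear_rec <- recipe(class ~ ., data = train_data) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_YeoJohnson(all_numeric_predictors()) |>
  step_interact(~ age:income) |>                   # non-additive effect; must precede the spline on age
  step_ns(age, deg_free = 4) |>                    # natural spline for a smooth non-linear effect
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.9)
```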
3.3 Kernel-Based Models and k-Nearest Neighbors (SVM, KNN)
Behavior: Kernel-based models (`svm_linear`, `svm_poly`, `svm_rbf`) and KNN (`nearest_neighbor`) rely on distance calculations, making them highly sensitive to feature scale and outliers.
Recommended steps (in order):
- Impute missing values (`step_impute_median()` / `step_impute_mode()`): Ensures complete data for distance calculations.
- Extract date features (`step_date()` / `step_holiday()`): Creates numeric features from dates.
- Reduce skewness (`step_YeoJohnson()`): Minimizes the impact of outliers on distance metrics.
- Encode categorical variables (`step_dummy()`): Converts categorical predictors to numeric format for `LiblineaR`, `kernlab`, or `liquidSVM`.
- Scale features (`step_normalize()` / `step_range()`): Ensures all features are on the same scale, critical for distance-based models.
- Address multicollinearity (`step_pca()`): Reduces dimensionality to improve performance.
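A minimal sketch for an SVM or KNN recipe, again assuming a hypothetical `train_data` with outcome `class`:

```r
# Minimal sketch for distance-based models (SVM, KNN).
svm_rec <- recipe(class ~ ., data = train_data) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_YeoJohnson(all_numeric_predictors()) |>     # tame skew/outliers before distance math
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors()) |>      # equal scale is critical for distances
  step_pca(all_numeric_predictors(), num_comp = 10)
```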
3.4 Neural Networks (MLPs)
Behavior: Neural networks (`mlp`, `bag_mlp`) are sensitive to input scale and benefit from normalized, well-distributed features for stable and faster training.
Recommended steps (in order):
- Impute missing values (`step_impute_median()` / `step_impute_mode()`): Ensures complete data.
- Extract date features (`step_date()` / `step_holiday()`): Creates numeric features from dates.
- Reduce skewness (`step_YeoJohnson()`): Mitigates the impact of skewed distributions and outliers.
- Encode categorical variables (`step_dummy()`): Converts categorical features to numeric format for `brulee`, `keras`, or `nnet`.
- Scale features (`step_normalize()` / `step_range()`): Scales inputs to a small range for stable training.
- Custom transformations (`step_mutate()`): Enables custom feature engineering for complex datasets.
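A minimal sketch for an MLP recipe; `step_range()` is shown here as the [0, 1] alternative to `step_normalize()`, and `train_data`/`class` are hypothetical stand-ins:

```r
# Minimal sketch for an MLP: impute, reduce skew, encode, then
# squash all numeric inputs into [0, 1] for stable training.
mlp_rec <- recipe(class ~ ., data = train_data) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_YeoJohnson(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_range(all_numeric_predictors(), min = 0, max = 1)
```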
3.5 Discriminant Analysis Models
Behavior: Discriminant analysis models, including linear discriminant analysis (`discrim_linear`; engines: `MASS`, `mda`, `sda`, `sparsediscrim`), quadratic discriminant analysis (`discrim_quad`; engines: `MASS`, `sparsediscrim`), flexible discriminant analysis (`discrim_flexible`, `earth` engine), and regularized discriminant analysis (`discrim_regularized`, `klaR` engine), assume multivariate normality (especially LDA and QDA) and are sensitive to feature scale and multicollinearity. They benefit from normalized and decorrelated features.
Recommended steps (in order):
- Impute missing values (`step_impute_median()` / `step_impute_mode()`): Ensures complete data for model fitting.
- Extract date features (`step_date()` / `step_holiday()`): Creates numeric predictors from dates.
- Reduce skewness (`step_YeoJohnson()`): Normalizes skewed predictors, critical for LDA and QDA, which assume normality.
- Encode categorical variables (`step_dummy()`): Converts categorical predictors to numeric format, required for most engines.
- Scale features (`step_normalize()`): Ensures comparable scale, essential for `discrim_linear` and `discrim_quad`.
- Address multicollinearity (`step_corr()` / `step_pca()`): Reduces redundancy, especially important for LDA and QDA to avoid singularity issues.
- Custom transformations (`step_mutate()`): Allows manual feature engineering for complex patterns, particularly for `discrim_flexible`.
3.6 Naive Bayes Models
Behavior: Naive Bayes models (`naive_Bayes`; engines: `klaR`, `naivebayes`) assume feature independence and are sensitive to feature distributions (e.g., Gaussian naive Bayes assumes normality). They handle categorical variables natively but benefit from proper encoding and imputation for numeric-only implementations.
Recommended steps (in order):
- Impute missing values (`step_impute_median()` / `step_impute_mode()`): Ensures complete data.
- Extract date features (`step_date()` / `step_holiday()`): Creates numeric or categorical features from dates.
- Reduce skewness (`step_YeoJohnson()`): Normalizes skewed numeric predictors for Gaussian naive Bayes.
- Encode categorical variables (`step_dummy()`): Required for numeric-only implementations (e.g., `klaR` with certain settings).
- Discretize continuous variables (`step_discretize()`): Optional for non-Gaussian naive Bayes to simplify continuous feature distributions.
- Handle rare categories (`step_other()` / `step_novel()`): Manages infrequent or unseen categorical levels.
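A minimal sketch for a naive Bayes recipe with binned numeric features; `train_data` and `class` are hypothetical stand-ins:

```r
# Minimal sketch for naive Bayes: handle categories, then bin
# numeric predictors into quantile-based groups.
nb_rec <- recipe(class ~ ., data = train_data) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_novel(all_nominal_predictors()) |>
  step_other(all_nominal_predictors(), threshold = 0.05) |>
  step_discretize(all_numeric_predictors(), num_breaks = 4)  # roughly quartile bins
```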
3.7 Partial Least Squares (PLS)
Behavior: Partial least squares (`pls`, `mixOmics` engine) is a dimensionality reduction technique that projects features into a lower-dimensional space, making it sensitive to feature scale and multicollinearity. It is often used for classification tasks with correlated predictors.
Recommended steps (in order):
- Impute missing values (`step_impute_median()` / `step_impute_mode()`): Ensures complete data.
- Extract date features (`step_date()` / `step_holiday()`): Creates numeric predictors from dates.
- Reduce skewness (`step_YeoJohnson()`): Normalizes skewed predictors for better stability.
- Encode categorical variables (`step_dummy()`): Converts categorical predictors to numeric format.
- Scale features (`step_normalize()`): Ensures comparable scale, critical for PLS.
- Address multicollinearity (`step_pca()`): Optional, as PLS inherently handles multicollinearity, but can further reduce dimensionality.
3.8 MaxEnt Model
Behavior: The MaxEnt model (`maxent`, `maxnet` engine, `tidysdm` package) is used in species distribution modeling and is robust to some feature transformations but benefits from careful handling of categorical and numeric features.
Recommended steps (in order):
- Impute missing values (`step_impute_median()` / `step_impute_mode()`): Ensures complete data.
- Extract date features (`step_date()` / `step_holiday()`): Creates predictive features from temporal data.
- Reduce skewness (`step_YeoJohnson()`): Optional, to normalize skewed numeric predictors.
- Encode categorical variables (`step_dummy()`): Converts categorical predictors to numeric format for `maxnet`.
- Handle rare categories (`step_other()` / `step_novel()`): Manages infrequent or unseen categorical levels.
- Scale features (`step_normalize()`): Optional, but can improve model stability for numeric predictors.
3.9 Null Model
Behavior: The null model (`null_model`, `parsnip` engine) is a baseline that predicts the majority class or mean, ignoring predictors. It requires minimal preprocessing.
Recommended steps (in order):
- Impute missing values (`step_impute_median()` / `step_impute_mode()`): Ensures compatibility with the workflow, though predictors are not used.
- Extract date features (`step_date()` / `step_holiday()`): Optional, only if the target variable depends on temporal features.
4 A Strategic Baseline Recipe
A robust, general-purpose preprocessing pipeline suitable for most tabular classification problems includes the following steps, applied in this order:
- Impute missing values (`step_impute_median()` / `step_impute_mode()`): Handles missing data.
- Remove zero-variance predictors (`step_zv()`): Eliminates uninformative features.
- Extract date features (`step_date()` / `step_holiday()`): Creates predictive features from dates.
- Add splines or interactions (`step_ns()` / `step_interact()`): Enhances linear models, GAMs, and MARS (optional for trees).
- Reduce skewness (`step_YeoJohnson()`): Normalizes skewed numeric predictors.
- Handle rare categories (`step_other()` / `step_novel()`): Manages categorical predictors.
- Encode categorical variables (`step_dummy()`): Prepares data for numeric-only models.
- Scale features (`step_normalize()`): Ensures consistent feature scales.
- Custom transformations (`step_mutate()`): Allows tailored feature engineering.
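Assembled as code, the baseline looks roughly like the sketch below; `train_data`, `class`, and `signup_date` are hypothetical stand-ins, and the spline/interaction steps are omitted here but can be added for linear models:

```r
# Minimal sketch of the general-purpose baseline recipe.
baseline_rec <- recipe(class ~ ., data = train_data) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_date(signup_date, features = c("dow", "month", "year")) |>
  step_rm(signup_date) |>
  step_YeoJohnson(all_numeric_predictors()) |>
  step_novel(all_nominal_predictors()) |>
  step_other(all_nominal_predictors(), threshold = 0.01) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())
```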
For specific model families, adjust this baseline:

- Tree-based models (`ranger`, `xgboost`, `C5.0`, `bart`): Skip `step_ns()`, `step_interact()`, and `step_normalize()`.
- Linear models, GAMs, MARS (`glmnet`, `gen_additive_mod`, `mars`): Include all steps, especially `step_normalize()` and `step_pca()`.
- SVM/KNN: Prioritize `step_normalize()` and `step_YeoJohnson()`.
- Neural networks (`mlp`): Emphasize `step_normalize()` and `step_YeoJohnson()`.
- Discriminant analysis (`discrim_linear`, `discrim_quad`): Include `step_normalize()`, `step_YeoJohnson()`, and `step_pca()`.
- Naive Bayes: Include `step_YeoJohnson()` and optional `step_discretize()`.
- PLS: Prioritize `step_normalize()` and `step_YeoJohnson()`.
- MaxEnt: Include `step_dummy()` and optional `step_normalize()`.
- Null model: Minimal steps, primarily `step_impute_*()`.
5 Key Takeaways and Practical Checklist
5.1 Key Takeaways
- Treat preprocessing as part of the model. Integrate `recipes` into a `workflow` to ensure consistent transformations across training and prediction.
- Group transforms by purpose. Organize steps by the problems they address (e.g., missingness, skewness, scale) for clarity and efficiency.
- Scale matters for sensitive models. Penalized linear models, SVM, KNN, neural networks, discriminant analysis, and PLS require proper scaling.
- Understand engine requirements. Check documentation for specific needs, such as `xgboost` requiring numeric inputs or `ranger` handling factors natively.
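To make the first takeaway concrete, here is a minimal sketch of bundling a recipe with a model specification; `baseline_rec` is the recipe sketched in Section 4, and `train_data`/`test_data` are hypothetical stand-ins:

```r
library(parsnip)
library(workflows)

# Bundle preprocessing and model so the recipe is trained on the
# training data and applied identically at prediction time.
wf <- workflow() |>
  add_recipe(baseline_rec) |>
  add_model(logistic_reg(penalty = 0.01) |> set_engine("glmnet"))

wf_fit <- fit(wf, data = train_data)
predict(wf_fit, new_data = test_data)
```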
5.2 Practical Checklist
- Inspect the dataset: Check for skewness, outliers, missingness, and categorical complexity.
- For GLMs, GAMs, MARS, SVM, KNN, neural networks, LDA/QDA, or PLS: Include `step_YeoJohnson()` (if skewed) and `step_normalize()`.
- For `xgboost`, `C5.0`, or `lightgbm`: Ensure categorical variables are encoded (`step_dummy()`) and missing data is imputed (`step_impute_*()`).
- For `ranger`, `rpart`, or `partykit`: Handle missing values, but factors can often remain as-is.
- For date-heavy datasets: Use `step_date()` and `step_holiday()` to extract meaningful features.
- For complex datasets: Consider `step_mutate()` for custom transformations or `step_discretize()` for simplified relationships.
- For naive Bayes: Use `step_discretize()` for non-Gaussian implementations and `step_YeoJohnson()` for Gaussian assumptions.
- For MaxEnt: Ensure categorical encoding with `step_dummy()` and handle missing data.
- For PLS or discriminant analysis: Include `step_normalize()` and `step_pca()` for multicollinearity.
6 Additional Considerations
- Cross-validation compatibility: Ensure preprocessing steps are applied within cross-validation folds to avoid data leakage. The `workflows` package integrates seamlessly with `recipes` for this purpose.
- Feature selection: Beyond `step_zv()` and `step_nzv()`, consider supervised, importance-based filter steps from extension packages (e.g., the third-party `colino` package); `recipes` itself does not ship an importance-based selection step.
- Handling imbalanced data: For classification tasks with imbalanced classes, consider `step_upsample()` or `step_downsample()` from the `themis` package to balance the dataset before modeling. See the sketch after this list.
- Outlier handling: Cap extreme values (winsorizing), for example via `step_mutate()` with `pmin()`/`pmax()`, especially for models sensitive to outliers like SVM, neural networks, or discriminant analysis.
- Pipeline validation: After defining a recipe, use `check_*()` functions (e.g., `check_missing()`, `check_range()`) to validate data integrity before model training.
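As a sketch of the class-imbalance point, assuming a hypothetical `train_data` with outcome `class`:

```r
library(recipes)
library(themis)

# Upsample the minority class to parity with the majority class.
# Resampling steps are skipped when the recipe is baked on new data,
# so the balancing applies only during training.
balanced_rec <- recipe(class ~ ., data = train_data) |>
  step_impute_median(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_upsample(class, over_ratio = 1)
```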