Data preprocessing is the cornerstone of robust machine learning pipelines. Drawing from Max Kuhn and Kjell Johnson’s Applied Predictive Modeling, this guide explores essential transformations through the {tidymodels} lens. We’ll bridge theory with practical implementation, examining when, why, and how to apply each technique while weighing the tradeoffs.
1 Foundational Transformations
1.1 Centering and Scaling
When to Use:
Models sensitive to predictor magnitude (SVM, KNN, neural networks)
Before dimensionality reduction (PCA) or spatial sign transformations
When predictors have different measurement scales
Why It Matters:
Centers variables around zero (μ=0)
Standardizes variance (σ=1)
Enables meaningful coefficient comparisons
Critical for distance-based calculations and numerical stability
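The summary output below looks like the result of a recipe along these lines (a sketch using {recipes}; the exact generating code is assumed, not shown in the original):

```r
library(recipes)

# Standardize every predictor (mean 0, sd 1); the outcome mpg is left untouched
norm_recipe <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  prep()

scaled <- bake(norm_recipe, new_data = NULL)
summary(scaled)
```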
cyl disp hp drat
Min. :-1.225 Min. :-1.2879 Min. :-1.3810 Min. :-1.5646
1st Qu.:-1.225 1st Qu.:-0.8867 1st Qu.:-0.7320 1st Qu.:-0.9661
Median :-0.105 Median :-0.2777 Median :-0.3455 Median : 0.1841
Mean : 0.000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 1.015 3rd Qu.: 0.7688 3rd Qu.: 0.4859 3rd Qu.: 0.6049
Max. : 1.015 Max. : 1.9468 Max. : 2.7466 Max. : 2.4939
wt qsec vs am
Min. :-1.7418 Min. :-1.87401 Min. :-0.868 Min. :-0.8141
1st Qu.:-0.6500 1st Qu.:-0.53513 1st Qu.:-0.868 1st Qu.:-0.8141
Median : 0.1101 Median :-0.07765 Median :-0.868 Median :-0.8141
Mean : 0.0000 Mean : 0.00000 Mean : 0.000 Mean : 0.0000
3rd Qu.: 0.4014 3rd Qu.: 0.58830 3rd Qu.: 1.116 3rd Qu.: 1.1899
Max. : 2.2553 Max. : 2.82675 Max. : 1.116 Max. : 1.1899
gear carb mpg
Min. :-0.9318 Min. :-1.1222 Min. :10.40
1st Qu.:-0.9318 1st Qu.:-0.5030 1st Qu.:15.43
Median : 0.4236 Median :-0.5030 Median :19.20
Mean : 0.0000 Mean : 0.0000 Mean :20.09
3rd Qu.: 0.4236 3rd Qu.: 0.7352 3rd Qu.:22.80
Max. : 1.7789 Max. : 3.2117 Max. :33.90
Pros:
Required for distance-based algorithms
Improves numerical stability
Facilitates convergence in gradient-based methods
Cons:
Loses original measurement context
Not needed for tree-based models
Sensitive to outlier influence
Warning
Always calculate scaling parameters from training data only to avoid data leakage. Resampling should encapsulate preprocessing steps for honest performance estimation.
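As a concrete guard against leakage, one common pattern (sketched here with {rsample} and {recipes}) estimates the centering and scaling statistics on the training split, then applies them unchanged to the test split:

```r
library(rsample)
library(recipes)

set.seed(123)
split <- initial_split(mtcars, prop = 0.8)
train_data <- training(split)
test_data  <- testing(split)

# Means and standard deviations are learned from the training data only...
rec <- recipe(mpg ~ ., data = train_data) |>
  step_normalize(all_numeric_predictors()) |>
  prep()

# ...and reused verbatim when transforming new data
test_scaled <- bake(rec, new_data = test_data)
```

When the recipe lives inside a workflow, tune::fit_resamples() repeats this prep/bake cycle within every resample automatically, which is what "resampling should encapsulate preprocessing" means in practice.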
1.2 Resolving Skewness
When to Use:
Ratio of largest to smallest value exceeds 20 (max/min > 20)
Markedly right- or left-skewed distributions (|skewness| > 1)
Before fitting models whose assumptions favor roughly symmetric predictors
When preparing for PCA or other variance-sensitive methods
Skewness Formula:
skewness = Σ(xᵢ − x̄)³ / ((n − 1) · v^(3/2)), where v = Σ(xᵢ − x̄)² / (n − 1)
Box-Cox Implementation:
data(ames, package = "modeldata")

skew_recipe <- recipe(Sale_Price ~ Gr_Liv_Area, data = ames) |>
  step_BoxCox(Gr_Liv_Area, limits = c(-2, 2)) |>  # MLE for λ
  prep()

tidy(skew_recipe)  # Lists the trained steps; tidy(skew_recipe, number = 1) shows the selected λ
# A tibble: 1 × 6
number operation type trained skip id
<int> <chr> <chr> <lgl> <lgl> <chr>
1 1 step BoxCox TRUE FALSE BoxCox_jkCMo
# Calculate original skewness
ames |>
  summarize(skewness = moments::skewness(Gr_Liv_Area))
# A tibble: 1 × 1
skewness
<dbl>
1 1.27
Transformation Options:
The Box-Cox family is defined as x* = (x^λ − 1)/λ for λ ≠ 0 and x* = log(x) for λ = 0, so familiar transformations emerge as special cases:
λ=2 → Square
λ=0.5 → Square root
λ=-1 → Inverse
λ=0 → Natural log
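To make the special cases concrete, here is a hand-rolled version of the transform (illustrative only; step_BoxCox() estimates λ by maximum likelihood for you):

```r
# Box-Cox family: (x^lambda - 1) / lambda when lambda != 0, log(x) at lambda = 0
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))  # Box-Cox requires strictly positive values
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

box_cox(c(1, 10, 100), lambda = 0)    # natural log
box_cox(c(4, 9, 16),   lambda = 0.5)  # shifted/scaled square root
```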
Pros:
Data-driven transformation selection
Handles the λ → 0 case gracefully (smoothly reduces to the natural log)
Continuous transformation spectrum
Cons:
Requires strictly positive values
Loses interpretability
Sensitive to outlier influence
2 Advanced Techniques
2.1 Spatial Sign for Outliers
When to Use:
High-dimensional data
Models sensitive to outlier magnitude (linear regression)
When robust scaling isn’t sufficient
Dealing with radial outliers in multidimensional space
Critical Considerations:
Investigate outliers for data entry errors first
Consider cluster validity before removal
Understand missingness mechanism (MCAR/MAR/MNAR)
Implementation:
outlier_recipe <- recipe(Species ~ Sepal.Length + Sepal.Width, data = iris) |>
  step_normalize(all_numeric()) |>    # Mandatory first step
  step_spatialsign(all_numeric()) |>
  prep()

bake(outlier_recipe, new_data = NULL) |>
  ggplot(aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  theme_minimal()
Pros:
Robust to extreme outliers
Maintains relative angles
Non-parametric approach
Cons:
Destroys magnitude information
Requires centered/scaled data
Not suitable for sparse data
2.2 PCA for Data Reduction
Optimal Workflow:
Resolve skewness (Box-Cox/Yeo-Johnson)
Center/scale predictors
Determine components via cross-validation/scree plot
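The three steps above can be sketched as a single recipe (using iris for illustration; num_comp = 2 is a placeholder you would normally tune or justify with a scree plot):

```r
library(recipes)

pca_recipe <- recipe(Species ~ ., data = iris) |>
  step_YeoJohnson(all_numeric_predictors()) |>           # 1. resolve skewness
  step_normalize(all_numeric_predictors()) |>            # 2. center and scale
  step_pca(all_numeric_predictors(), num_comp = 2) |>    # 3. extract components
  prep()

pcs <- bake(pca_recipe, new_data = NULL)
head(pcs)
```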
Effective preprocessing requires understanding your data’s story and your model’s needs. As Kuhn emphasizes:
“Preprocessing decisions should be made with the same care as model selection.”
{tidymodels} provides a cohesive framework to implement these transformations systematically. Remember:
Validate preprocessing via nested resampling
Document transformations for reproducibility
Monitor model applicability domain
Consider ethical implications of engineering choices
By mastering these techniques, you’ll transform raw data into model-ready features while avoiding common pitfalls. The art lies in balancing mathematical rigor with practical implementation, a balance {tidymodels} helps achieve elegantly.