Data Preprocessing with {tidymodels} and R

Author

Angel Feliz

Published

April 27, 2025

Introduction

Data preprocessing is the cornerstone of robust machine learning pipelines. Drawing from Max Kuhn’s Applied Predictive Modeling, this guide explores essential transformations through the {tidymodels} lens. We’ll bridge theory with practical implementation, examining when, why, and how to apply each technique while weighing their tradeoffs.

1 Foundational Transformations

1.1 Centering and Scaling

When to Use:

  • Models sensitive to predictor magnitude (SVM, KNN, neural networks)
  • Before dimensionality reduction (PCA) or spatial sign transformations
  • When predictors have different measurement scales

Why It Matters:

  • Centers variables around zero (μ=0)
  • Standardizes variance (σ=1)
  • Enables meaningful coefficient comparisons
  • Critical for distance-based calculations and numerical stability
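
Concretely, step_normalize() applies the familiar z-score transformation to each predictor,

$$x' = \frac{x - \bar{x}}{s}$$

where $\bar{x}$ and $s$ are the mean and standard deviation estimated from the training data.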

Implementation:

library(tidymodels)

norm_recipe <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  prep()

bake(norm_recipe, new_data = NULL) |> summary()
      cyl              disp               hp               drat        
 Min.   :-1.225   Min.   :-1.2879   Min.   :-1.3810   Min.   :-1.5646  
 1st Qu.:-1.225   1st Qu.:-0.8867   1st Qu.:-0.7320   1st Qu.:-0.9661  
 Median :-0.105   Median :-0.2777   Median :-0.3455   Median : 0.1841  
 Mean   : 0.000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 1.015   3rd Qu.: 0.7688   3rd Qu.: 0.4859   3rd Qu.: 0.6049  
 Max.   : 1.015   Max.   : 1.9468   Max.   : 2.7466   Max.   : 2.4939  
       wt               qsec                vs               am         
 Min.   :-1.7418   Min.   :-1.87401   Min.   :-0.868   Min.   :-0.8141  
 1st Qu.:-0.6500   1st Qu.:-0.53513   1st Qu.:-0.868   1st Qu.:-0.8141  
 Median : 0.1101   Median :-0.07765   Median :-0.868   Median :-0.8141  
 Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.000   Mean   : 0.0000  
 3rd Qu.: 0.4014   3rd Qu.: 0.58830   3rd Qu.: 1.116   3rd Qu.: 1.1899  
 Max.   : 2.2553   Max.   : 2.82675   Max.   : 1.116   Max.   : 1.1899  
      gear              carb              mpg       
 Min.   :-0.9318   Min.   :-1.1222   Min.   :10.40  
 1st Qu.:-0.9318   1st Qu.:-0.5030   1st Qu.:15.43  
 Median : 0.4236   Median :-0.5030   Median :19.20  
 Mean   : 0.0000   Mean   : 0.0000   Mean   :20.09  
 3rd Qu.: 0.4236   3rd Qu.: 0.7352   3rd Qu.:22.80  
 Max.   : 1.7789   Max.   : 3.2117   Max.   :33.90  

Pros:

  • Required for distance-based algorithms
  • Improves numerical stability
  • Facilitates convergence in gradient-based methods

Cons:

  • Loses original measurement context
  • Not needed for tree-based models
  • Sensitive to outlier influence

Warning

Always calculate scaling parameters from training data only to avoid data leakage. Resampling should encapsulate preprocessing steps for honest performance estimation.
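
A minimal sketch of a leakage-safe setup (the split object below is illustrative): prep the recipe on the training rows only, then apply the stored statistics to the held-out data.

set.seed(123)
car_split <- initial_split(mtcars)

train_recipe <- recipe(mpg ~ ., data = training(car_split)) |>
  step_normalize(all_numeric_predictors()) |>
  prep() # means/SDs estimated on the training rows only

bake(train_recipe, new_data = testing(car_split)) # test set scaled with training statistics

For full resampling, add the unprepped recipe to a workflow() and fit with fit_resamples(); the normalization statistics are then re-estimated inside every resample.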

1.2 Resolving Skewness

When to Use:

  • Ratio of the largest to smallest value > 20 (max/min)
  • Right/left-tailed distributions (|skewness| > 1)
  • Before fitting models that assume roughly symmetric predictors (e.g., linear models)
  • When preparing for PCA or other variance-sensitive methods

Skewness Formula:

$$\text{skewness} = \frac{\sum (x_i - \bar{x})^3}{(n-1)\, v^{3/2}}, \qquad \text{where } v = \frac{\sum (x_i - \bar{x})^2}{n-1}$$

Box-Cox Implementation:

data(ames, package = "modeldata")

skew_recipe <- recipe(Sale_Price ~ Gr_Liv_Area, data = ames) |>
  step_BoxCox(Gr_Liv_Area, limits = c(-2, 2)) |> # MLE for λ
  prep()

tidy(skew_recipe) # Lists the trained steps; tidy(skew_recipe, number = 1) returns the estimated λ
# A tibble: 1 × 6
  number operation type   trained skip  id          
   <int> <chr>     <chr>  <lgl>   <lgl> <chr>       
1      1 step      BoxCox TRUE    FALSE BoxCox_jkCMo
# Calculate original skewness
ames |> 
  summarize(skewness = moments::skewness(Gr_Liv_Area))
# A tibble: 1 × 1
  skewness
     <dbl>
1     1.27
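
To confirm the transformation helped, the skewness can be recomputed on the baked (transformed) column; with a well-estimated λ it should be much closer to zero than the original 1.27.

bake(skew_recipe, new_data = NULL) |>
  summarize(skewness = moments::skewness(Gr_Liv_Area))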

Transformation Options:

  • λ=2 → Square
  • λ=0.5 → Square root
  • λ=-1 → Inverse
  • λ=0 → Natural log

Pros:

  • Data-driven transformation selection
  • Special cases recover familiar transformations (log, square root, inverse)
  • Continuous transformation spectrum

Cons:

  • Requires strictly positive values (use Yeo-Johnson for zeros/negatives; see the sketch below)
  • Loses interpretability
  • Sensitive to outlier influence
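
When predictors contain zeros or negative values, the Yeo-Johnson transformation applies the same maximum-likelihood idea without the positivity restriction. A minimal sketch, reusing the ames data loaded above:

yj_recipe <- recipe(Sale_Price ~ Gr_Liv_Area + Year_Built, data = ames) |>
  step_YeoJohnson(all_numeric_predictors()) |>
  prep()

tidy(yj_recipe, number = 1) # estimated λ for each predictor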

2 Advanced Techniques

2.1 Spatial Sign for Outliers

When to Use:

  • High-dimensional data
  • Models sensitive to outlier magnitude (linear regression)
  • When robust scaling isn’t sufficient
  • Dealing with radial outliers in multidimensional space

Critical Considerations:

  • Investigate outliers for data entry errors first
  • Consider cluster validity before removal
  • Understand missingness mechanism (MCAR/MAR/MNAR)
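
The transformation itself divides each (centered and scaled) observation by its norm, projecting every row onto the unit hypersphere so that only direction, not magnitude, is retained:

$$x^{*}_{ij} = \frac{x_{ij}}{\sqrt{\sum_{j=1}^{p} x_{ij}^{2}}}$$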

Implementation:

outlier_recipe <- recipe(Species ~ Sepal.Length + Sepal.Width, data = iris) |>
  step_normalize(all_numeric()) |> # Mandatory first step
  step_spatialsign(all_numeric()) |>
  prep()

bake(outlier_recipe, new_data = NULL) |>
  ggplot(aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  theme_minimal()

Pros:

  • Robust to extreme outliers
  • Maintains relative angles
  • Non-parametric approach

Cons:

  • Destroys magnitude information
  • Requires centered/scaled data
  • Not suitable for sparse data

2.2 PCA for Data Reduction

Optimal Workflow:

  1. Resolve skewness (Box-Cox/Yeo-Johnson)
  2. Center/scale predictors
  3. Determine components via cross-validation/scree plot
  4. Validate via resampling

Component Selection:

  • Retain components up to the scree-plot elbow
  • Cumulative variance explained of roughly 80-90%
  • Cross-validate performance

Implementation:

pca_recipe <- recipe(Species ~ ., data = iris) |>
  step_normalize(all_numeric()) |>
  step_pca(all_numeric(), num_comp = 4L) |> # or num_comp = tune() for resampling-based selection
  prep()

# Scree plot visualization
pca_vars <- tidy(pca_recipe, 2, type = "variance")

pca_vars |> 
  filter(terms == "percent variance") |>
  ggplot(aes(component, value)) +
  geom_line() +
  geom_point() +
  labs(title = "Scree Plot", y = "% Variance Explained") +
  theme_minimal()

# Component interpretation
tidy(pca_recipe, 2) |> 
  filter(component == "PC1") |> 
  arrange(-abs(value))
# A tibble: 4 × 4
  terms         value component id       
  <chr>         <dbl> <chr>     <chr>    
1 Petal.Length  0.580 PC1       pca_Lb4iQ
2 Petal.Width   0.565 PC1       pca_Lb4iQ
3 Sepal.Length  0.521 PC1       pca_Lb4iQ
4 Sepal.Width  -0.269 PC1       pca_Lb4iQ
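
To apply the cumulative-variance rule of thumb above, the same tidy() call also exposes the running totals (reusing pca_recipe from above):

tidy(pca_recipe, 2, type = "variance") |>
  filter(terms == "cumulative percent variance")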

Pros:

  • Removes multicollinearity
  • Reduces computational load
  • Reveals latent structure

Cons:

  • Loss of interpretability
  • Sensitive to scaling
  • Linear assumptions
  • Unsupervised: supervised alternatives such as PLS may be preferable for outcome-aware reduction (sketch below)
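
A rough sketch of that supervised alternative with step_pls(); this assumes the mixOmics package, which step_pls() relies on, is installed:

pls_recipe <- recipe(Species ~ ., data = iris) |>
  step_normalize(all_numeric_predictors()) |>
  step_pls(all_numeric_predictors(), outcome = "Species", num_comp = 2) |>
  prep()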

3 Handling Data Challenges

3.1 Missing Value Imputation

Critical Considerations:

  • Informative missingness: is the pattern of missing values related to the outcome?
  • Censored data: requires different treatment than MCAR/MAR
  • > 5% missing → consider removing the predictor
  • Type-appropriate methods (KNN vs regression)

Imputation Strategies:

Scenario                  Approach
<5% missing               Median/mode imputation
Continuous predictors     KNN, linear regression, bagging
Categorical predictors    Mode, multinomial logit
High dimensionality       Regularized models, MICE

Implementation:

ames2 <- ames
ames2$Year_Built2 <- ames2$Year_Built

set.seed(5858)
ames2[sample.int(nrow(ames2), 1000), "Year_Built2"] <- NA_integer_
ames2[sample.int(nrow(ames2), 800), "Lot_Frontage"] <- NA_real_

impute_recipe <- recipe(Sale_Price ~ Lot_Frontage + Year_Built2 + Year_Built, data = ames2) |>
  step_impute_knn(Lot_Frontage, neighbors = 3L) |> # or neighbors = tune() to optimize via resampling
  step_impute_linear(Year_Built2, impute_with = imp_vars(Year_Built)) |>
  prep()

# Assess imputation quality
complete_data <- bake(impute_recipe, new_data = ames2)
cor(complete_data$Year_Built, complete_data$Year_Built2, use = "complete.obs")
[1] 1
cor(complete_data$Lot_Frontage, ames$Lot_Frontage, use = "complete.obs")
[1] 0.8296254
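
When the missingness itself may carry signal (the informative-missingness point above), it can also be flagged explicitly before imputing. A minimal sketch using step_indicate_na(), which adds binary na_ind_* indicator columns:

flag_recipe <- recipe(Sale_Price ~ Lot_Frontage + Year_Built2, data = ames2) |>
  step_indicate_na(Lot_Frontage, Year_Built2) |> # keep a record of which rows were missing
  step_impute_median(all_numeric_predictors()) |>
  prep()

bake(flag_recipe, new_data = NULL) |> head()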

3.2 Feature Filtering

Near-Zero Variance Detection:

  • Frequency ratio (most common value / second most common value) > ~20
  • Percent unique values < 10%
  • Percent unique = n_unique / n * 100

nzv_recipe <- recipe(Species ~ ., data = iris) |>
  step_nzv(all_predictors(), freq_cut = 95/5, unique_cut = 10) |>
  prep()

tidy(nzv_recipe)
# A tibble: 1 × 6
  number operation type  trained skip  id       
   <int> <chr>     <chr> <lgl>   <lgl> <chr>    
1      1 step      nzv   TRUE    FALSE nzv_sEXIo

Multicollinearity Handling:

  • Variance Inflation Factor (VIF) > 5-10
  • Pairwise correlation above a chosen threshold (e.g., 0.9)
  • Iterative removal of the predictor with the largest mean absolute correlation

corr_recipe <- recipe(Species ~ ., data = iris) |>
  step_corr(all_numeric(), threshold = 0.9, method = "spearman") |>
  prep()

tidy(corr_recipe)
# A tibble: 1 × 6
  number operation type  trained skip  id        
   <int> <chr>     <chr> <lgl>   <lgl> <chr>     
1      1 step      corr  TRUE    FALSE corr_aKa2d
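
For either filter, tidy() with the step number lists the columns that were actually removed, and bake() confirms which predictors survive:

tidy(corr_recipe, number = 1) # predictors dropped by the correlation filter
bake(corr_recipe, new_data = NULL) |> names()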

4 Strategic Feature Engineering

4.1 Categorical Encoding & Nonlinear Terms

Best Practices:

  • Dummy variables for nominal predictors (reference-cell encoding by default; one-hot optional)
  • Ordered factors for ordinal categories
  • Include interaction terms where domain knowledge suggests
  • Add polynomial terms for known nonlinear relationships

Example:

nonlinear_recipe <- recipe(Species ~ ., data = iris) |>
  step_dummy(all_nominal(), -all_outcomes()) |>  # no-op here: iris has no nominal predictors
  step_poly(Sepal.Length, degree = 2) |>         # quadratic polynomial basis
  step_interact(~ Sepal.Width:Petal.Length) |>   # explicit interaction term
  prep()
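
Because iris has no nominal predictors, the step_dummy() call above does nothing. A sketch on the ames data (loaded earlier) shows the encoding on an actual factor such as Bldg_Type:

dummy_recipe <- recipe(Sale_Price ~ Bldg_Type + Gr_Liv_Area, data = ames) |>
  step_dummy(all_nominal_predictors()) |>
  prep()

bake(dummy_recipe, new_data = NULL) |> names() # one 0/1 column per non-reference level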

4.2 Distance to Class Centroids

When to Use:

  • Classification problems
  • Cluster-aware feature engineering
  • Improving linear separability
  • Augmenting existing feature set

Implementation:

centroid_recipe <- recipe(Species ~ ., data = iris) |>
  step_classdist(all_numeric(), class = "Species", pool = FALSE) |> # distances are returned on the log scale by default
  prep()

bake(centroid_recipe, new_data = NULL) |>
  select(starts_with("classdist_")) |>
  head()
# A tibble: 6 × 3
  classdist_setosa classdist_versicolor classdist_virginica
             <dbl>                <dbl>               <dbl>
1           -0.800                 4.74                5.21
2            0.733                 4.42                5.04
3            0.250                 4.55                5.08
4            0.534                 4.42                4.95
5           -0.272                 4.79                5.22
6            1.31                  4.79                5.21

4.3 Binning Strategies

When to Avoid:

  • Manual binning pre-analysis
  • With tree-based models
  • Small sample sizes
  • When interpretability trumps accuracy

Ethical Considerations:

  • Medical diagnostics require maximum accuracy
  • Legal implications of arbitrary thresholds
  • Potential bias introduction through careless discretization

Smart Discretization:

bin_recipe <- recipe(Sale_Price ~ Gr_Liv_Area, data = ames) |>
  step_discretize(Gr_Liv_Area, num_breaks = 4, min_unique = 10) |>
  prep()

bake(bin_recipe, new_data = NULL) |>
  count(Gr_Liv_Area)
# A tibble: 4 × 2
  Gr_Liv_Area     n
  <fct>       <int>
1 bin1          735
2 bin2          733
3 bin3          729
4 bin4          733

Conclusion

Effective preprocessing requires understanding your data’s story and your model’s needs. As Kuhn emphasizes:

“Preprocessing decisions should be made with the same care as model selection.”

{tidymodels} provides a cohesive framework to implement these transformations systematically. Remember:

  • Validate preprocessing via nested resampling
  • Document transformations for reproducibility
  • Monitor model applicability domain
  • Consider ethical implications of engineering choices

By mastering these techniques, you’ll transform raw data into model-ready features while avoiding common pitfalls. The art lies in balancing mathematical rigor with practical implementation - a balance {tidymodels} helps achieve elegantly.