Data Preprocessing with {tidymodels} and R

Author

Angel Feliz

Published

April 27, 2025

Introduction

Data preprocessing is the cornerstone of robust machine learning pipelines. Drawing from Max Kuhn’s Applied Predictive Modeling, this guide explores essential transformations through the {tidymodels} lens. We’ll bridge theory with practical implementation, examining when, why, and how to apply each technique while weighing their tradeoffs.

1 Foundational Transformations

1.1 Centering and Scaling

When to Use:

  • Models sensitive to predictor magnitude (SVM, KNN, neural networks)
  • Before dimensionality reduction (PCA) or spatial sign transformations
  • When predictors have different measurement scales

Why It Matters:

  • Centers variables around zero (μ=0)
  • Standardizes variance (σ=1)
  • Enables meaningful coefficient comparisons
  • Critical for distance-based calculations and numerical stability
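
Concretely, step_normalize() applies the familiar z-score transformation to each predictor,

$$x' = \frac{x - \bar{x}}{s}$$

where $\bar{x}$ and $s$ are the mean and standard deviation estimated from the training data.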

Implementation:

library(tidymodels)

norm_recipe <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  prep()

bake(norm_recipe, new_data = NULL) |> summary()
      cyl              disp               hp               drat        
 Min.   :-1.225   Min.   :-1.2879   Min.   :-1.3810   Min.   :-1.5646  
 1st Qu.:-1.225   1st Qu.:-0.8867   1st Qu.:-0.7320   1st Qu.:-0.9661  
 Median :-0.105   Median :-0.2777   Median :-0.3455   Median : 0.1841  
 Mean   : 0.000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 1.015   3rd Qu.: 0.7688   3rd Qu.: 0.4859   3rd Qu.: 0.6049  
 Max.   : 1.015   Max.   : 1.9468   Max.   : 2.7466   Max.   : 2.4939  
       wt               qsec                vs               am         
 Min.   :-1.7418   Min.   :-1.87401   Min.   :-0.868   Min.   :-0.8141  
 1st Qu.:-0.6500   1st Qu.:-0.53513   1st Qu.:-0.868   1st Qu.:-0.8141  
 Median : 0.1101   Median :-0.07765   Median :-0.868   Median :-0.8141  
 Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.000   Mean   : 0.0000  
 3rd Qu.: 0.4014   3rd Qu.: 0.58830   3rd Qu.: 1.116   3rd Qu.: 1.1899  
 Max.   : 2.2553   Max.   : 2.82675   Max.   : 1.116   Max.   : 1.1899  
      gear              carb              mpg       
 Min.   :-0.9318   Min.   :-1.1222   Min.   :10.40  
 1st Qu.:-0.9318   1st Qu.:-0.5030   1st Qu.:15.43  
 Median : 0.4236   Median :-0.5030   Median :19.20  
 Mean   : 0.0000   Mean   : 0.0000   Mean   :20.09  
 3rd Qu.: 0.4236   3rd Qu.: 0.7352   3rd Qu.:22.80  
 Max.   : 1.7789   Max.   : 3.2117   Max.   :33.90  

Pros:

  • Required for distance-based algorithms
  • Improves numerical stability
  • Facilitates convergence in gradient-based methods

Cons:

  • Loses original measurement context
  • Not needed for tree-based models
  • Sensitive to outlier influence

Warning

Always calculate scaling parameters from training data only to avoid data leakage. Resampling should encapsulate preprocessing steps for honest performance estimation.
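
A minimal sketch of a leakage-safe setup (the split object below is illustrative): prep the recipe on the training rows only, then apply the stored statistics to the held-out data.

set.seed(123)
car_split <- initial_split(mtcars)

train_recipe <- recipe(mpg ~ ., data = training(car_split)) |>
  step_normalize(all_numeric_predictors()) |>
  prep() # means/SDs estimated on the training rows only

bake(train_recipe, new_data = testing(car_split)) # test set scaled with training statistics

For full resampling, add the unprepped recipe to a workflow() and fit with fit_resamples(); the normalization statistics are then re-estimated inside every resample.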

1.2 Resolving Skewness

When to Use:

  • Ratio of the largest to smallest value > 20 (max/min)
  • Right/left-tailed distributions (|skewness| > 1)
  • Before fitting models that assume roughly symmetric predictors (e.g., linear models)
  • When preparing for PCA or other variance-sensitive methods

Skewness Formula:

$$\text{skewness} = \frac{\sum (x_i - \bar{x})^3}{(n-1)\, v^{3/2}}, \qquad \text{where } v = \frac{\sum (x_i - \bar{x})^2}{n-1}$$

Box-Cox Implementation:

data(ames, package = "modeldata")

skew_recipe <- recipe(Sale_Price ~ Gr_Liv_Area, data = ames) |>
  step_BoxCox(Gr_Liv_Area, limits = c(-2, 2)) |> # MLE for λ
  prep()

tidy(skew_recipe) # Lists the trained steps; tidy(skew_recipe, number = 1) returns the estimated λ
# A tibble: 1 × 6
  number operation type   trained skip  id          
   <int> <chr>     <chr>  <lgl>   <lgl> <chr>       
1      1 step      BoxCox TRUE    FALSE BoxCox_jkCMo
# Calculate original skewness
ames |> 
  summarize(skewness = moments::skewness(Gr_Liv_Area))
# A tibble: 1 × 1
  skewness
     <dbl>
1     1.27
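
To confirm the transformation helped, the skewness can be recomputed on the baked (transformed) column; with a well-estimated λ it should be much closer to zero than the original 1.27.

bake(skew_recipe, new_data = NULL) |>
  summarize(skewness = moments::skewness(Gr_Liv_Area))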

Transformation Options:

  • λ=2 → Square
  • λ=0.5 → Square root
  • λ=-1 → Inverse
  • λ=0 → Natural log

Pros:

  • Data-driven transformation selection
  • Special cases recover familiar transformations (log, square root, inverse)
  • Continuous transformation spectrum

Cons:

  • Requires strictly positive values (use Yeo-Johnson for zeros/negatives; see the sketch below)
  • Loses interpretability
  • Sensitive to outlier influence
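
When predictors contain zeros or negative values, the Yeo-Johnson transformation applies the same maximum-likelihood idea without the positivity restriction. A minimal sketch, reusing the ames data loaded above:

yj_recipe <- recipe(Sale_Price ~ Gr_Liv_Area + Year_Built, data = ames) |>
  step_YeoJohnson(all_numeric_predictors()) |>
  prep()

tidy(yj_recipe, number = 1) # estimated λ for each predictor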

2 Advanced Techniques

2.1 Spatial Sign for Outliers

When to Use:

  • High-dimensional data
  • Models sensitive to outlier magnitude (linear regression)
  • When robust scaling isn’t sufficient
  • Dealing with radial outliers in multidimensional space

Critical Considerations:

  • Investigate outliers for data entry errors first
  • Consider cluster validity before removal
  • Understand missingness mechanism (MCAR/MAR/MNAR)
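
The transformation itself divides each (centered and scaled) observation by its norm, projecting every row onto the unit hypersphere so that only direction, not magnitude, is retained:

$$x^{*}_{ij} = \frac{x_{ij}}{\sqrt{\sum_{j=1}^{p} x_{ij}^{2}}}$$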

Implementation:

outlier_recipe <- recipe(Species ~ Sepal.Length + Sepal.Width, data = iris) |>
  step_normalize(all_numeric()) |> # Mandatory first step
  step_spatialsign(all_numeric()) |>
  prep()

bake(outlier_recipe, new_data = NULL) |>
  ggplot(aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  theme_minimal()

Pros:

  • Robust to extreme outliers
  • Maintains relative angles
  • Non-parametric approach

Cons:

  • Destroys magnitude information
  • Requires centered/scaled data
  • Not suitable for sparse data

2.2 PCA for Data Reduction

Optimal Workflow:

  1. Resolve skewness (Box-Cox/Yeo-Johnson)
  2. Center/scale predictors
  3. Determine components via cross-validation/scree plot
  4. Validate via resampling

Component Selection:

  • Retain components up to the scree-plot elbow
  • Cumulative variance explained of roughly 80-90%
  • Cross-validate performance

Implementation:

pca_recipe <- recipe(Species ~ ., data = iris) |>
  step_normalize(all_numeric()) |>
  step_pca(all_numeric(), num_comp = 4L) |> # or num_comp = tune() for resampling-based selection
  prep()

# Scree plot visualization
pca_vars <- tidy(pca_recipe, 2, type = "variance")

pca_vars |> 
  filter(terms == "percent variance") |>
  ggplot(aes(component, value)) +
  geom_line() +
  geom_point() +
  labs(title = "Scree Plot", y = "% Variance Explained") +
  theme_minimal()

# Component interpretation
tidy(pca_recipe, 2) |> 
  filter(component == "PC1") |> 
  arrange(-abs(value))
# A tibble: 4 × 4
  terms         value component id       
  <chr>         <dbl> <chr>     <chr>    
1 Petal.Length  0.580 PC1       pca_Lb4iQ
2 Petal.Width   0.565 PC1       pca_Lb4iQ
3 Sepal.Length  0.521 PC1       pca_Lb4iQ
4 Sepal.Width  -0.269 PC1       pca_Lb4iQ
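
To apply the cumulative-variance rule of thumb above, the same tidy() call also exposes the running totals (reusing pca_recipe from above):

tidy(pca_recipe, 2, type = "variance") |>
  filter(terms == "cumulative percent variance")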

Pros:

  • Removes multicollinearity
  • Reduces computational load
  • Reveals latent structure

Cons:

  • Loss of interpretability
  • Sensitive to scaling
  • Linear assumptions
  • Unsupervised: supervised alternatives such as PLS may be preferable for outcome-aware reduction (sketch below)
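
A rough sketch of that supervised alternative with step_pls(); this assumes the mixOmics package, which step_pls() relies on, is installed:

pls_recipe <- recipe(Species ~ ., data = iris) |>
  step_normalize(all_numeric_predictors()) |>
  step_pls(all_numeric_predictors(), outcome = "Species", num_comp = 2) |>
  prep()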

3 Handling Data Challenges

3.1 Missing Value Imputation

Critical Considerations:

  • Informative missingness: is the pattern of missing values related to the outcome?
  • Censored data: requires different treatment than MCAR/MAR
  • > 5% missing → consider removing the predictor
  • Type-appropriate methods (KNN vs regression)

Imputation Strategies:

Scenario                  Approach
<5% missing               Median/mode imputation
Continuous predictors     KNN, linear regression, bagging
Categorical predictors    Mode, multinomial logit
High dimensionality       Regularized models, MICE

Implementation:

ames2 <- ames
ames2$Year_Built2 <- ames2$Year_Built

set.seed(5858)
ames2[sample.int(nrow(ames2), 1000), "Year_Built2"] <- NA_integer_
ames2[sample.int(nrow(ames2), 800), "Lot_Frontage"] <- NA_real_

impute_recipe <- recipe(Sale_Price ~ Lot_Frontage + Year_Built2 + Year_Built, data = ames2) |>
  step_impute_knn(Lot_Frontage, neighbors = 3L) |> # or neighbors = tune() to optimize via resampling
  step_impute_linear(Year_Built2, impute_with = imp_vars(Year_Built)) |>
  prep()

# Assess imputation quality
complete_data <- bake(impute_recipe, new_data = ames2)
cor(complete_data$Year_Built, complete_data$Year_Built2, use = "complete.obs")
[1] 1
cor(complete_data$Lot_Frontage, ames$Lot_Frontage, use = "complete.obs")
[1] 0.8296254
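
When the missingness itself may carry signal (the informative-missingness point above), it can also be flagged explicitly before imputing. A minimal sketch using step_indicate_na(), which adds binary na_ind_* indicator columns:

flag_recipe <- recipe(Sale_Price ~ Lot_Frontage + Year_Built2, data = ames2) |>
  step_indicate_na(Lot_Frontage, Year_Built2) |> # keep a record of which rows were missing
  step_impute_median(all_numeric_predictors()) |>
  prep()

bake(flag_recipe, new_data = NULL) |> head()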

3.2 Feature Filtering

Near-Zero Variance Detection:

  • Frequency ratio (most common value / second most common value) > ~20
  • Percent unique values < 10%
  • Percent unique = n_unique / n * 100

nzv_recipe <- recipe(Species ~ ., data = iris) |>
  step_nzv(all_predictors(), freq_cut = 95/5, unique_cut = 10) |>
  prep()

tidy(nzv_recipe)
# A tibble: 1 × 6
  number operation type  trained skip  id       
   <int> <chr>     <chr> <lgl>   <lgl> <chr>    
1      1 step      nzv   TRUE    FALSE nzv_sEXIo

Multicollinearity Handling:

  • Variance Inflation Factor (VIF) > 5-10
  • Pairwise correlation above a chosen threshold (e.g., 0.9)
  • Iterative removal of the predictor with the largest mean absolute correlation

corr_recipe <- recipe(Species ~ ., data = iris) |>
  step_corr(all_numeric(), threshold = 0.9, method = "spearman") |>
  prep()

tidy(corr_recipe)
# A tibble: 1 × 6
  number operation type  trained skip  id        
   <int> <chr>     <chr> <lgl>   <lgl> <chr>     
1      1 step      corr  TRUE    FALSE corr_aKa2d
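
For either filter, tidy() with the step number lists the columns that were actually removed, and bake() confirms which predictors survive:

tidy(corr_recipe, number = 1) # predictors dropped by the correlation filter
bake(corr_recipe, new_data = NULL) |> names()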

4 Strategic Feature Engineering

4.1 Categorical Encoding & Nonlinear Terms

Best Practices:

  • Dummy variables for nominal predictors (reference-cell encoding by default; one-hot optional)
  • Ordered factors for ordinal categories
  • Include interaction terms where domain knowledge suggests
  • Add polynomial terms for known nonlinear relationships

Example:

nonlinear_recipe <- recipe(Species ~ ., data = iris) |>
  step_dummy(all_nominal(), -all_outcomes()) |>  # no-op here: iris has no nominal predictors
  step_poly(Sepal.Length, degree = 2) |>         # quadratic polynomial basis
  step_interact(~ Sepal.Width:Petal.Length) |>   # explicit interaction term
  prep()
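
Because iris has no nominal predictors, the step_dummy() call above does nothing. A sketch on the ames data (loaded earlier) shows the encoding on an actual factor such as Bldg_Type:

dummy_recipe <- recipe(Sale_Price ~ Bldg_Type + Gr_Liv_Area, data = ames) |>
  step_dummy(all_nominal_predictors()) |>
  prep()

bake(dummy_recipe, new_data = NULL) |> names() # one 0/1 column per non-reference level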

4.2 Distance to Class Centroids

When to Use:

  • Classification problems
  • Cluster-aware feature engineering
  • Improving linear separability
  • Augmenting existing feature set

Implementation:

centroid_recipe <- recipe(Species ~ ., data = iris) |>
  step_classdist(all_numeric(), class = "Species", pool = FALSE) |> # distances are returned on the log scale by default
  prep()

bake(centroid_recipe, new_data = NULL) |>
  select(starts_with("classdist_")) |>
  head()
# A tibble: 6 × 3
  classdist_setosa classdist_versicolor classdist_virginica
             <dbl>                <dbl>               <dbl>
1           -0.800                 4.74                5.21
2            0.733                 4.42                5.04
3            0.250                 4.55                5.08
4            0.534                 4.42                4.95
5           -0.272                 4.79                5.22
6            1.31                  4.79                5.21

4.3 Binning Strategies

When to Avoid:

  • Manual binning pre-analysis
  • With tree-based models
  • Small sample sizes
  • When interpretability trumps accuracy

Ethical Considerations:

  • Medical diagnostics require maximum accuracy
  • Legal implications of arbitrary thresholds
  • Potential bias introduction through careless discretization

Smart Discretization:

bin_recipe <- recipe(Sale_Price ~ Gr_Liv_Area, data = ames) |>
  step_discretize(Gr_Liv_Area, num_breaks = 4, min_unique = 10) |>
  prep()

bake(bin_recipe, new_data = NULL) |>
  count(Gr_Liv_Area)
# A tibble: 4 × 2
  Gr_Liv_Area     n
  <fct>       <int>
1 bin1          735
2 bin2          733
3 bin3          729
4 bin4          733

Conclusion

Effective preprocessing requires understanding your data’s story and your model’s needs. As Kuhn emphasizes:

“Preprocessing decisions should be made with the same care as model selection.”

{tidymodels} provides a cohesive framework to implement these transformations systematically. Remember:

  • Validate preprocessing via nested resampling
  • Document transformations for reproducibility
  • Monitor model applicability domain
  • Consider ethical implications of engineering choices

By mastering these techniques, you’ll transform raw data into model-ready features while avoiding common pitfalls. The art lies in balancing mathematical rigor with practical implementation - a balance {tidymodels} helps achieve elegantly.