The Art of Readable R code

Good Practices
Coding Style
Author

Angel Feliz

Published

October 22, 2023

Introduction

When we start our journey as programmers, it’s normal to get excited by the new possibilities. We gain the capacity to do many things that would otherwise be impossible, finishing projects faster and ensuring consistency.

But the problems start when you need to modify a script that you wrote 6 months ago. That’s when you find out that you don’t remember why you were applying some specific filters or calculating a value in an odd way.

As a Reporting Analyst, I am always creating and changing scripts. After applying the tips provided in The Art of Readable Code by Dustin Boswell and Trevor Foucher, I reduced the time needed to apply changes from 5 days to 2 days (60% less time).

1 How do you know if your code is readable?

For code to be readable, it needs to:

  1. Have explicit names for variables, functions and function arguments.

  2. Have comments that explain the reasons behind the code. In the end, the reader should know as much as the writer did.

  3. Be understood without reading it twice.

2 Practical Techniques

Once we know what we want to achieve, it is useful to know some techniques that help us get there. In this article, we will use R for all our examples, together with the datasets::mtcars data.frame, as it is widely used for simple examples.

2.1 Creating explicit names

2.1.1 Naming variables

Defining good variable names is more important than writing a good comment, and we should try to give as much context as possible in the variable name. To make this possible:

  1. Name the variable based on its value.

    • Boolean variables can use words like is, has and should, and they should avoid negations. For example: is_integer, has_money and should_end.
    • Loop indexes can have a descriptive name followed by the suffix _i. For example: club_i and table_i.
  2. Add the unit of measurement as a suffix. For example: price_usd, mass_kg and distance_miles.

  3. Never change a variable’s value in different sections; instead, create a new variable whose name makes the change explicit. For example, we can have the variable price and later create the variable price_discount.

Coding Example

Let’s apply these points to the mtcars data.frame.

mtcars_new_names <-
  c("mpg" = "miles_per_gallon",
    "cyl" = "cylinders_count",
    "disp" = "displacement_in3",
    "hp" = "power_hp",
    "drat" = "rear_axle_ratio",
    "wt" = "weight_klb",
    "qsec" = "quarter_mile_secs",
    "vs" = "engine_is_straight",
    "am" = "transmission_is_manual",
    "gear" = "gear_count",
    "carb" = "carburetor_count")

mtcars_renamed <- mtcars
names(mtcars_renamed) <- mtcars_new_names[names(mtcars)]

str(mtcars_renamed)
'data.frame':   32 obs. of  11 variables:
 $ miles_per_gallon      : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cylinders_count       : num  6 6 4 6 8 6 8 4 4 6 ...
 $ displacement_in3      : num  160 160 108 258 360 ...
 $ power_hp              : num  110 110 93 110 175 105 245 62 95 123 ...
 $ rear_axle_ratio       : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ weight_klb            : num  2.62 2.88 2.32 3.21 3.44 ...
 $ quarter_mile_secs     : num  16.5 17 18.6 19.4 17 ...
 $ engine_is_straight    : num  0 0 1 1 0 1 0 1 1 1 ...
 $ transmission_is_manual: num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear_count            : num  4 4 4 3 3 3 3 4 4 4 ...
 $ carburetor_count      : num  4 4 1 1 2 1 4 2 2 4 ...
Note

Writing good variable names might take some iteration, and you might need to play devil’s advocate in order to find a better name than the initial one.
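The renaming of mtcars covers units and Boolean names. As a complement, here is a minimal sketch of the remaining points (loop indexes and not overwriting a variable). The prices are made up, and price_usd, discount_rate, price_discount_usd and product_i are names assumed only for this illustration.

```r
# Hypothetical prices, used only for illustration
price_usd <- c(bread = 2.50, milk = 1.20, coffee = 8.00)

# Boolean name without a negation: is_expensive rather than is_not_cheap
is_expensive <- price_usd > 5

# The discount gets its own variable instead of overwriting price_usd
discount_rate <- 0.10
price_discount_usd <- price_usd * (1 - discount_rate)

# The loop index name says what we are iterating over
for (product_i in seq_along(price_usd)) {
  cat(names(price_usd)[product_i], ":",
      price_usd[product_i], "->",
      price_discount_usd[product_i], "\n")
}
```

Because price_usd is never modified, we can still compare the original and the discounted prices at any later point of the script.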

2.1.2 Defining functions

Creating explicit function names can transform a complex process into a simple one.

  1. Start the function name with an explicit verb to avoid misunderstandings.

Word    Alternatives
send    deliver, dispatch, announce, distribute, route
find    search, extract, locate, recover
start   launch, create, begin, open
make    create, set up, build, generate, compose, add, new

  2. The function name must describe its output.

  3. A function should do only one thing; otherwise, break it into simpler functions to keep each name explicit.

  4. Use the following words to name range arguments.

Word             Use
min and max      Limits that are both included
first and last   Ranges where both ends are included
begin and end    Ranges where begin is included and end is excluded

Coding Example

keep_rows_in_percentile_range <- function(DF,
                                          var_name,
                                          min_prob,
                                          max_prob){
  
  if(!is.data.frame(DF)) stop("DF should be a data.frame")
  
  values <- DF[[var_name]]
  if(!is.numeric(values)) stop("var_name should be a numeric column of DF")
  
  min_value <- quantile(values, na.rm = TRUE, probs = min_prob)
  max_value <- quantile(values, na.rm = TRUE, probs = max_prob)
  
  value_in_range <- 
    values >= min_value & 
    values <= max_value
  
  # which() drops NA results, so rows with missing values are omitted
  return(DF[which(value_in_range), ])
  
}
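The function above uses min and max because both limits are included. As a complement, the sketch below contrasts that convention with begin and end, where the end position is left out. The helper names keep_values_in_range and keep_elements_from are hypothetical and exist only for this comparison.

```r
# min / max: both limits are included in the result
keep_values_in_range <- function(values, min_value, max_value) {
  values[values >= min_value & values <= max_value]
}

# begin / end: begin_i is included, end_i is excluded
keep_elements_from <- function(values, begin_i, end_i) {
  # seq_len() returns an empty index when begin_i equals end_i
  selected_i <- seq_len(end_i - begin_i) + begin_i - 1
  values[selected_i]
}

values <- c(10, 20, 30, 40, 50)

keep_values_in_range(values, min_value = 20, max_value = 40)
# 20 30 40

keep_elements_from(values, begin_i = 2, end_i = 4)
# 20 30
```

The argument names alone tell the reader whether the upper limit belongs to the result, so there is no need to open the function body to find out.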

2.2 Commenting correctly

The first step toward a well-commented project is a README file that explains how the code works well enough to present the project to a new team member. Beyond that, comments in the code should follow these guidelines:

  1. Explain how custom functions behave in several situations with minimal examples.

  2. Explain the reasons behind the decisions that have been taken related to coding style and business logic, like method and constant selection.

  3. Make pending problems explicit, together with the initial idea we have for solving them.

  4. Avoid commenting bad names; fix them instead.

  5. Summarize sections of code with a description that is faster to read than the code itself.

Coding Example

Let’s comment our custom function to explain each point.

# 1. Behavior
# This function can filter the rows of any data.frame as long as var_name
# is a numeric column. Rows with missing values in that column are omitted.

# 2. Reasons behind decisions
# As we are not expecting to make inferences, imputation is not necessary.

keep_rows_in_percentile_range <- function(DF,
                                          var_name,
                                          min_prob,
                                          max_prob){

  # 5. Reading the code is faster than reading a comment, so we don't need it
  if(!is.data.frame(DF)) stop("DF should be a data.frame")

  # 2. Reasons behind decisions
  # We are going to use this vector many times and 
  # saving it as a variable makes the code much easier to read
  values <- DF[[var_name]]
  
  # 5. Reading the code is faster than reading a comment, so we don't need it
  if(!is.numeric(values)) stop("var_name should be a numeric column of DF")
  
  # 2. Reasons behind decisions
  # Even though a single quantile call could return both values in a vector
  # it is much simpler to understand if we save each value in a variable
  min_value <- quantile(values, na.rm = TRUE, probs = min_prob)
  max_value <- quantile(values, na.rm = TRUE, probs = max_prob)
  
  # 4. The boolean test has an explicit name
  value_in_range <- 
    values >= min_value & 
    values <= max_value
  
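  # 2. Reasons behind decisions
  # which() drops the NA results of the test, so rows with
  # missing values are omitted instead of being returned as NA rows
  return(DF[which(value_in_range), ])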
  
}
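Points 1 and 3 of the list can also be written directly next to a function. The sketch below is only an illustration: convert_klb_to_kg is a hypothetical helper, and the pending problem in the TODO is invented to show the format.

```r
# Converts weights from thousands of pounds (klb) to kilograms.
#
# 1. Behavior
# convert_klb_to_kg(1)        returns 453.59237
# convert_klb_to_kg(c(1, NA)) keeps the missing value as NA
# convert_klb_to_kg("1")      stops with an error
#
# 3. Pending problem
# TODO: negative weights are accepted. The initial idea is to
# stop() when any weight is below zero.
convert_klb_to_kg <- function(weight_klb){

  if(!is.numeric(weight_klb)) stop("weight_klb should be numeric")

  # 2. Reasons behind decisions
  # One pound is defined as exactly 0.45359237 kg,
  # so 1,000 pounds are 453.59237 kg
  kg_per_klb <- 453.59237

  return(weight_klb * kg_per_klb)

}

# Usage with the weight column of mtcars
head(convert_klb_to_kg(mtcars$wt), 3)
```

The behavior examples work as a quick specification: a reader can run each one and confirm that the comment still matches the code.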
Note

Writing good comments can be challenging, so it is better to do it in 3 steps:

  1. Write down whatever comment is on your mind

  2. Read the comment and see what needs to be improved

  3. Make the needed improvements

2.3 Code style

It is important to apply a coding style that makes it easy to scan the code before going into the details of specific parts. Some tips to improve code style are:

  1. Similar code should look similar and be grouped in blocks. This makes spelling mistakes easier to find and prevents repetitive comments.

We can see how this tip was applied in the keep_rows_in_percentile_range function.

min_value <- quantile(values, na.rm = TRUE, probs = min_prob)
max_value <- quantile(values, na.rm = TRUE, probs = max_prob)
  2. Avoid keeping temporary variables in the global environment (.GlobalEnv); instead, create a function that makes the purpose clear, or use pipes (the native pipe |> or magrittr’s %>%).

To ensure this, we created our custom function.

mtcars_renamed |>
  keep_rows_in_percentile_range(var_name = "miles_per_gallon", 
                                min_prob = 0.20,
                                max_prob = 0.50) |>
  nrow()
[1] 11
  3. Avoid writing nested if statements by negating each Boolean test and stopping early.

If we hadn’t taken that into consideration our function would be much harder to read.

keep_rows_in_percentile_range <- function(DF,
                                          var_name,
                                          min_prob,
                                          max_prob){
  
  if(is.data.frame(DF)){
    
    values <- DF[[var_name]]
    
    if(is.numeric(values)){
      
      min_value <- quantile(values, na.rm = TRUE, probs = min_prob)
      max_value <- quantile(values, na.rm = TRUE, probs = max_prob)
      
      value_in_range <- 
        values >= min_value & 
        values <= max_value
      
      return(DF[which(value_in_range), ])
      
    }else{
      
      stop("var_name should be a numeric column of DF")
      
    }

  }else{
    
    stop("DF should be a data.frame")
    
  }
  
}
  4. Boolean validations should be stored in variables to make explicit what the test was about.

  5. Always write constants on the right side of the comparison.

  6. Simplify Boolean comparisons with De Morgan’s law: !(A | B) is equivalent to !A & !B.

In our custom function, min_value and max_value work like constants in the comparisons against values.

value_in_range <- 
  values >= min_value & 
  values <= max_value
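To see De Morgan’s law at work, here is a minimal sketch on mtcars. The thresholds max_weight_klb and min_miles_per_gallon are hypothetical values chosen only for illustration, and the variable names are assumptions of this example.

```r
weight_klb <- mtcars$wt
miles_per_gallon <- mtcars$mpg

# Hypothetical thresholds, chosen only for illustration
max_weight_klb <- 3
min_miles_per_gallon <- 20

# Original form: the reader has to negate an "or" mentally
is_light_and_efficient_negated <-
  !(weight_klb > max_weight_klb | miles_per_gallon < min_miles_per_gallon)

# After De Morgan's law: each test is positive,
# the test has an explicit name and the constants are on the right
is_light_and_efficient <-
  weight_klb <= max_weight_klb &
  miles_per_gallon >= min_miles_per_gallon

# Both versions select exactly the same rows
identical(is_light_and_efficient_negated, is_light_and_efficient)
```

The two versions agree here because mtcars has no missing values in these columns.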

Conclusion

Writing high-quality code is important, but it takes time and practice. At this point, I just want to share the following words by Hadley Wickham.

The only way to write good code is to write tons of shitty code first. Feeling shame about bad code stops you from getting to good code.

I hope you have learnt something new that you find useful.