step_discretize() and step_cut() fail to correctly handle missing values

Open nhward opened this issue 9 months ago • 0 comments

I expect recipe steps to handle missing values in a manner consistent with most R functions. step_center(), for example, propagates them: a missing input yields a missing centered value.

library(recipes)
v <- rnorm(n = 500, mean = 100, sd = 15)
v[1:5] <- NA  # set the first 5 values to missing

df <- data.frame(v)
rec <- recipe(df)
step_center(rec, v) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  print()
# 1  NA    
# 2  NA    
# 3  NA    
# 4  NA    
# 5  NA    
# 6  -8.52 
# 7 -16.1  
# 8   5.19 
# 9   0.804
# 10   5.17 

The problem

step_discretize() and step_cut() cannot pass missing values through as missing values.

step_discretize() has an option called keep_na, which creates an extra level for missing values. I can see where this option might be handy, but personally I would prefer to manage missingness with shadow variables via step_indicate_na(). As things stand, there is no way to produce the expected behaviour. See the reprex below.

In contrast, step_cut() has no documented parameter for managing NA behaviour and reports an error if it encounters any missing values.
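As a stopgap outside the recipe, the extra level can be turned back into NA after baking. This is a minimal base-R sketch, not a recipes feature. It assumes the extra level is named bin_missing, as in the output shown in the reprex, and the factor is hand-built for illustration rather than produced by bake().

```r
# Hand-built stand-in for a baked, discretized column (illustrative only).
baked <- factor(c("bin_missing", "bin_missing", "bin04", "bin07", "bin03"))

# Assigning NA to a level drops that level and turns its values into NA.
levels(baked)[levels(baked) == "bin_missing"] <- NA

print(baked)
```

This only patches the baked data; it does not change what the step itself returns.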

Reproducible example

library(recipes)
v <- rnorm(n = 500, mean = 100, sd = 15)
v[1:5] <- NA  # set the first 5 values to missing

df <- data.frame(v)
rec <- recipe(df)

step_discretize(rec, v, num_breaks = 10, options = list(keep_na = TRUE)) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  print()

## Getting:
# 1 bin_missing
# 2 bin_missing
# 3 bin_missing
# 4 bin_missing
# 5 bin_missing
# 6 bin04      
# 7 bin07      
# 8 bin04      
# 9 bin03      
# 10 bin02 

## Wanting:
# 1 NA
# 2 NA
# 3 NA
# 4 NA
# 5 NA
# 6 bin04      
# 7 bin07      
# 8 bin04      
# 9 bin03      
# 10 bin02 


# Trying with keep_na = FALSE

step_discretize(rec, v, num_breaks = 10, options = list(keep_na = FALSE)) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  print()

## Getting
# Error in `step_discretize()`:
#   Caused by error in `quantile.default()`:
#   ! missing values and NaN's not allowed if 'na.rm' is FALSE

## Expecting something like:
# 1 NA
# 2 NA
# 3 NA
# 4 NA
# 5 NA
# 6 bin04      
# 7 bin07      
# 8 bin04      
# 9 bin03      
# 10 bin02 


## Now repeating with step_cut()
step_cut(rec, v, breaks = c(70,80,90,100,110,120,130), include_outside_range = TRUE) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  print()

## Getting:
# Error in `step_cut()`:
#   Caused by error in `if (min(var) < min(breaks)) ...`:
#   ! missing value where TRUE/FALSE needed

## Expecting something like:
# 1 NA
# 2 NA
# 3 NA
# 4 NA
# 5 NA
# 6 bin04      
# 7 bin07      
# 8 bin04      
# 9 bin03      
# 10 bin02 

In both of the above functions, I believe that consistently applying na.rm = TRUE to the quantile(), min() and max() calls would go a long way towards fixing the problem.
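To illustrate the idea, here is a rough base-R sketch of the behaviour I have in mind. It is not the recipes implementation; the seed, the ten-bin quantile breaks and the bin labels are assumptions chosen to mirror the reprex.

```r
set.seed(1)  # assumed seed, only to make the sketch repeatable
v <- rnorm(n = 500, mean = 100, sd = 15)
v[1:5] <- NA

# quantile() errors on missing values unless na.rm = TRUE.
breaks <- quantile(v, probs = seq(0, 1, length.out = 11), na.rm = TRUE)

# cut() itself leaves missing inputs as NA in the result.
binned <- cut(v, breaks = breaks, include.lowest = TRUE,
              labels = sprintf("bin%02d", 1:10))

# min() returns NA when the input contains NA, so a comparison such as
# min(v) < some_break is NA and cannot be used in if(); na.rm = TRUE avoids that.
lowest_without_na_rm <- min(v)
lowest_with_na_rm <- min(v, na.rm = TRUE)

print(head(binned, 10))
```

The first five elements of binned stay NA, and every non-missing value falls into one of the ten bins.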

Posted by nhward, May 19 '24 07:05