step_discretize() and step_cut() fail to correctly handle missing values
I expect recipe steps to handle missing values in a manner consistent with most R functions. step_center(), for example, handles a missing input by returning a centered value that is also missing.
library(recipes)
v <- rnorm(n = 500, mean = 100, sd = 15)
v[1:5] <- NA #set the first 5 values to missing
df <- data.frame(v)
rec <- recipe(df)
step_center(rec, v) %>%
prep() %>%
bake(new_data = NULL) %>%
print()
# 1 NA
# 2 NA
# 3 NA
# 4 NA
# 5 NA
# 6 -8.52
# 7 -16.1
# 8 5.19
# 9 0.804
# 10 5.17
The problem
step_discretize() and step_cut() cannot pass missing values through as missing values.
step_discretize() has an option called keep_na, which ensures that an extra level is created for missing values. While I can see where this option might be handy, I would personally prefer to manage missingness with shadow variables via step_indicate_na(). Either way, there is currently no way to produce the expected behaviour. See the reprex below.
In contrast, step_cut() has no documented parameter for managing NA behaviour and throws an error if it encounters any missing values.
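For reference, this is the convention I have in mind. The sketch below uses only base R's cut(), not recipes, and the vector x is made up for illustration: missing inputs come back as NA and no extra level is created.

```r
# Base R only: cut() leaves missing inputs as NA instead of erroring
# or adding a dedicated level for them.
x <- c(NA, 85, 104, 97, NA, 121)  # illustrative data, not from the reprex
breaks <- c(70, 80, 90, 100, 110, 120, 130)

binned <- cut(x, breaks = breaks)

is.na(binned)   # TRUE exactly where x was NA
levels(binned)  # only the interval levels, no "missing" level
```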
Reproducible example
library(recipes)
v <- rnorm(n = 500, mean = 100, sd = 15)
v[1:5] <- NA #set the first 5 values to missing
df <- data.frame(v)
rec <- recipe(df)
step_discretize(rec, v, num_breaks = 10, options = list(keep_na = TRUE)) %>%
prep() %>%
bake(new_data = NULL) %>%
print()
## Getting:
# 1 bin_missing
# 2 bin_missing
# 3 bin_missing
# 4 bin_missing
# 5 bin_missing
# 6 bin04
# 7 bin07
# 8 bin04
# 9 bin03
# 10 bin02
## Wanting:
# 1 NA
# 2 NA
# 3 NA
# 4 NA
# 5 NA
# 6 bin04
# 7 bin07
# 8 bin04
# 9 bin03
# 10 bin02
# Trying with keep_na = FALSE
step_discretize(rec, v, num_breaks = 10, options = list(keep_na = FALSE)) %>%
prep() %>%
bake(new_data = NULL) %>%
print()
## Getting
# Error in `step_discretize()`:
# Caused by error in `quantile.default()`:
# ! missing values and NaN's not allowed if 'na.rm' is FALSE
## Expecting something like:
# 1 NA
# 2 NA
# 3 NA
# 4 NA
# 5 NA
# 6 bin04
# 7 bin07
# 8 bin04
# 9 bin03
# 10 bin02
## Now repeating with step_cut()
step_cut(rec, v, breaks = c(70,80,90,100,110,120,130), include_outside_range = TRUE) %>%
prep() %>%
bake(new_data = NULL) %>%
print()
## Getting:
# Error in `step_cut()`:
# Caused by error in `if (min(var) < min(breaks)) ...`:
# ! missing value where TRUE/FALSE needed
## Expecting something like:
# 1 NA
# 2 NA
# 3 NA
# 4 NA
# 5 NA
# 6 bin04
# 7 bin07
# 8 bin04
# 9 bin03
# 10 bin02
For both of the above functions, I believe that consistently passing na.rm = TRUE to the quantile(), min() and max() calls would go a long way towards fixing the problem.
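To make the suggestion concrete, here is a rough sketch in base R. It is not the recipes implementation: discretize_keep_na() and below_range() are hypothetical helpers written for this issue, and I am assuming that num_breaks means the number of bins and that labels follow the binNN pattern seen in the output above.

```r
# Hypothetical helper, not part of recipes: quantile-based binning that
# ignores NAs when computing breaks and leaves them as NA in the result.
discretize_keep_na <- function(x, num_breaks = 4) {
  probs <- seq(0, 1, length.out = num_breaks + 1)
  # na.rm = TRUE is the key change: breaks come from observed values only.
  breaks <- unique(quantile(x, probs = probs, na.rm = TRUE, names = FALSE))
  # Open the outer bins so values beyond the observed range still get a bin.
  breaks[1] <- -Inf
  breaks[length(breaks)] <- Inf
  labels <- sprintf("bin%02d", seq_len(length(breaks) - 1))
  cut(x, breaks = breaks, labels = labels, include.lowest = TRUE)
}

# Hypothetical range check in the spirit of the one that fails in step_cut():
# with na.rm = TRUE the comparison can no longer evaluate to NA.
below_range <- function(x, breaks) {
  min(x, na.rm = TRUE) < min(breaks)
}

set.seed(1)
v <- rnorm(n = 500, mean = 100, sd = 15)
v[1:5] <- NA  # set the first 5 values to missing

binned <- discretize_keep_na(v, num_breaks = 10)
head(binned, 10)  # first five entries are NA, the rest are binNN labels
below_range(v, breaks = c(70, 80, 90, 100, 110, 120, 130))
```

The missing values never enter the break computation, and cut() then carries them through untouched, which is the behaviour I would expect from both steps.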