step_integer() documentation and use with unseen data
The step_integer documentation states:
Description
step_integer() creates a specification of a recipe step that will convert new data into a set of integers based on the original data values.
Naively, I thought that meant each value would be replaced by its integer truncation, as I was looking for data-type conversion steps. I made this mistake because I had not read (more correctly, had forgotten) the Details section, which explains things fully. A (much) better description would be:
step_integer() creates a specification of a recipe step that will convert data into a set of ascending integers based on the ascending order of the original data values.
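To make that description concrete, here is a minimal base-R sketch of the mapping as I understand it from the Details section. It is not the recipes implementation; `to_integer_codes` is a hypothetical helper written only for illustration, and the handling of missing values is my assumption.

```r
# Hypothetical helper, NOT part of recipes: mimics the mapping the
# Details section describes, using base R only.
to_integer_codes <- function(training, new_values) {
  # Ascending unique training values; sort() drops NAs by default
  key <- sort(unique(training))
  # Position of each value in the key; values unseen in training get 0
  codes <- match(new_values, key, nomatch = 0L)
  # Assumption: missing inputs stay missing rather than becoming 0
  codes[is.na(new_values)] <- NA_integer_
  codes
}

training <- c(5.1, 4.9, 5.1, 6.3)
result <- to_integer_codes(training, c(4.9, 5.1, 6.3, 5.0))
print(result)
```

The first three values were seen in training and receive their position in the sorted key; 5.0 was not, so it receives 0 even though it lies between 4.9 and 5.1.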
Once the true nature of the recipe step is made clear, users can ask themselves whether unseen observations can ever be truly passed through this step sensibly.
The code below shows that observations that are not among the 50 training cases are given the integer value zero.
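library(magrittr)  # provides the %>% pipe

# Draw 50 random rows of iris as the training set
train <- iris %>%
  nrow() %>%
  sample.int(size = 50) %>%
  iris[., ]

# Prep on the 50 training rows, then bake all 150 rows of iris
train %>%
  recipes::recipe() %>%
  recipes::step_integer(Sepal.Length) %>%
  recipes::prep(strings_as_factors = FALSE) %>%
  recipes::bake(new_data = iris) %>%
  View()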
Unless I misunderstand something, this recipe step is fundamentally flawed as a step that can process unseen data. The zero_based parameter does not address this problem either. If the goal is to replace the variable with its rank order, then (as things stand) new observations can never be processed sensibly, since neither 0 nor max+1 is a sensible value for a new observation.
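The point about 0 and max+1 can be illustrated with another base-R sketch. It assumes that zero_based = TRUE starts the integers at zero and appends new values as the largest integer, which is how I read the argument's documentation. `rank_codes` is a hypothetical helper, not a recipes function.

```r
# Hypothetical helper, NOT part of recipes: contrasts the two codes an
# unseen value can receive under the behaviour assumed above.
rank_codes <- function(training, new_values, zero_based = FALSE) {
  key <- sort(unique(training))   # ascending unique training values
  pos <- match(new_values, key)   # NA where the value is unseen
  # Note: missing values are treated as unseen in this sketch
  if (zero_based) {
    # Codes run 0..(k - 1); unseen values are appended as k
    ifelse(is.na(pos), length(key), pos - 1L)
  } else {
    # Codes run 1..k; unseen values become 0
    ifelse(is.na(pos), 0L, pos)
  }
}

training <- c(4.9, 5.1, 6.3)
unseen <- c(5.0, 7.0)  # one value inside the training range, one above it

default_codes <- rank_codes(training, unseen)
zero_based_codes <- rank_codes(training, unseen, zero_based = TRUE)
print(default_codes)
print(zero_based_codes)
```

In both settings 5.0 and 7.0 share a single code, and that code sits at one end of the scale regardless of where the value falls among the training values, so the integer no longer reflects rank order.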
Perhaps the description should read:
step_integer() creates a specification of a recipe step that will convert data into a set of ascending integers based on the ascending order of the original data values. Its strict validity is limited to its training data alone.
I hope I am not being a moaner by raising this. I use recipe() all the time and appreciate the work of others. I do fear this recipe step will cause more harm than good, as it is so easy to misuse.