recipes
Standardize names for steps that create dummy variables
The use of dummy in step names has led to some confusion, especially with the addition of step_dummy_multi_choice() and step_dummy_extract(), which have dummy as part of their names, while other steps such as step_regex(), step_count(), step_indicate_na(), and step_holiday(), which do produce dummies, do not.
Before I go any further I'm going to lay down some terminology.
- Dummy variable: A numeric variable that only takes the value 0 or 1 and indicates a categorical effect (also known as an indicator variable).
- Count variable: A numeric variable that indicates the number of occurrences. Can be any whole number.
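To make the two definitions concrete, here is a minimal base R sketch. The `pets` data and the variable names are made up for illustration and are not taken from recipes.

```r
# Hypothetical toy data: each row lists the pets mentioned in one record.
pets <- list(c("cat", "dog", "cat"), "dog", character(0))

# Dummy variable: 1 if the row mentions a cat at all, otherwise 0.
has_cat <- vapply(pets, function(p) as.integer("cat" %in% p), integer(1))

# Count variable: how many times "cat" occurs in each row.
n_cat <- vapply(pets, function(p) sum(p == "cat"), integer(1))
```

The first row mentions a cat twice, so it gets a dummy of 1 but a count of 2; that difference is the whole distinction being drawn here.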
Using the above definition I will say that
- step_dummy() produces a set of dummy variables.
- step_dummy_multi_choice() produces a set of dummy variables.
- step_holiday() produces a set of dummy variables.
- step_dummy_extract() produces a set of count variables.
- step_indicate_na() produces a single dummy variable.
- step_regex() produces a single dummy variable.
- step_count() produces a single count variable (when normalize = FALSE).
A way to standardize the naming would be to rename step_holiday() -> step_dummy_holiday(), step_regex() -> step_dummy_regex(), etc.
Not all dummy steps can have a related count step, but all count steps can have a related dummy step.
I'm not sure what to do naming-wise for steps that produce counts, as there are only step_count() and step_dummy_extract(). step_dummy_extract() could in theory be changed to return dummies instead of counts, with another step called step_count_extract() created to do what step_dummy_extract() does now.
All of the above is using a somewhat loose definition of categorical effect.
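As a rough illustration of why every count step can have a related dummy step, here is a base R sketch of a count-style extraction and its dummy counterpart. It does not use recipes, and the input format (comma-separated strings) and all object names are assumptions made for the example.

```r
# Hypothetical input: one comma-separated string per row.
x <- c("red,blue", "red,red", "green", "")

# Split each row into tokens; the empty string yields no tokens.
tokens <- strsplit(x, ",", fixed = TRUE)
levels_seen <- sort(unique(unlist(tokens)))

# Count version: one column per token, holding the number of occurrences.
counts <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = levels_seen))),
  integer(length(levels_seen))
))
colnames(counts) <- levels_seen

# Dummy version: collapse every positive count to 1.
dummies <- (counts > 0) * 1L
```

Going from `counts` to `dummies` only throws information away, which is why that direction always exists. The reverse does not: a holiday indicator, for example, has no obvious count to recover.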
Two that are especially confusingly named right now are step_dummy_extract() and step_regex(). One option could be:
- step_regex() ➡️ step_detect_regex()
- step_dummy_extract() ➡️ step_extract_regex()
- step_count() ➡️ step_count_regex()
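For what the detect and count verbs would mean in practice, here is a small base R sketch. `detect_regex()` and `count_regex()` are hypothetical helpers named after the proposal; they are not recipes functions and only approximate the behaviour described in this thread.

```r
# Hypothetical helpers mirroring the proposed names; not part of recipes.
detect_regex <- function(x, pattern) {
  # 0/1 dummy: does the pattern match at least once?
  as.integer(grepl(pattern, x))
}

count_regex <- function(x, pattern) {
  # Count: number of non-overlapping matches in each element.
  lengths(regmatches(x, gregexpr(pattern, x)))
}

description <- c("rock and stony rock", "sand", "stony")
detected <- detect_regex(description, "(rock|stony)")
counted <- count_regex(description, "(rock|stony)")
```

With these names the verb alone tells you whether to expect a 0/1 column or a count column, which is the point of the renaming.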
I feel like step_dummy_multi_choice() could be better. One option would be step_dummy_coalesce()?
I'd lean toward keeping step_holiday() as is and then adjusting the title to say it makes dummy variables:
Generate dummy variables for holidays
Another verb we have going on here is indicate, which in my mind is very similar to detect. It would be nice to avoid keeping both around.
> I feel like step_dummy_multi_choice() could be better. One option would be step_dummy_coalesce()?
I like coalesce!
Overall a good idea, but I don't think the benefit of unifying these function names outweighs the annoyance we would cause by changing them.