healthcareai-r
hcai_impute param names
This awesome function takes nominal or numeric params. Ordinal type cols fall between nominal and numeric, and it isn't clear to the user which to use (since behavior changes from knn to bag).
After standardizing, change name to numer_ord_method or provide some note in the docs?
For knnimpute with ordinal column having NAs:
Only nominal_method works (not numeric_method), and fills in without new categories.
numeric_method = 'knnimpute' yields error:
Error in gower_work(x = x, y = y, pair_x = pair_x, pair_y = pair_y, n = n, :
STRING_ELT() can only be applied to a 'character vector', not a 'integer'
Repro steps:
library(healthcareai)
library(recipes)

set.seed(9)
n <- 100
d <- tibble::tibble(patient_id = 1:n,
                    age = sample(c(30:80, NA), size = n, replace = TRUE),
                    hemoglobin_count = rnorm(n, mean = 15, sd = 1),
                    hemoglobin_category = sample(c("Low", "Normal", "High", NA),
                                                 size = n, replace = TRUE),
                    disease = ifelse(hemoglobin_count < 15, "Yes", "No"))

# Create recipe
my_recipe <- recipe(disease ~ ., data = d)
my_recipe <- my_recipe %>%
  hcai_impute(nominal_method = 'knnimpute')
my_recipe

# Train recipe
trained_recipe <- prep(my_recipe, training = d)

# Apply recipe
d_out <- bake(trained_recipe, newdata = d)

# Compare the original and imputed data
d
d_out
For bagimpute with ordinal column having NAs:
Only numeric_method works (not nominal_method), and fills in with new categories.
nominal_method = 'bagimpute' does not give error, but doesn't fill NAs.
library(healthcareai)
library(recipes)

set.seed(9)
n <- 100
d <- tibble::tibble(patient_id = 1:n,
                    age = sample(c(30:80, NA), size = n, replace = TRUE),
                    hemoglobin_count = rnorm(n, mean = 15, sd = 1),
                    hemoglobin_category = sample(c("Low", "Normal", "High", NA),
                                                 size = n, replace = TRUE),
                    disease = ifelse(hemoglobin_count < 15, "Yes", "No"))

# Create recipe
my_recipe <- recipe(disease ~ ., data = d)
my_recipe <- my_recipe %>%
  hcai_impute(numeric_method = 'bagimpute')
my_recipe

# Train recipe
trained_recipe <- prep(my_recipe, training = d)

# Apply recipe
d_out <- bake(trained_recipe, newdata = d)

# Compare the original and imputed data
d
d_out
Thanks for thinking through this and documenting it well @levithatcher. Connected to #857. This is @mmastand's awesome function, so he should weigh in.
Thinking about how we handle ordinal columns in training and prediction: my inclination is to treat them as their underlying integer representation. That works extremely well for tree-based methods. For regression-based methods it requires assuming an equal effect across levels, but the alternative of using them as factors means you give up the orderedness altogether, and you end up with estimates on the dummies rather than on the variable as a whole.
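To make the trade-off concrete, here is a rough base-R sketch on simulated data (the variable names and the simulated effect are made up for illustration, not taken from healthcareai). It fits the same three-level variable once as integer codes and once as a plain factor:

```r
set.seed(9)
lv <- c("Low", "Normal", "High")

# Simulated ordinal predictor stored as integer codes 1..3
x_int <- sample(1:3, size = 100, replace = TRUE)
# Simulated outcome with an equal step between adjacent levels
y <- 2 * x_int + rnorm(100)

# Same information stored as an unordered factor
x_fac <- factor(lv[x_int], levels = lv)

fit_int <- lm(y ~ x_int)  # one slope for the variable as a whole
fit_fac <- lm(y ~ x_fac)  # one dummy estimate per non-reference level

names(coef(fit_int))
names(coef(fit_fac))
```

The integer fit reports a single slope, which is only sensible if each step up the scale has roughly the same effect; the factor fit reports separate estimates for "Normal" and "High" and ignores their order.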
I'd be inclined to treat them as numerics for imputation too. That's easy -- we can convert them at the top of functions, and if we want to use them as numeric predictors, we need numeric values imputed.
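A minimal sketch of what "convert them at the top of functions" could look like, assuming the ordinal columns are stored as ordered factors. The helper name ordered_to_integer is hypothetical and not part of healthcareai:

```r
# Hypothetical helper: replace every ordered-factor column with its
# underlying integer codes; all other columns are left untouched.
ordered_to_integer <- function(d) {
  is_ord <- vapply(d, is.ordered, logical(1))
  d[is_ord] <- lapply(d[is_ord], as.integer)
  d
}

d <- data.frame(
  hemoglobin_category = factor(c("Low", "High", NA, "Normal"),
                               levels = c("Low", "Normal", "High"),
                               ordered = TRUE),
  age = c(41, 67, 55, NA)
)

d_num <- ordered_to_integer(d)
d_num$hemoglobin_category  # integer codes 1, 3, NA, 2
```

After this step the column is numeric with its NAs preserved, so it would be handled by whatever numeric imputation method is chosen. Character columns like the one in the repro above would first need to be turned into ordered factors with explicit levels.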