
hcai_impute param names

Open • levithatcher opened this issue 7 years ago • 1 comment

This awesome function takes nominal or numeric params.

Ordinal columns fall between nominal and numeric, and it isn't clear to the user which parameter to use for them (the behavior differs between knnimpute and bagimpute).

After standardizing the behavior, should we rename the parameter to numer_ord_method, or add a note in the docs?

For knnimpute with ordinal column having NAs:

Only nominal_method works (not numeric_method), and fills in without new categories

numeric_method = 'knnimpute' yields error:

Error in gower_work(x = x, y = y, pair_x = pair_x, pair_y = pair_y, n = n,  : 
  STRING_ELT() can only be applied to a 'character vector', not a 'integer'

Repro steps:

library(healthcareai)  # hcai_impute()
library(recipes)       # recipe(), prep(), bake()

set.seed(9)
n = 100
d <- tibble::tibble(patient_id = 1:n,
                    age = sample(c(30:80, NA), size = n, replace = TRUE),
                    hemoglobin_count = rnorm(n, mean = 15, sd = 1),
                    hemoglobin_category = sample(c("Low", "Normal", "High", NA),
                                                 size = n, replace = TRUE),
                    disease = ifelse(hemoglobin_count < 15, "Yes", "No"))

my_recipe <- recipe(disease ~ ., data = d)

# Create recipe
my_recipe <- my_recipe %>%
  hcai_impute(nominal_method = 'knnimpute')
my_recipe

# Train recipe
trained_recipe <- prep(my_recipe, training = d)

# Apply recipe
d_out <- bake(trained_recipe, newdata = d)

d
d_out

For bagimpute with ordinal column having NAs:

Only numeric_method works (not nominal_method), and fills in with new categories

nominal_method = 'bagimpute' does not give error, but doesn't fill NAs

set.seed(9)
n = 100
d <- tibble::tibble(patient_id = 1:n,
                    age = sample(c(30:80, NA), size = n, replace = TRUE),
                    hemoglobin_count = rnorm(n, mean = 15, sd = 1),
                    hemoglobin_category = sample(c("Low", "Normal", "High", NA),
                                                 size = n, replace = TRUE),
                    disease = ifelse(hemoglobin_count < 15, "Yes", "No"))

my_recipe <- recipe(disease ~ ., data = d)

# Create recipe
my_recipe <- my_recipe %>%
  hcai_impute(numeric_method = 'bagimpute')
my_recipe

# Train recipe
trained_recipe <- prep(my_recipe, training = d)

# Apply recipe
d_out <- bake(trained_recipe, newdata = d)

d
d_out
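
Both symptoms described above (NAs left unfilled, and imputed values that were not among the original categories) can be checked per column with a small base R helper. This is only a sketch: `check_imputation` is a hypothetical name, not part of healthcareai, and the toy vectors stand in for `d$hemoglobin_category` and `d_out$hemoglobin_category`.

```r
# Hypothetical helper (not part of healthcareai): summarise what imputation
# did to a single column.
check_imputation <- function(before, after) {
  before <- as.character(before)
  after <- as.character(after)
  list(
    na_before = sum(is.na(before)),
    na_after = sum(is.na(after)),
    # Values present after imputation that never occurred in the original data
    new_values = setdiff(unique(after[!is.na(after)]),
                         unique(before[!is.na(before)]))
  )
}

# Toy vectors standing in for the column before and after bake()
before <- c("Low", NA, "High", "Normal", NA)
after  <- c("Low", "Normal", "High", "Normal", "Medium")
res <- check_imputation(before, after)
str(res)
```

With the repro data, the equivalent call would be `check_imputation(d$hemoglobin_category, d_out$hemoglobin_category)`.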

levithatcher • Jan 30 '18 22:01

Thanks for thinking through this and documenting it well @levithatcher. Connected to #857. This is @mmastand's awesome function, so he should weigh in.

Thinking about how we handle ordinal columns in training and prediction: my inclination is to treat them as their underlying integer representation. That works extremely well for tree-based methods. For regression-based methods it requires assuming an equal effect across adjacent levels, but the alternative of using them as factors means you give up the ordering altogether, and you end up with estimates on the dummies rather than on the variable as a whole.
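
To make the trade-off concrete, here is a minimal base R sketch comparing the design matrix produced by each encoding. The level order Low < Normal < High is an assumption for illustration, and the default treatment contrasts are assumed for the factor case.

```r
lvls <- c("Low", "Normal", "High")  # assumed ordering
x <- c("Low", "Normal", "High", "Normal", "Low", "High")

# Integer coding: a single column, so a regression estimates one slope
mm_int <- model.matrix(~ x_int, data.frame(x_int = match(x, lvls)))

# Unordered factor: one dummy column per non-reference level
mm_fac <- model.matrix(~ x_fac, data.frame(x_fac = factor(x, levels = lvls)))

colnames(mm_int)  # "(Intercept)" "x_int"
colnames(mm_fac)  # "(Intercept)" "x_facNormal" "x_facHigh"
```

The integer version yields one coefficient for the variable as a whole, while the factor version yields a separate estimate for each dummy and carries no information about the order of the levels.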

I'd be inclined to treat them as numerics for imputation too. That's easy -- we can convert them at the top of functions, and if we want to use them as numeric predictors, we need numeric values imputed.
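
A rough sketch of what that conversion could look like in base R follows. The level order is an assumption, and the rounded-median fill is only a stand-in for whatever numeric imputation method is actually used; none of this reflects how hcai_impute is implemented.

```r
# Assumed level order for the example column: Low < Normal < High
lvls <- c("Low", "Normal", "High")
x <- c("Low", NA, "High", "Normal")

# Ordered factor -> underlying integer codes (NA stays NA)
x_ord <- factor(x, levels = lvls, ordered = TRUE)
x_int <- as.integer(x_ord)

# Stand-in for a numeric imputation step: fill NA with the rounded median
x_filled <- x_int
x_filled[is.na(x_filled)] <- round(median(x_int, na.rm = TRUE))

# Map the codes back to the ordered factor, so no new categories appear
x_back <- factor(lvls[x_filled], levels = lvls, ordered = TRUE)
x_back
```

A real numeric imputer can return non-integer or out-of-range values, so in practice the result would need rounding and clamping to `1:length(lvls)` before mapping back to labels.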

michaellevy • Jan 30 '18 23:01