healthcareai-r
step to transform columns with mostly missing values to a factor
Say there is a test that is rarely administered, so its results sit in a column that is 80% null. The result itself may matter less than the fact that the test was performed at all. You wouldn't want to impute that column, because imputation would erase the information about whether or not the test was performed.
Create a step_mostly_missing_to_factor, following the format of step_hcai_missing, to find columns with mostly NA values and replace them with a binary Y/N column.
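As a rough illustration of the "find columns with mostly NA values" part, here is a minimal base-R sketch. The toy data frame, the column names, and the 0.8 cutoff are assumptions made for the example; this is not code from healthcareai.

```r
# Toy data (hypothetical): lactate is recorded for only 1 of 10 patients.
d <- data.frame(
  age     = c(34, 51, NA, 47, 62, 29, 58, 40, 73, 66),
  lactate = c(NA, 2.1, NA, NA, NA, NA, NA, NA, NA, NA)
)

# Proportion of NA values in each column
prop_missing <- colMeans(is.na(d))

# Columns at or above an assumed 80% missingness cutoff
mostly_missing <- names(prop_missing)[prop_missing >= 0.8]
mostly_missing
```

Here only lactate is flagged; age has a single missing value and would be left for ordinary imputation.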
@NateGarrettHC how does this one look? @mmastand mentioned that this might be a good one for you to work on.
@glenrs @mmastand Sounds good! I'll start working on it.
@mmastand Hey Mike, a couple questions about this:
First, what threshold should count as "mostly missing"? Do you want it to be 80% like in your example? Or do you want it to work like step_hcai_missing and be offered as an impute option for any amount of missingness? I see a placeholder in prep_data for this step, so do you want it to be something outside impute? Does it need an added parameter in prep_data so people can toggle it on or off? What were you thinking?
Second, do you want the values in the column to be replaced in place, so that non-null values become "Y" and null values become "N"? Or do you want that column removed and a new one created? I wasn't sure what you meant by "replace them with a binary Y/N column". Replace the values in the column, or the column itself?
To flesh this out more, I think the best way to do it would be:
- Added parameter to prep_data
  - Allow user to specify TRUE or the value, default to 80% missing
  - Can get set to FALSE to turn off
  - Happens before imputation
- New column is created, named column_name_present (example: lactate_present).
- Non-null values go to "Y", null goes to "N"
- Original column is removed.
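
The behavior in the list above could be sketched in base R roughly as follows. This is only an illustration of the proposed semantics, not the recipes step itself: the function name mostly_missing_to_factor, its threshold argument, and the toy data are all hypothetical, and a real implementation would follow the prep/bake structure that step_hcai_missing uses.

```r
# Hypothetical helper, not part of healthcareai: turns mostly-missing
# columns into "<name>_present" Y/N factors and drops the originals.
mostly_missing_to_factor <- function(d, threshold = 0.8) {
  # FALSE turns the step off; TRUE means "use the default 80% threshold"
  if (isFALSE(threshold)) return(d)
  if (isTRUE(threshold)) threshold <- 0.8

  # Proportion of NA values per column, then the columns over the threshold
  prop_missing <- colMeans(is.na(d))
  targets <- names(prop_missing)[prop_missing >= threshold]

  for (col in targets) {
    # Non-null values go to "Y", null goes to "N"
    d[[paste0(col, "_present")]] <- factor(
      ifelse(is.na(d[[col]]), "N", "Y"),
      levels = c("Y", "N")
    )
    d[[col]] <- NULL  # remove the original column
  }
  d
}

# Toy data (hypothetical): lactate is recorded for only 1 of 10 patients.
d <- data.frame(
  age     = c(34, 51, NA, 47, 62, 29, 58, 40, 73, 66),
  lactate = c(NA, 2.1, NA, NA, NA, NA, NA, NA, NA, NA)
)

out <- mostly_missing_to_factor(d)
names(out)
as.character(out$lactate_present)
```

In this sketch lactate is replaced by lactate_present, while age keeps its single NA, since the conversion is meant to run before imputation. Calling mostly_missing_to_factor(d, threshold = FALSE) returns the data unchanged.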