healthcareai-r

step to transform columns with mostly missing values to a factor

Open mmastand opened this issue 7 years ago • 4 comments

Say there is a test that is rarely administered, and its results are in a column that is 80% null. The result itself may matter less than the fact that the test was performed. You wouldn't want to impute that column, because you would lose the information about whether or not the test was performed.

Create a step_mostly_missing_to_factor, following the format of step_hcai_missing, that finds columns with mostly NA values and replaces each of them with a binary Y/N column.

mmastand, Jan 08 '18 22:01

@NateGarrettHC how does this one look? @mmastand mentioned that this might be a good one for you to work on.

glenrs, Oct 04 '18 21:10

@glenrs @mmastand Sounds good! I'll start working on it.

NateGarrettHC, Oct 05 '18 15:10

@mmastand Hey Mike, a couple of questions about this.

First, what threshold should count as "mostly missing"? Should it be 80%, as in your example? Or should it work like step_hcai_missing and be offered as an impute option for any amount of missingness? I see a placeholder in prep_data for this step, so do you want it to live outside impute? Does it need its own parameter in prep_data so people can toggle it on or off? What were you thinking?

Second, do you want the values in the column to be replaced, so that non-null values become "Y" and null values become "N"? Or do you want that column removed and a new one created? I wasn't sure what you meant by "replace them with a binary Y/N column": replace the values in the column, or the column itself?

NateGarrettHC, Oct 16 '18 15:10

To flesh this out more, I think the best way to do it would be:

  • Add a parameter to prep_data
  • Allow the user to pass TRUE or a threshold value, defaulting to 80% missing
  • It can be set to FALSE to turn the step off
  • The step happens before imputation
  • A new column is created, named column_name_present (example: lactate_present)
  • Non-null values become "Y", null values become "N"
  • The original column is removed
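
To make the list above concrete, here is a rough base-R sketch of the column transformation only. It is not the recipes step or the prep_data wiring. The function name `mostly_missing_to_factor`, the `threshold` argument, and treating "mostly missing" as a missing proportion at or above the threshold are all assumptions for illustration, not existing healthcareai code.

```r
# Hypothetical helper, not part of healthcareai: replace mostly-missing
# columns with a Y/N "<column>_present" factor.
mostly_missing_to_factor <- function(d, threshold = 0.8) {
  # FALSE turns the step off; TRUE means "use the default threshold"
  if (isFALSE(threshold)) return(d)
  if (isTRUE(threshold)) threshold <- 0.8
  stopifnot(is.numeric(threshold), length(threshold) == 1,
            threshold >= 0, threshold <= 1)

  # Proportion of NA values in each column
  prop_missing <- vapply(d, function(x) mean(is.na(x)), numeric(1))
  # Assumption: "mostly missing" means proportion missing >= threshold
  mostly_missing <- names(prop_missing)[which(prop_missing >= threshold)]

  for (col in mostly_missing) {
    # Non-null values become "Y", null values become "N"
    d[[paste0(col, "_present")]] <-
      factor(ifelse(is.na(d[[col]]), "N", "Y"), levels = c("N", "Y"))
    # The original column is removed
    d[[col]] <- NULL
  }
  d
}

# Example: lactate is missing in 5 of 6 rows, age is complete
d <- data.frame(
  age = c(50, 61, 47, 72, 38, 55),
  lactate = c(NA, NA, 2.1, NA, NA, NA)
)
out <- mostly_missing_to_factor(d)
```

Because the original column is dropped before imputation would run, only the Y/N indicator remains and there is nothing left in that column to impute.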

mmastand, Nov 13 '18 18:11