mlr3filters

FilterAUC and missing values

Open mllg opened this issue 6 years ago • 9 comments

FilterAUC operates on features with missing values by just ranking the missing values last (default in rank()). I'm not sure that this is statistically sound.

I'd suggest removing them and calculating the AUC on the remaining observations.
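To make the difference concrete, here is a minimal base-R sketch. It assumes a rank-based (Mann-Whitney) AUC; `rank_auc` is a hypothetical helper written for illustration, not the actual FilterAUC code, and `na.last = TRUE` is passed explicitly so the NA handling is visible.

```r
# Rank-based AUC (Mann-Whitney U statistic scaled to [0, 1]).
# Hypothetical helper for illustration, not the FilterAUC implementation.
rank_auc <- function(x, is_pos) {
  r <- rank(x, na.last = TRUE)  # NAs receive the highest ranks
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy feature: two observations of the negative class are missing.
x      <- c(0.1, 0.2, NA, 0.8, NA, 0.9)
is_pos <- c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)

# Variant 1: keep the NAs and let rank() put them last.
auc_na_last <- rank_auc(x, is_pos)

# Variant 2: drop the NAs and score the complete cases only.
keep <- !is.na(x)
auc_complete <- rank_auc(x[keep], is_pos[keep])

print(c(na_last = auc_na_last, complete_cases = auc_complete))
```

On this toy data the first variant gives 0.5 and the second gives 1, because ranking NAs last treats the two missing negatives as if they had the largest feature values.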

@berndbischl @pat-s ?

mllg avatar Oct 30 '19 18:10 mllg

Could this also apply to more filters than just FilterAUC?

pat-s avatar Oct 30 '19 20:10 pat-s

Why don't we throw an error if NAs are there? Also, we seem to be missing a generic test, and we need to clearly document / decide what happens in these cases for all filters.

berndbischl avatar Oct 30 '19 22:10 berndbischl

I guess ignoring the observations with NAs in the calculation is "cleanest" and most robust, as Michel suggested. But we should probably then do this in one global place, unit test it properly, and also document it visibly.

berndbischl avatar Oct 30 '19 23:10 berndbischl

a3e43f9ebe9f79059bfc8b423f8583a7b3c12a94 replaced Metrics and ignores NAs, but we still need to add tests and check the behaviour of other filters.

mllg avatar Nov 01 '19 18:11 mllg

Ignoring NAs is actually wrong, after thinking about this more. Please don't merge / release this without further discussion.

berndbischl avatar Nov 01 '19 20:11 berndbischl

If you have a feature with 98% missing values, and for the remainder there is a high or perfect correlation with the target, that feature would get a very high score. That's wrong?
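A small base-R sketch of that scenario, assuming a rank-based AUC computed on complete cases only (the data and variable names are made up for illustration):

```r
# 100 observations, balanced binary target.
n <- 100
is_pos <- rep(c(TRUE, FALSE), each = n / 2)

# Feature with 98% missing values: only one positive and one negative observed.
x <- rep(NA_real_, n)
x[1] <- 1  # observed positive
x[n] <- 0  # observed negative
frac_missing <- mean(is.na(x))

# Drop the NAs and compute the rank-based AUC on what is left.
keep <- !is.na(x)
r <- rank(x[keep])
n_pos <- sum(is_pos[keep])
n_neg <- sum(!is_pos[keep])
auc <- (sum(r[is_pos[keep]]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(c(frac_missing = frac_missing, auc = auc))
```

Here the AUC comes out as 1 although it rests on just two observed values, which is the kind of inflated score described above.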

NAs should be an error for filters. And users should transparently impute them.

Agreed?

berndbischl avatar Nov 01 '19 20:11 berndbischl

@mllg Looking at ?mlr3measures::auc(), the NA value is NaN. Is this something you added in the meantime which fixes the initial issue or does this have the same effect? (i.e. ordering the NaN features last).

pat-s avatar Oct 19 '20 13:10 pat-s

> @mllg Looking at ?mlr3measures::auc(), the NA value is NaN. Is this something you added in the meantime which fixes the initial issue or does this have the same effect? (i.e. ordering the NaN features last).

NaN is the return value if you cannot calculate the measure (div/0 etc). Having NA in truth or response always results in an error.

For filters, Bernd suggested throwing an error. I assume this is the safest way to deal with this. If we want to allow missing values by just removing them (as FilterVariance currently does), this should not be the default behavior.

mllg avatar Oct 19 '20 20:10 mllg

Right now it seems like NAs are removed prior to score calculation:

https://github.com/mlr-org/mlr3filters/blob/13988d396181c3d75cb344f5ce1f11d11e5a3910/R/FilterAUC.R#L42-L46

I would add an assertion that checks for NAs in any feature and apply it to every filter, with a descriptive error message telling the user to impute these values with a PipeOp?
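Something along these lines, as a rough base-R sketch. The helper name `assert_no_missing_features`, its arguments, and the message wording are all assumptions for illustration; it works on a plain data.frame of features and is not tied to the actual mlr3filters internals.

```r
# Hypothetical assertion: fail early if any feature column contains NAs.
# `data` is a data.frame of features, `filter_id` is only used in the message.
assert_no_missing_features <- function(data, filter_id) {
  n_missing <- vapply(data, function(col) sum(is.na(col)), integer(1))
  bad <- names(n_missing)[n_missing > 0]
  if (length(bad) > 0) {
    stop(sprintf(
      "Filter '%s' does not support missing values, but found NAs in: %s. Impute them first, e.g. with an imputation PipeOp.",
      filter_id, paste(bad, collapse = ", ")
    ), call. = FALSE)
  }
  invisible(data)
}

# Usage: the second column has a missing value, so the check fails.
features <- data.frame(a = c(1, 2, 3), b = c(0.5, NA, 1.5))
msg <- tryCatch(
  {
    assert_no_missing_features(features, "auc")
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

Calling this once at the start of each filter's score calculation would keep the behaviour consistent across filters and leave the imputation step explicit for the user.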

pat-s avatar Oct 20 '20 05:10 pat-s