pyod
How to tackle missing values and new category values in input
Hi, thanks for your contribution of pyod. It helps a lot. It seems that the framework cannot handle missing values. When I use OrdinalEncoder to encode the input, missing values and unknown values are set to np.nan, and pyod then either outputs all NaN values or raises the error: ValueError: Input contains NaN, infinity or a value too large for dtype('float32'). I have tried ECOD and IForest. Could you help me solve the problem, or tell me which algorithms in pyod can handle it?
You can replace it with 0 or whichever value you want: https://numpy.org/doc/stable/reference/generated/numpy.nan_to_num.html
Just like sklearn, pyod does not handle missing values. Users are welcome to convert the data to a fully numerical format themselves :)
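A minimal sketch of the np.nan_to_num suggestion above. The array `X` is made-up example data, not anything from the original question, and the fill value 0.0 is only a placeholder; as discussed further down the thread, 0 may already be taken by a real category.

```python
import numpy as np

# Hypothetical encoded input with two missing entries.
X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 4.0]])

# Replace every NaN with a chosen fill value (here 0.0).
# Note: by default nan_to_num also maps +/-inf to very large finite numbers.
X_filled = np.nan_to_num(X, nan=0.0)

print(X_filled)
```

The filled array contains only finite values, so it can be passed on to a detector's fit method.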
Thanks for your reply. Since zero is already used for another category, I tried -1 as the missing value. But I see that the predictions (clf.decision_function) are almost all negative values. Is that OK?
I have also tried np.inf, but I get the following error: ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
Missing value handling is a whole area of research. The right approach really depends on your use case.
If "data is missing" is not a necessary feature for your problem, I would recommend replacing the missing values with some mean/mode/median in order to not bias the algorithm towards these missing features.
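As a rough illustration of the imputation idea above, here is a sketch that fills each column's missing entries with that column's median, using only NumPy. The data is invented for the example. For ordinal-encoded categories the mode may be a more meaningful choice than the median, and scikit-learn's SimpleImputer offers these strategies as a ready-made transformer.

```python
import numpy as np

# Hypothetical feature matrix with missing entries.
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [np.nan, 30.0],
              [4.0, 20.0]])

# Per-column median, ignoring NaNs.
col_median = np.nanmedian(X, axis=0)

# Fill each missing entry with the median of its own column.
X_imputed = np.where(np.isnan(X), col_median, X)

print(X_imputed)
```

In practice the medians should be computed on the training data only and reused when transforming new data.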
If "data is missing" is a feature that may be useful for your algorithm, I would recommend using some kind of token. This token could be an impossible value (like -1, if only positive values are possible). Take care that you do not significantly change the feature distribution, as the algorithm will tend to focus on this instead. For example, if you have a feature normally distributed around 100 with a variance of only 1, I would not use -100 as a token. This is especially important for gradient-based methods (like neural networks).
However, these are just opinionated proposals and these techniques are not part of the package, as the package expects preprocessed values.
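The token approach described above could look roughly like this. The encoded values and the name `MISSING_TOKEN` are hypothetical. The extra 0/1 indicator columns are a variation not mentioned in the thread, shown only as one possible way to keep the "is missing" information explicit.

```python
import numpy as np

# Hypothetical ordinal-encoded categories; valid codes are 0, 1, 2, ...
codes = np.array([[0.0, 2.0],
                  [1.0, np.nan],
                  [np.nan, 0.0]])

MISSING_TOKEN = -1.0  # an impossible value for non-negative codes

# Remember where values were missing, then fill those spots with the token.
missing_mask = np.isnan(codes)
X_token = np.where(missing_mask, MISSING_TOKEN, codes)

# Variation (assumption, not from the thread): append one 0/1 flag per column
# so "missingness" is an explicit feature.
X_with_flags = np.hstack([X_token, missing_mask.astype(float)])

print(X_with_flags)
```

Choosing a token close to the valid range, such as -1 for non-negative codes, keeps the feature distribution from being distorted as much as a far-away value would.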