dask-ml
Fix for issue #567
SimpleImputer now tries to convert non-numeric columns of dataframes to numeric types. This more closely matches the behavior of sklearn's SimpleImputer and provides a clearer error if non-numeric data cannot be converted.
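For readers following along, here is a minimal sketch of the kind of conversion described above. It is not the code from this PR: the helper name `coerce_to_numeric` is hypothetical, and it uses plain pandas (`pd.to_numeric`) rather than anything specific to dask-ml.

```python
import pandas as pd


def coerce_to_numeric(df):
    """Hypothetical helper: convert every column to a numeric dtype.

    Raises a ValueError naming the offending column when conversion fails.
    """
    converted = {}
    for name in df.columns:
        try:
            # pd.to_numeric parses strings such as "1" and leaves
            # already-numeric columns unchanged.
            converted[name] = pd.to_numeric(df[name])
        except (ValueError, TypeError) as exc:
            raise ValueError(
                f"Cannot impute column {name!r}: it contains non-numeric "
                "data that could not be converted to a numeric type."
            ) from exc
    return pd.DataFrame(converted, index=df.index)
```

The point of wrapping the conversion is the error message: the user learns which column failed instead of seeing a low-level parsing error.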
Thanks for the PR @Bonesters!
> more closely matching sklearn's SimpleImputer's behavior, and provides a more clear error if non-numeric data cannot be converted.
Could you add a test that confirms this? That will make sure future versions have this behavior.
(this will close #567 as mentioned in the title)
I think I have the test working. It looks like the behavior difference was only happening when passed a dataframe.
It should be all set now.
Still thinking through this...
I don't particularly like the thought of calling .astype without the user asking us to. I see that's what scikit-learn does, but that's likely due to their limited support for pandas. In particular, I'm concerned that casting to a numeric dtype with .astype will mess with extension arrays that support reductions.
What if we instead check that the size of statistics / avg match the number of input columns? That would solve the issue of columns silently being dropped because of pandas behavior.
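A rough sketch of that alternative, assuming the statistics come back as a pandas Series indexed by column name. The helper name `check_statistics_match_columns` is hypothetical, and `numeric_only=True` is passed only to reproduce the column dropping on any pandas version:

```python
import pandas as pd


def check_statistics_match_columns(statistics, columns):
    """Hypothetical check: one statistic must exist per input column."""
    if len(statistics) != len(columns):
        # Columns with no statistic were dropped by the reduction.
        dropped = [c for c in columns if c not in statistics.index]
        raise ValueError(
            f"Expected {len(columns)} statistics but got {len(statistics)}; "
            f"no statistic was computed for columns {dropped}."
        )
    return statistics


df = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", "y", "z"]})
# numeric_only=True drops the non-numeric column "b" from the result.
stats = df.mean(numeric_only=True)
```

Calling `check_statistics_match_columns(stats, df.columns)` on this frame raises instead of letting `b` disappear, and no dtype is cast along the way, so extension arrays are left untouched.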