
Fix for issue #567

Open Bonesters opened this issue 6 years ago • 4 comments

SimpleImputer now tries to convert non-numeric columns of dataframes to numeric types. This more closely matches the behavior of sklearn's SimpleImputer and gives a clearer error if non-numeric data cannot be converted.
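
For illustration only, here is a minimal sketch of the kind of coercion described above. It is not the PR's actual diff: the helper name `coerce_to_numeric` is hypothetical, and it uses `pd.to_numeric` on a plain pandas DataFrame as a stand-in for whatever conversion the PR performs.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def coerce_to_numeric(df):
    """Hypothetical helper: return a copy with every column numeric, or raise."""
    out = df.copy()
    for name in out.columns:
        if is_numeric_dtype(out[name]):
            continue  # already numeric, leave untouched
        try:
            # Assumption: pd.to_numeric stands in for the PR's conversion step.
            out[name] = pd.to_numeric(out[name])
        except (ValueError, TypeError) as exc:
            # Name the offending column instead of failing later in a reduction.
            raise ValueError(
                f"Column {name!r} could not be converted to a numeric dtype"
            ) from exc
    return out


# Strings that look like numbers are converted; missing values are kept as NaN.
convertible = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["1", None, "3"]})
converted = coerce_to_numeric(convertible)

# Strings that are not numbers produce an error that names the column.
bad = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})
try:
    coerce_to_numeric(bad)
except ValueError as exc:
    error_message = str(exc)
```

The point of the sketch is the error path: the failure is reported per column at conversion time rather than surfacing as a confusing result later.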

Bonesters avatar Oct 30 '19 00:10 Bonesters

Thanks for the PR @Bonesters!

more closely matching sklearn's SimpleImputer's behavior, and provides a more clear error if non-numeric data cannot be converted.

Could you add a test that confirms this? That will make sure future versions have this behavior.

(this will close #567 as mentioned in the title)

stsievert avatar Oct 30 '19 01:10 stsievert

I think I have the test working. It looks like the behavior difference was only happening when passed a dataframe.

Bonesters avatar Oct 30 '19 01:10 Bonesters

It should be all set now.

Bonesters avatar Oct 30 '19 04:10 Bonesters

Still thinking through this...

I don't particularly like the thought of calling .astype without the user asking us to. I see that's what scikit-learn does, but that's likely due to their limited support for pandas. In particular, I'm concerned that casting to a numeric dtype will mess with extension arrays that support reductions.

What if we instead check that the size of statistics / avg matches the number of input columns? That would solve the issue of columns silently being dropped because of pandas behavior.
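
A rough sketch of what such a check could look like, under stated assumptions: the function name `check_statistics_cover_columns` is hypothetical, the example uses plain pandas rather than dask, and `numeric_only=True` is passed explicitly so the column drop is reproduced regardless of pandas version.

```python
import pandas as pd


def check_statistics_cover_columns(X, statistics):
    """Hypothetical check: every input column must have a computed statistic."""
    missing = [name for name in X.columns if name not in statistics.index]
    if missing:
        raise ValueError(
            f"No statistic was computed for columns {missing}; "
            f"got {len(statistics)} statistics for {X.shape[1]} input columns"
        )
    return statistics


df = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", "y", None]})

# numeric_only=True makes the reduction skip the non-numeric column "b",
# which is the kind of drop the check is meant to catch.
stats = df.mean(numeric_only=True)

try:
    check_statistics_cover_columns(df, stats)
except ValueError as exc:
    message = str(exc)
```

Compared with casting, this leaves the user's dtypes alone and only reports the mismatch, so extension arrays that already support reductions are unaffected.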

TomAugspurger avatar Oct 30 '19 15:10 TomAugspurger