woodwork Replace float64 with Float64Dtype in pandas

pandas has merged the new nullable float dtypes (Float32Dtype and Float64Dtype)
- https://github.com/pandas-dev/pandas/pull/34307
It will be in the 1.2.0 release
- https://github.com/pandas-dev/pandas/blob/master/doc/source/whatsnew/v1.2.0.rst#experimental-nullable-data-types-for-float-data
Once it is released, Woodwork can adopt it. This would move us closer to having a single representation of missing values in DataTable.
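
For context, here is a minimal sketch of the difference in plain pandas (assuming pandas >= 1.2.0; the variable names are only illustrative and this does not touch any Woodwork code). The legacy float64 dtype stores a missing value as NaN, while the nullable Float64 dtype stores it as pd.NA, the same marker the nullable integer and boolean dtypes use.

```python
import numpy as np
import pandas as pd

values = [1.5, None, 3.0]

# Legacy NumPy-backed dtype: the missing entry becomes NaN
legacy = pd.Series(values, dtype="float64")

# Nullable extension dtype (pandas >= 1.2.0): the missing entry becomes pd.NA
nullable = pd.Series(values, dtype="Float64")

legacy_missing = legacy.iloc[1]
nullable_missing = nullable.iloc[1]

print(legacy.dtype, repr(legacy_missing))
print(nullable.dtype, repr(nullable_missing))

# Both dtypes report the entry as missing through the same pandas API
print(legacy.isna().sum(), nullable.isna().sum())
```

Both series agree on which entries are missing via `isna()`; only the stored marker and the dtype differ.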

Oct 14 '20 22:10 gsheni

EvalML is currently adding support for pandas 1.2.0

Feb 02 '21 16:02 gsheni

EvalML now supports pandas 1.2.0: https://github.com/alteryx/evalml/commit/9576d5da195cfc46767e664c5562833a4dd5b83b#diff-e7031ce8aee6d7dc175631195661f5f893bfa3614e5f63ec93c15d2d59235667L2

Mar 17 '21 19:03 gsheni

Blocked until Koalas fixes the 1.2.0 restriction: https://github.com/databricks/koalas/issues/2137

Apr 15 '21 20:04 gsheni

One thought is that instead of changing the underlying dtype for Double, we could add a Logical Type DoubleNullable with a dtype of Float64Dtype. We would keep Double and float64 as is, and keep Double as the default inferred type. So a user would have to explicitly set DoubleNullable for a column.

This way we avoid causing downstream problems with the new Float64Dtype.
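
A rough sketch of the opt-in idea, using plain pandas only. The `LOGICAL_TYPE_DTYPES` mapping and the `apply_logical_type` helper are hypothetical names made up for illustration; they are not Woodwork's actual API.

```python
import pandas as pd

# Hypothetical mapping from logical type name to pandas dtype string.
# Illustration only; not how Woodwork defines its logical types.
LOGICAL_TYPE_DTYPES = {
    "Double": "float64",          # current behavior, stays the default
    "DoubleNullable": "Float64",  # proposed opt-in type
}


def apply_logical_type(series, logical_type="Double"):
    """Cast a series to the dtype associated with the given logical type."""
    return series.astype(LOGICAL_TYPE_DTYPES[logical_type])


col = pd.Series([0.5, None, 2.0])

# Default path: nothing changes for existing users
default_col = apply_logical_type(col)

# Explicit opt-in: the column is backed by the nullable Float64 dtype
opt_in_col = apply_logical_type(col, "DoubleNullable")

print(default_col.dtype, opt_in_col.dtype)
```

Under this sketch, downstream code only ever sees Float64Dtype when a user asks for it by name.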

Thoughts @freddyaboulton @thehomebrewnerd @tamargrey ?

Apr 29 '21 19:04 gsheni

I thought about this a bit as well. My main hesitation is that I'm not sure the Double and DoubleNullable names work quite as cleanly as Integer and IntegerNullable as both double logical types would be able to accept null values. Not sure what would be better at the moment though.
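
To illustrate the naming asymmetry, here is a quick pandas-only sketch (assuming pandas >= 1.2.0; the variable names are illustrative). int64 rejects missing values outright, so Integer vs. IntegerNullable describes a real difference, whereas float64 and Float64 both accept them.

```python
import pandas as pd

int_data = [1, None, 3]
float_data = [1.0, None, 3.0]

# int64 cannot represent a missing value, so construction fails
try:
    pd.Series(int_data, dtype="int64")
    int64_accepts_nulls = True
except (TypeError, ValueError):
    int64_accepts_nulls = False

# The nullable integer dtype accepts the missing value
nullable_int = pd.Series(int_data, dtype="Int64")

# Both float dtypes accept the missing value
legacy_float = pd.Series(float_data, dtype="float64")
nullable_float = pd.Series(float_data, dtype="Float64")

print(int64_accepts_nulls)
print(nullable_int.hasnans, legacy_float.hasnans, nullable_float.hasnans)
```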

I also wonder if we should do this now or just wait a while longer until the Float64Dtype is no longer problematic? Maybe once the downstream problems are resolved (assuming we get to that point) we could just make one update to change Integer, Boolean and Double to all use the new dtypes and drop the old non-nullable versions? It would be a bit strange to leave the double version out in the short term though, since we have support for the others.

I'm rambling a bit...which means I'm undecided and don't have a strong opinion either way...

Apr 29 '21 19:04 thehomebrewnerd

I think I'd vote for not having Float64Dtype at all over having a Double and DoubleNullable. Maybe there's another name that better describes the relationship between the two potential Logical Types?

Double and DoubleNewDtype is definitely a different naming convention, though, and I don't think it's the end of the world to not have any logical type that uses the new dtype.

Apr 29 '21 21:04 tamargrey

Alright, let's icebox this for now and close the Float64Dtype MR. It may cause unnecessary problems downstream, and we can revisit once downstream libraries update to support this new dtype.

We can also revisit if we find a compelling use-case for it.

Apr 29 '21 21:04 gsheni
