Replace float64 with Float64Dtype in pandas

Open gsheni opened this issue 4 years ago • 7 comments

  • pandas has merged the new nullable float dtypes (Float32Dtype and Float64Dtype)
    • https://github.com/pandas-dev/pandas/pull/34307
  • It will be in the 1.2.0 release
    • https://github.com/pandas-dev/pandas/blob/master/doc/source/whatsnew/v1.2.0.rst#experimental-nullable-data-types-for-float-data
  • Once it is released, Woodwork can adopt it. This moves us closer to having a single representation of NaN in DataTable.
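
For reference, a minimal sketch of the difference between the two dtypes, assuming pandas 1.2.0 or later is installed. The variable names are illustrative only and nothing here is Woodwork code.

```python
import numpy as np
import pandas as pd

values = [1.5, None, 3.0]

# Classic NumPy-backed float64: the missing entry is stored as NaN.
classic = pd.Series(values, dtype="float64")

# Nullable extension dtype: the missing entry is stored as pd.NA.
nullable = pd.Series(values, dtype="Float64")

print(classic.dtype, nullable.dtype)
print(np.isnan(classic[1]))   # the gap is a float NaN
print(nullable[1] is pd.NA)   # the gap is the pd.NA singleton
```

Both series report the same positions from isna(); only the missing-value marker and the dtype differ.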

gsheni avatar Oct 14 '20 22:10 gsheni

EvalML is currently adding support for pandas 1.2.0

gsheni avatar Feb 02 '21 16:02 gsheni

EvalML now supports pandas 1.2.0: https://github.com/alteryx/evalml/commit/9576d5da195cfc46767e664c5562833a4dd5b83b#diff-e7031ce8aee6d7dc175631195661f5f893bfa3614e5f63ec93c15d2d59235667L2

gsheni avatar Mar 17 '21 19:03 gsheni

Blocked until Koalas fixes the 1.2.0 restriction: https://github.com/databricks/koalas/issues/2137

gsheni avatar Apr 15 '21 20:04 gsheni

One thought is that instead of changing the underlying dtype for Double, we could add a Logical Type DoubleNullable with a dtype of Float64Dtype. We would keep Double with float64 as is, and it would remain the default inferred type. A user would then have to explicitly set DoubleNullable for a column.

This way we avoid causing downstream problems with the new Float64Dtype.
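
A rough sketch of what that opt-in could look like, using plain pandas. The PHYSICAL_DTYPES mapping and the apply_logical_type helper are hypothetical names made up for illustration; they are not Woodwork's actual API.

```python
import pandas as pd

# Hypothetical mapping from logical type name to pandas dtype (an assumption,
# not how Woodwork stores this).
PHYSICAL_DTYPES = {
    "Double": "float64",          # existing type, stays the inferred default
    "DoubleNullable": "Float64",  # proposed type, only used when requested
}


def apply_logical_type(series, logical_type="Double"):
    """Cast a float column to the dtype backing the chosen logical type."""
    return series.astype(PHYSICAL_DTYPES[logical_type])


col = pd.Series([0.5, None, 2.0])

default_col = apply_logical_type(col)                   # stays float64
opt_in_col = apply_logical_type(col, "DoubleNullable")  # becomes Float64

print(default_col.dtype, opt_in_col.dtype)
```

With this shape, downstream code only ever sees Float64Dtype when a user has asked for it explicitly.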

Thoughts @freddyaboulton @thehomebrewnerd @tamargrey ?

gsheni avatar Apr 29 '21 19:04 gsheni

I thought about this a bit as well. My main hesitation is that I'm not sure the Double and DoubleNullable names work quite as cleanly as Integer and IntegerNullable, since both double logical types would be able to accept null values. Not sure what would be better at the moment though.
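
To illustrate the naming concern, a small pandas-only sketch (assuming pandas 1.2.0 or later; can_cast is a hypothetical helper): int64 rejects a missing value while Int64 accepts it, but both float64 and Float64 accept it.

```python
import numpy as np
import pandas as pd

# A float64 column where NaN marks the missing entry.
floats_with_gap = pd.Series([1.0, np.nan, 3.0])


def can_cast(series, dtype):
    """Return True if the series, including its missing value, casts to dtype."""
    try:
        series.astype(dtype)
        return True
    except (TypeError, ValueError):
        return False


results = {
    dtype: can_cast(floats_with_gap, dtype)
    for dtype in ["int64", "Int64", "float64", "Float64"]
}
print(results)
```

Only the int64 cast fails, so "Nullable" separates the two integer types but not the two float types.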

I also wonder if we should do this now or just wait a while longer until the Float64Dtype is no longer problematic? Maybe once the downstream problems are resolved (assuming we get to that point) we could make a single update that changes Integer, Boolean and Double to all use the new dtypes and drops the old non-nullable versions? It would be a bit strange to leave the double version out in the short term though, since we have support for the others.

I'm rambling a bit...which means I'm undecided and don't have a strong opinion either way...

thehomebrewnerd avatar Apr 29 '21 19:04 thehomebrewnerd

I think I'd vote for not having Float64Dtype at all over having a Double and DoubleNullable. Maybe there's another name that better describes the relationship between the two potential Logical Types?

Double and DoubleNewDtype is definitely a different naming convention, though, and I don't think it's the end of the world to not have any logical type that uses the new dtype.

tamargrey avatar Apr 29 '21 21:04 tamargrey

Alright, let's icebox this for now and close the Float64Dtype MR. It may cause unnecessary problems downstream, and we can revisit once downstream libraries update to support this new dtype.

We can also revisit if we find a compelling use-case for it.

gsheni avatar Apr 29 '21 21:04 gsheni