woodwork
Replace float64 with Float64Dtype in pandas
- pandas has merged the new nullable float dtypes (Float32Dtype and Float64Dtype)
- https://github.com/pandas-dev/pandas/pull/34307
- It will be in the 1.2.0 release
- https://github.com/pandas-dev/pandas/blob/master/doc/source/whatsnew/v1.2.0.rst#experimental-nullable-data-types-for-float-data
- Once it is released, Woodwork can adopt it. This would move us closer to having a single representation of NaN in DataTable
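For context, here is a minimal sketch of the difference between the two dtypes, assuming pandas 1.2.0 or later is installed. The variable names are illustrative only and nothing here is Woodwork code.

```python
import numpy as np
import pandas as pd

# Classic NumPy-backed column: missing values are stored as np.nan.
classic = pd.Series([1.5, None, 3.0], dtype="float64")

# Nullable extension dtype (experimental in pandas 1.2.0): missing values are pd.NA.
nullable = pd.Series([1.5, None, 3.0], dtype="Float64")

print(classic.dtype, nullable.dtype)

# Both columns report exactly one missing value, but the sentinel differs.
print(classic.isna().sum(), nullable.isna().sum())
print(np.isnan(classic[1]), nullable[1] is pd.NA)
```

The same pd.NA sentinel is what the nullable integer and boolean dtypes use, which is why adopting Float64Dtype would bring the numeric columns closer to one missing-value representation.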
EvalML is currently adding support for pandas 1.2.0
EvalML now supports pandas 1.2.0: https://github.com/alteryx/evalml/commit/9576d5da195cfc46767e664c5562833a4dd5b83b#diff-e7031ce8aee6d7dc175631195661f5f893bfa3614e5f63ec93c15d2d59235667L2
Blocked until Koalas fixes the 1.2.0 restriction: https://github.com/databricks/koalas/issues/2137
One thought is that instead of changing the underlying dtype for Double, we could add a Logical Type DoubleNullable with a dtype of Float64Dtype. We would keep Double and float64 as is, and make it the default inferred type. So a user would have to explicitly set DoubleNullable for a column.
This way we avoid causing downstream problems with the new Float64Dtype.
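A rough sketch of what this opt-in behavior could look like, written against plain pandas rather than Woodwork's actual API. The mapping and the helper functions `infer_logical_type` and `set_logical_type` are hypothetical and exist only to illustrate the proposal.

```python
import pandas as pd

# Hypothetical mapping from logical type name to pandas dtype.
# This is NOT Woodwork's real implementation, only an illustration.
LOGICAL_TYPE_DTYPES = {
    "Double": "float64",          # existing behavior, stays the default
    "DoubleNullable": "Float64",  # proposed opt-in type
}


def infer_logical_type(series):
    """Hypothetical inference: float columns are always inferred as Double."""
    if pd.api.types.is_float_dtype(series.dtype):
        return "Double"
    return None


def set_logical_type(series, logical_type):
    """Hypothetical setter: cast the column to the dtype of the logical type."""
    return series.astype(LOGICAL_TYPE_DTYPES[logical_type])


col = pd.Series([1.0, None, 2.5])

# Default path: inference never picks the new dtype.
default_col = set_logical_type(col, infer_logical_type(col))

# Explicit path: the user has to ask for DoubleNullable.
opt_in_col = set_logical_type(col, "DoubleNullable")

print(default_col.dtype, opt_in_col.dtype)
```

Under this sketch, downstream libraries only ever see Float64 when a user has requested it for a specific column.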
Thoughts @freddyaboulton @thehomebrewnerd @tamargrey ?
I thought about this a bit as well. My main hesitation is that I'm not sure the Double and DoubleNullable names work quite as cleanly as Integer and IntegerNullable, as both double logical types would be able to accept null values. Not sure what would be better at the moment though.
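To make the naming concern concrete, here is a small check of which dtypes can hold a missing value, assuming pandas 1.2.0 or later. The helper `accepts_nulls` is made up for this illustration.

```python
import pandas as pd


def accepts_nulls(dtype):
    """Return True if a column containing a missing value can be cast to dtype."""
    column = pd.Series([1.0, float("nan"), 3.0])
    try:
        column.astype(dtype)
        return True
    except (TypeError, ValueError):
        # int64 has no way to represent a missing value, so the cast fails.
        return False


results = {dtype: accepts_nulls(dtype) for dtype in ["int64", "Int64", "float64", "Float64"]}
print(results)
```

For integers the "Nullable" suffix marks a real difference in capability (int64 rejects the cast, Int64 accepts it), whereas both float dtypes accept missing values and differ only in the sentinel they use.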
I also wonder if we should do this now or just wait a while longer until the Float64Dtype is no longer problematic? Maybe once the downstream problems are resolved (assuming we get to that point) we could just make one update to change Integer, Boolean and Double to all use the new dtypes and drop the old non-nullable versions? It would be a bit strange to leave the double version out in the short term though, since we have support for the others.
I'm rambling a bit...which means I'm undecided and don't have a strong opinion either way...
I think I'd vote for not having Float64Dtype at all over having a Double and DoubleNullable. Maybe there's another name that better describes the relationship between the two potential Logical Types? Double and DoubleNewDtype is definitely a different naming convention, though, and I don't think it's the end of the world to not have any logical type that uses the new dtype.
Alright, let's icebox this for now and close the Float64Dtype MR. It may cause unnecessary problems downstream, and we can revisit once downstream libraries update to support this new dtype.
We can also revisit if we find a compelling use-case for it.