Differences in type inference between pandas, dask, and koalas

Open davesque opened this issue 3 years ago • 0 comments

I recently ran into an issue with this section of code in TypeSystem.infer_logical_type: https://github.com/alteryx/woodwork/blob/0c3978f1d2ce1e1c6433421ca172330277a5f1a0/woodwork/type_sys/type_system.py#L260-L263

The problem was that this test was initially failing when I added it, but only for dask: https://github.com/alteryx/woodwork/blob/0c3978f1d2ce1e1c6433421ca172330277a5f1a0/woodwork/tests/type_system/test_ltype_inference.py#L95

I eventually realized that the test failed because the record I was hoping to detect ended up in the second partition of the test dask series. Since infer_logical_type discards all but the first partition of a dask series when determining the pandas data to use for type inference, the offending record never made it into the inference series, so the above test, which checks for inference failure, didn't pass.
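
To illustrate the effect, here is a minimal sketch in plain Python. It does not use dask or woodwork; split_into_partitions and looks_like_integers are hypothetical stand-ins for a partitioned series and for an inference check, and the data is made up for the example.

```python
# Hypothetical stand-ins: these are NOT woodwork or dask APIs.
def split_into_partitions(values, npartitions):
    """Split a list into contiguous chunks, roughly like a partitioned series."""
    size = -(-len(values) // npartitions)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def looks_like_integers(values):
    """Toy inference check: True only if every value parses as an int."""
    try:
        for value in values:
            int(value)
    except ValueError:
        return False
    return True


# The offending record sits at the end, so it lands in the second partition.
data = ["1", "2", "3", "4", "5", "not a number"]
partitions = split_into_partitions(data, npartitions=2)

# Inference over the full data sees the bad record.
full_result = looks_like_integers(data)

# Inference over only the first partition never sees it.
first_partition_result = looks_like_integers(partitions[0])

print(full_result, first_partition_result)
```

With this data the two checks disagree, which is the same kind of mismatch the dask test ran into.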

More generally, it seems confusing that the inference processes for pandas, dask, and koalas can yield different results for the same dataset. Dask and Spark presumably organize their data internally in such a way that it should be possible to retrieve the same records from a similarly partitioned dataset in either a dask or a Spark cluster. However, I'm less familiar with the koalas API, and koalas might not even give access to its data by partition.

Anyhow, just thought I'd flag the issue here in case we want to eventually take a closer look at it.

davesque • Jul 13 '21 17:07