
to_parquet fails for nullable dtype index

Open bnaul opened this issue 3 years ago • 5 comments

git bisect traced this back to #9131:

import dask.dataframe as dd, pandas as pd

dd.from_pandas(pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string")), npartitions=1).to_parquet("/tmp/to_parquet_test/")


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-10-d12a786a20e0> in <module>
      1 import dask.dataframe as dd, pandas as pd
----> 2 dd.from_pandas(pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string")), npartitions=1).to_parquet("/tmp/to_parquet_test/")

~/model/.venv/lib/python3.9/site-packages/dask/dataframe/core.py in to_parquet(self, path, *args, **kwargs)

~/model/.venv/lib/python3.9/site-packages/dask/dataframe/io/parquet/core.py in to_parquet(df, path, engine, compression, write_index, append, overwrite, ignore_divisions, partition_on, storage_options, custom_metadata, write_metadata_file, compute, compute_kwargs, schema, name_function, **kwargs)

~/model/.venv/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py in initialize_write(cls, df, fs, path, append, partition_on, ignore_divisions, division_info, schema, index_cols, **kwargs)

~/model/.venv/lib/python3.9/site-packages/dask/utils.py in __call__(self, arg, *args, **kwargs)

~/model/.venv/lib/python3.9/site-packages/dask/dataframe/backends.py in get_pyarrow_schema_pandas(obj)

~/model/.venv/lib/python3.9/site-packages/pyarrow/types.pxi in pyarrow.lib.Schema.from_pandas()

~/model/.venv/lib/python3.9/site-packages/pyarrow/pandas_compat.py in dataframe_to_types(df, preserve_index, columns)
    527             type_ = pa.array(c, from_pandas=True).type
    528         elif _pandas_api.is_extension_array_dtype(values):
--> 529             type_ = pa.array(c.head(0), from_pandas=True).type
    530         else:
    531             values, type_ = get_datetimetz_type(values, c.dtype, None)

AttributeError: 'Index' object has no attribute 'head'

Removing dtype="string" fixes the issue.

I can't quite make out whether the mistaken assumption is in dask or pyarrow...? But regardless, since it worked prior to #9131, it seems like dask should be able to work around the issue.
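For context on where the AttributeError comes from: pyarrow's dataframe_to_types treats the index as just another column and calls .head(0) on it, but a pandas Index (unlike a Series) has no head method. A minimal pandas-only illustration:

```python
import pandas as pd

ser = pd.Series(["A", "B"], dtype="string")
idx = pd.Index(["A", "B"], dtype="string")

# pyarrow's dataframe_to_types calls c.head(0) on each "column" it
# inspects; that works for a Series but not for an Index, which is
# what it receives for an extension-dtype index.
print(hasattr(ser, "head"))  # True
print(hasattr(idx, "head"))  # False
```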

cc @jcrist

bnaul avatar Jun 14 '22 16:06 bnaul

Thanks for surfacing @bnaul. I'm able to reproduce using the latest pyarrow=8.0.0 and pandas=1.4.2 releases (with the current dask main branch).

I can't quite make out whether the mistaken assumption is in dask or pyarrow...?

I'm able to reproduce with pyarrow alone (see snippet below), which makes me think the underlying issue is related to pyarrow's handling of DataFrames whose index uses an extension dtype. cc @jorisvandenbossche for visibility

In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string"))

In [4]: pa.Schema.from_pandas(df)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [4], in <cell line: 1>()
----> 1 pa.Schema.from_pandas(df)

File ~/mambaforge/envs/dask/lib/python3.10/site-packages/pyarrow/types.pxi:1663, in pyarrow.lib.Schema.from_pandas()

File ~/mambaforge/envs/dask/lib/python3.10/site-packages/pyarrow/pandas_compat.py:529, in dataframe_to_types(df, preserve_index, columns)
    527     type_ = pa.array(c, from_pandas=True).type
    528 elif _pandas_api.is_extension_array_dtype(values):
--> 529     type_ = pa.array(c.head(0), from_pandas=True).type
    530 else:
    531     values, type_ = get_datetimetz_type(values, c.dtype, None)

AttributeError: 'Index' object has no attribute 'head'

Removing dtype="string" fixes the issue.

This is also true when using pyarrow directly:

In [5]: df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"]))

In [6]: pa.Schema.from_pandas(df)
Out[6]:
a: int64
__index_level_0__: string
-- schema metadata --
pandas: '{"index_columns": ["__index_level_0__"], "column_indexes": [{"na' + 430

jrbourbeau avatar Jun 14 '22 19:06 jrbourbeau

Have a possible workaround I'll push up in a bit...

jrbourbeau avatar Jun 14 '22 19:06 jrbourbeau

Thanks @jrbourbeau. I'm seeing that passing schema=None reverts to the old behavior, so I guess the change to that default argument is really what surfaced this.

bnaul avatar Jun 14 '22 19:06 bnaul

So it turns out that while

df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string"))
pa.Schema.from_pandas(df)

doesn't work, creating a pyarrow.Table and then extracting the schema with

df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string"))
pa.Table.from_pandas(df).schema

does kind of work. It doesn't raise an exception, but it also doesn't preserve the string[python] extension dtype for the index; instead it converts it to object.

jrbourbeau avatar Jun 14 '22 21:06 jrbourbeau

Reported upstream here. I think passing schema=None (or supplying a manual schema to ddf.to_parquet) is the correct workaround until it is fixed upstream.

ian-r-rose avatar Jun 15 '22 17:06 ian-r-rose

Closing this issue as resolved by https://github.com/apache/arrow/pull/14080. When using the latest nightly version of pyarrow, the original example now passes. Thanks for reporting, @bnaul!

jrbourbeau avatar Sep 28 '22 14:09 jrbourbeau