dask
to_parquet fails for nullable dtype index
git bisect traced this back to #9131:
import dask.dataframe as dd, pandas as pd
dd.from_pandas(pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string")), npartitions=1).to_parquet("/tmp/to_parquet_test/")
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-10-d12a786a20e0> in <module>
1 import dask.dataframe as dd, pandas as pd
----> 2 dd.from_pandas(pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string")), npartitions=1).to_parquet("/tmp/to_parquet_test/")
~/model/.venv/lib/python3.9/site-packages/dask/dataframe/core.py in to_parquet(self, path, *args, **kwargs)
~/model/.venv/lib/python3.9/site-packages/dask/dataframe/io/parquet/core.py in to_parquet(df, path, engine, compression, write_index, append, overwrite, ignore_divisions, partition_on, storage_options, custom_metadata, write_metadata_file, compute, compute_kwargs, schema, name_function, **kwargs)
~/model/.venv/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py in initialize_write(cls, df, fs, path, append, partition_on, ignore_divisions, division_info, schema, index_cols, **kwargs)
~/model/.venv/lib/python3.9/site-packages/dask/utils.py in __call__(self, arg, *args, **kwargs)
~/model/.venv/lib/python3.9/site-packages/dask/dataframe/backends.py in get_pyarrow_schema_pandas(obj)
~/model/.venv/lib/python3.9/site-packages/pyarrow/types.pxi in pyarrow.lib.Schema.from_pandas()
~/model/.venv/lib/python3.9/site-packages/pyarrow/pandas_compat.py in dataframe_to_types(df, preserve_index, columns)
527 type_ = pa.array(c, from_pandas=True).type
528 elif _pandas_api.is_extension_array_dtype(values):
--> 529 type_ = pa.array(c.head(0), from_pandas=True).type
530 else:
531 values, type_ = get_datetimetz_type(values, c.dtype, None)
AttributeError: 'Index' object has no attribute 'head'
Removing dtype="string" fixes the issue.
I can't quite make out whether the mistaken assumption is in dask or pyarrow...? But regardless, since it worked prior to #9131, it seems like dask should be able to work around the issue.
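Until there's a fix, one possible user-side workaround (my own suggestion, not something confirmed in this thread) is to cast the nullable index back to object dtype before handing the frame to dask:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string"))

# Cast the extension-dtype index back to object so pyarrow's
# extension-array branch (the one that calls c.head(0)) is never taken.
df.index = df.index.astype(object)
print(df.index.dtype)  # object
```

After the cast, `dd.from_pandas(df, npartitions=1).to_parquet(...)` goes down the same code path as a plain object-dtype index.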
cc @jcrist
Thanks for surfacing @bnaul. I'm able to reproduce using the latest pyarrow=8.0.0 and pandas=1.4.2 releases (with the current dask main branch).
I can't quite make out whether the mistaken assumption is in dask or pyarrow...?
I'm able to reproduce with pyarrow alone (see snippet below), which makes me think the underlying issue is related to pyarrow's handling of DataFrames whose index uses an extension dtype. cc @jorisvandenbossche for visibility
In [1]: import pandas as pd
In [2]: import pyarrow as pa
In [3]: df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string"))
In [4]: pa.Schema.from_pandas(df)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [4], in <cell line: 1>()
----> 1 pa.Schema.from_pandas(df)
File ~/mambaforge/envs/dask/lib/python3.10/site-packages/pyarrow/types.pxi:1663, in pyarrow.lib.Schema.from_pandas()
File ~/mambaforge/envs/dask/lib/python3.10/site-packages/pyarrow/pandas_compat.py:529, in dataframe_to_types(df, preserve_index, columns)
527 type_ = pa.array(c, from_pandas=True).type
528 elif _pandas_api.is_extension_array_dtype(values):
--> 529 type_ = pa.array(c.head(0), from_pandas=True).type
530 else:
531 values, type_ = get_datetimetz_type(values, c.dtype, None)
AttributeError: 'Index' object has no attribute 'head'
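The AttributeError itself is easy to see with pandas alone: pyarrow's `dataframe_to_types` calls `c.head(0)` on each column it inspects, but when the index goes through that extension-dtype branch, `c` is an `Index`, which (unlike `Series`) has no `head` method:

```python
import pandas as pd

s = pd.Series(["A", "B"], dtype="string")
idx = pd.Index(["A", "B"], dtype="string")

print(hasattr(s, "head"))    # True  -- Series supports .head()
print(hasattr(idx, "head"))  # False -- Index does not

# An Index-friendly equivalent of .head(0) is positional slicing,
# which keeps the extension dtype intact:
print(idx[:0].dtype)  # string
```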
Removing dtype="string" fixes the issue.
This is also true when just using pyarrow:
In [5]: df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"]))
In [6]: pa.Schema.from_pandas(df)
Out[6]:
a: int64
__index_level_0__: string
-- schema metadata --
pandas: '{"index_columns": ["__index_level_0__"], "column_indexes": [{"na' + 430
Have a possible workaround I'll push up in a bit...
Thanks @jrbourbeau, I am seeing that passing schema=None seems to revert to the old behavior, so I guess the change to that default argument is really what surfaced this.
So it turns out that while
df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string"))
pa.Schema.from_pandas(df)
doesn't work, creating a pyarrow.Table and then extracting the schema with
df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["A", "B"], dtype="string"))
pa.Table.from_pandas(df).schema
does kind of work. It doesn't raise an exception but also doesn't preserve the string[python] extension dtype for the index. Instead it converts to object.
Reported upstream here. I think changing to schema=None (or supplying a manual schema to ddf.to_parquet) is the correct workaround until it is fixed upstream.
Closing this issue as resolved by https://github.com/apache/arrow/pull/14080. When using the latest nightly version of pyarrow, the original example now passes. Thanks for reporting @bnaul