
BUG: read_parquet converts pyarrow list type to numpy dtype

Open danielhanchen opened this issue 2 years ago • 31 comments

Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import pyarrow as pa
pyarrow_list_of_strings = pd.ArrowDtype(pa.list_(pa.string()))
data = pd.DataFrame({
    "Pyarrow" : pd.Series([["a"], ["a", "b"]], dtype = pyarrow_list_of_strings),
})
data.to_parquet("data.parquet") # SUCCESS
pd.read_parquet("data.parquet") # *** FAIL

import pyarrow.parquet  # needed so pa.parquet.read_table below is available

data_object = pd.DataFrame({
    "Pyarrow" : pd.Series([["a"], ["a", "b"]], dtype = object),
})
data_object.to_parquet("data.parquet")
pyarrow_internal = pa.parquet.read_table("data.parquet") # SUCCESS with type list[string]
pyarrow_internal.to_pandas() # SUCCESS, except the dtype is now object

pd.Series(pd.arrays.ArrowExtensionArray(pyarrow_internal["Pyarrow"])) # SUCCESS - the dtype is also correct!

Issue Description

Great work on extending Arrow to Pandas! Using pd.ArrowDtype(pa.list_(pa.string())) (or any other ArrowDtype variant) works when saving to Parquet, but fails when reading the Parquet file back.

In fact, if a Pandas Series holds plain Python lists of strings, e.g. ["a"], ["a", "b"], Parquet saves it internally as a list[string] type. When Pandas reads the parquet file back, it converts the column to an object dtype.

Is there a way during the reading step to either:

  1. Convert the data-type to an object type, as in the plain-list case, OR
  2. Use pd.Series(pd.arrays.ArrowExtensionArray(x)), which seems to actually work! Maybe during the conversion from the internal Pyarrow representation into Pandas, we could apply pd.Series(pd.arrays.ArrowExtensionArray(x)) to the columns which had errors? OR
  3. Somehow support these new types?

Expected Behavior

import pandas as pd
import pyarrow as pa
pyarrow_list_of_strings = pd.ArrowDtype(pa.list_(pa.string()))
data = pd.DataFrame({
    "Pyarrow" : pd.Series([["a"], ["a", "b"]], dtype = pyarrow_list_of_strings),
})
data.to_parquet("data.parquet") # SUCCESS
pd.read_parquet("data.parquet") # SUCCESS

Installed Versions

INSTALLED VERSIONS

commit : 37ea63d540fd27274cad6585082c91b1283f963d python : 3.11.3.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19045 machine : AMD64 processor : Intel64 Family 6 Model 30 Stepping 5, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : English_Australia.1252

pandas : 2.0.1 numpy : 1.24.3 pytz : 2023.3 dateutil : 2.8.2 setuptools : 67.7.2 pip : 23.1.2 Cython : 0.29.34 pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : 3.1.2 IPython : 8.12.0 pandas_datareader: None bs4 : 4.12.2 bottleneck : None brotli : fastparquet : None fsspec : None gcsfs : None matplotlib : 3.7.1 numba : 0.57.0rc1 numexpr : None odfpy : None openpyxl : 3.1.2 pandas_gbq : None pyarrow : 11.0.0 pyreadstat : None pyxlsb : None s3fs : None scipy : 1.10.1 snappy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : 2.0.1 zstandard : 0.21.0 tzdata : 2023.3 qtpy : None pyqt5 : None

danielhanchen avatar Apr 30 '23 10:04 danielhanchen

I found that during the PyArrow conversion, if you pass in a types_mapper and set ignore_metadata to True, it works!

# `table` here is the pyarrow.Table read back with pa.parquet.read_table(...)
mapping = {field.type : pd.ArrowDtype(field.type) for field in table.schema}
table.to_pandas(types_mapper = mapping.get, ignore_metadata = True)
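
A shorter variant of the same workaround (a sketch, assuming pyarrow.parquet is imported as pq) is to pass pd.ArrowDtype itself as the types_mapper:

import pandas as pd
import pyarrow.parquet as pq

table = pq.read_table("data.parquet")
df = table.to_pandas(types_mapper=pd.ArrowDtype, ignore_metadata=True)
print(df.dtypes)  # Pyarrow    list<item: string>[pyarrow]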

danielhanchen avatar May 01 '23 13:05 danielhanchen

From the traceback, it appears that pyarrow tries to convert this type to a numpy dtype by default, so I think an appropriate fix would be for pyarrow to just return an ArrowDtype here

File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/pandas_compat.py:812, in table_to_blockmanager(options, table, categories, ignore_metadata, types_mapper)
    809     table = _add_any_metadata(table, pandas_metadata)
    810     table, index = _reconstruct_index(table, index_descriptors,
    811                                       all_columns)
--> 812     ext_columns_dtypes = _get_extension_dtypes(
    813         table, all_columns, types_mapper)
    814 else:
    815     index = _pandas_api.pd.RangeIndex(table.num_rows)

File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/pandas_compat.py:865, in _get_extension_dtypes(table, columns_metadata, types_mapper)
    860 dtype = col_meta['numpy_type']
    862 if dtype not in _pandas_supported_numpy_types:
    863     # pandas_dtype is expensive, so avoid doing this for types
    864     # that are certainly numpy dtypes
--> 865     pandas_dtype = _pandas_api.pandas_dtype(dtype)
    866     if isinstance(pandas_dtype, _pandas_api.extension_dtype):
    867         if hasattr(pandas_dtype, "__from_arrow__"):

File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/pandas-shim.pxi:136, in pyarrow.lib._PandasAPIShim.pandas_dtype()

File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/pandas-shim.pxi:139, in pyarrow.lib._PandasAPIShim.pandas_dtype()

File ~/.../pandas/core/dtypes/common.py:1626, in pandas_dtype(dtype)
   1621     with warnings.catch_warnings():
   1622         # GH#51523 - Series.astype(np.integer) doesn't show
   1623         # numpy deprecation warning of np.integer
   1624         # Hence enabling DeprecationWarning
   1625         warnings.simplefilter("always", DeprecationWarning)
-> 1626         npdtype = np.dtype(dtype)
   1627 except SyntaxError as err:
   1628     # np.dtype uses `eval` which can raise SyntaxError
   1629     raise TypeError(f"data type '{dtype}' not understood") from err

TypeError: data type 'list<item: string>[pyarrow]' not understood

mroeschke avatar May 01 '23 17:05 mroeschke

Hmm, so I looked at the Pandas code, and I'm not sure using pd.ArrowDtype(dtype) will work.

The issue is data.schema.pandas_metadata['columns'][7]["numpy_type"] is a str and not an actual type object, and pd.ArrowDtype does not accept strings.

eg:

dt = data.schema.pandas_metadata['columns'][7]["numpy_type"]

returns:

'list<element: struct<rank: uint8, subtype: dictionary<values=string, indices=int32, ordered=0>, caption: string, credit: string, type: dictionary<values=string, indices=int32, ordered=0>, url: string, height: uint16, width: uint16, subType: dictionary<values=string, indices=int32, ordered=0>, crop_name: dictionary<values=string, indices=int32, ordered=0>>>[pyarrow]'

and using

pd.ArrowDtype(dt)

fails since it's a string.

I think the better approach would be to not just pass in data.schema.pandas_metadata['columns'][j]["numpy_type"] but also data.schema.types, since the latter holds the actual Arrow types, which can be converted into a pd.ArrowDtype object.
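
A quick illustration of that constraint (a sketch):

import pandas as pd
import pyarrow as pa

pd.ArrowDtype(pa.list_(pa.string()))          # works: expects a pyarrow.DataType instance
pd.ArrowDtype("list<item: string>[pyarrow]")  # raises, since a plain str is not accepted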

danielhanchen avatar May 02 '23 18:05 danielhanchen

I think this behaves as expected. You can pass dtype_backend="pyarrow" to keep the list dtype

phofl avatar May 02 '23 21:05 phofl

@phofl

Oh oops, I forgot to mention that I tried pd.read_parquet(..., dtype_backend = "pyarrow"), and the TypeError still exists. The error is exactly the same, since the dtype string still gets passed to np.dtype.

danielhanchen avatar May 03 '23 03:05 danielhanchen

Confirmed it still fails:

import pandas as pd
import pyarrow as pa
pyarrow_list_of_strings = pd.ArrowDtype(pa.list_(pa.string()))
data = pd.DataFrame({
    "Pyarrow" : pd.Series([["a"], ["a", "b"]], dtype = pyarrow_list_of_strings),
})
data.to_parquet("data.parquet") # SUCCESS
pd.read_parquet("data.parquet", dtype_backend = "pyarrow") # *** FAIL

danielhanchen avatar May 03 '23 08:05 danielhanchen

Interesting,

This one works:

data = pd.DataFrame({
    "Pyarrow" : pd.Series([["a"], ["a", "b"]]),
})
data.to_parquet("data.parquet")
pd.read_parquet("data.parquet", dtype_backend = "pyarrow")

phofl avatar May 03 '23 08:05 phofl

Ye that works since it's an object column - Pyarrow indeed saves the data inside the parquet file as list[string].

The issue is that if you explicitly pass list[string] as the dtype directly, it does not work.

Ie:

data = pd.DataFrame({
    "Pyarrow" : pd.Series([["a"], ["a", "b"]]),
})
data.dtypes

returns

Pyarrow    object
dtype: object

danielhanchen avatar May 03 '23 09:05 danielhanchen

In fact, even the object column's schema is converted to list<string> on write:

pa.parquet.read_table("data.parquet")

returns

pyarrow.Table
Pyarrow: list<item: string>
  child 0, item: string
----
Pyarrow: [[["a"],["a","b"]]]

danielhanchen avatar May 03 '23 09:05 danielhanchen

Maybe a try/except, so as to not break other parts of the code?

https://github.com/apache/arrow/blob/a77aab07b02b7d0dd6bd9c9a11c4af067d26b674/python/pyarrow/pandas_compat.py#L855

    # infer the extension columns from the pandas metadata
>>for col_meta, field in zip(columns_metadata, table.schema):
        try:
            name = col_meta['field_name']
        except KeyError:
            name = col_meta['name']
        dtype = col_meta['numpy_type']

        if dtype not in _pandas_supported_numpy_types:
            # pandas_dtype is expensive, so avoid doing this for types
            # that are certainly numpy dtypes
>>       try:
>>           pandas_dtype = _pandas_api.pandas_dtype(dtype)
>>       except:
>>           pandas_dtype = pd.ArrowDtype(field.type)
                
            if isinstance(pandas_dtype, _pandas_api.extension_dtype):
                if hasattr(pandas_dtype, "__from_arrow__"):
                    ext_columns[name] = pandas_dtype

danielhanchen avatar May 03 '23 09:05 danielhanchen

Ran into the same issue:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'a': pd.Series([['a'], ['a', 'b']], dtype=pd.ArrowDtype(pa.list_(pa.string())))})

df.to_parquet('test.parquet')  # SUCCESS
pd.read_parquet('test.parquet')  # *** FAIL

df.to_parquet('test.parquet')  # SUCCESS
pq.read_table('test.parquet').to_pandas(ignore_metadata=True, types_mapper=pd.ArrowDtype)  # SUCCESS

df.to_parquet('test.parquet', store_schema=False)  # SUCCESS
pd.read_parquet('test.parquet')  # SUCCESS

I think the last case was not mentioned so far.

takacsd avatar May 10 '23 21:05 takacsd

@takacsd oh interesting - so it's possible it's the schema-storing step that's at fault?

danielhanchen avatar May 11 '23 08:05 danielhanchen

@danielhanchen I think the problem is in the pandas-specific metadata. If the parquet file was created with something else (e.g. AWS Athena), pandas can read it just fine.

pq.write_table(pa.table({'a': pa.array([['a'], ['a', 'b']], type=pa.list_(pa.string()))}), 'test.parquet')  # SUCCESS
pd.read_parquet('test.parquet')  # SUCCESS

pq.write_table(pa.Table.from_pandas(df), 'test.parquet')  # SUCCESS
pd.read_parquet('test.parquet')  # *** FAIL

pq.write_table(pa.Table.from_pandas(df).replace_schema_metadata(), 'test.parquet') # SUCCESS
pd.read_parquet('test.parquet') # SUCCESS

This is the pandas metadata btw:

{'column_indexes': [{'field_name': None,
                     'metadata': {'encoding': 'UTF-8'},
                     'name': None,
                     'numpy_type': 'object',
                     'pandas_type': 'unicode'}],
 'columns': [{'field_name': 'a',
              'metadata': None,
              'name': 'a',
              'numpy_type': 'list<item: string>[pyarrow]',   # <---- this causes the error
              'pandas_type': 'list[unicode]'}],
 'creator': {'library': 'pyarrow', 'version': '11.0.0'},
 'index_columns': [{'kind': 'range',
                    'name': None,
                    'start': 0,
                    'step': 1,
                    'stop': 2}],
 'pandas_version': '2.0.1'}

In the case of a simple 'numpy_type': 'int64[pyarrow]' type everything works, so I suspect the _pandas_api.pandas_dtype(dtype) doesn't support complex types (yet).
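
That is easy to check via the public pandas helper (a sketch; per the traceback above, this is the same pandas_dtype function the pyarrow shim ends up calling):

from pandas.api.types import pandas_dtype

pandas_dtype("int64[pyarrow]")               # works, returns an ArrowDtype
pandas_dtype("list<item: string>[pyarrow]")  # TypeError: data type ... not understood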

takacsd avatar May 11 '23 12:05 takacsd

@takacsd oh yep, your reasoning sounds right - so I think adding a simple try/except might be the way to go? Try the numpy conversion first, and if it fails, fall back to pd.ArrowDtype.

danielhanchen avatar May 12 '23 13:05 danielhanchen

The main issue, I think, is that dtype is a string. I'm not 100% sure how _pandas_api.pandas_dtype works, but presumably it's a large dict mapping types in string form to the correct type. Given the infinite number of possible Arrow datatypes, I guess it's not feasible to keep that dictionary updated, so maybe the try/except solution is the only reasonable one?

Just my two cents.

danielhanchen avatar May 12 '23 13:05 danielhanchen

The main issue, I think, is that dtype is a string. I'm not 100% sure how _pandas_api.pandas_dtype works, but presumably it's a large dict mapping types in string form to the correct type.

It seems a little more complicated than that: pandas_dtype looks up a registry. This iterates through the registered ExtensionDtypes and tries to make sense of the string. The ExtensionDtype should understand it but doesn't, because pa.type_for_alias(base_type) only understands basic types.

We already have a special case for temporal types, so I suppose we just need something similar for arrays and maps...
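
The limitation is easy to reproduce in isolation (a small sketch):

import pyarrow as pa

pa.type_for_alias("int64")               # works: simple aliases resolve to a DataType
pa.type_for_alias("list<item: string>")  # raises ValueError: nested type strings are not aliases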

takacsd avatar May 13 '23 08:05 takacsd

@takacsd The issue, though, is that timestamps are reasonably easy to construct from text.

The below could all be possible though:

list[list[struct[int, float]]]
list[int]
struct[list[datetime]]

Constructing Arrow dtypes from that could be potentially problematic.

I guess in theory one could iterate through the string and build an expression string which you can then call eval on, i.e.:

list[struct[int32, string]] becomes pa.list_(pa.struct((pa.int32(), pa.string()))), which you can then eval.

I think a wiser approach would be to use the actual Arrow dtype from data.schema.types and then call pd.ArrowDtype on it.
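
Something along these lines, roughly (a sketch; `table` is the pyarrow.Table being converted):

import pandas as pd
import pyarrow as pa

def ext_dtypes_from_schema(table: pa.Table) -> dict:
    # Build name -> pd.ArrowDtype from the actual schema types instead of
    # re-parsing the 'numpy_type' strings stored in the pandas metadata.
    return {field.name: pd.ArrowDtype(field.type) for field in table.schema}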

danielhanchen avatar May 13 '23 09:05 danielhanchen

@danielhanchen your approach only works here, and it just ignores the metadata. I'm not a pandas developer but I suppose they generated that metadata for a reason, so it may break some things if we just ignore it.

Properly parsing the string is obviously harder, but I still think it is the better solution...

takacsd avatar May 13 '23 12:05 takacsd

@takacsd agreed parsing the metadata string is the correct way.

I thought about how one would go about doing it. Eg take: list<element: struct<rank: uint8, subtype: dictionary<values=string, indices=int32, ordered=0>, caption: string, credit: string, type: dictionary<values=string, indices=int32, ordered=0>, url: string, height: uint16, width: uint16, subType: dictionary<values=string, indices=int32, ordered=0>, crop_name: dictionary<values=string, indices=int32, ordered=0>>>[pyarrow]

You'd first have to find the innermost enclosed <...> group and then keep parsing outwards. I.e. a string converter has to find the inner-most enclosed data-type, convert it, then expand outwards, wrapped in a while loop.

The while loop looks something like this:

# one iteration of the loop: locate the innermost <...> group and convert it
left_pointer = 0
bracket_end   = dt.find(">") + 1
bracket_start = dt.rfind("<", left_pointer, bracket_end)
bracket_start = dt.rfind(" ", left_pointer, bracket_start) + 1
partial_dt = dt[bracket_start : bracket_end]
partial_dt = _partial_convert_dt(partial_dt)

and

import re

def _partial_convert_dt(partial_dt):
    # Convert one innermost type string (dictionary, list or struct) into a
    # string of pa.* constructor calls that can later be eval'ed.
    if partial_dt.startswith("dictionary"):
        value_type = re.findall(r'values=([^,]{1,})',  partial_dt)[0]
        index_type = re.findall(r'indices=([^,]{1,})', partial_dt)[0]
        ordered    = re.findall(r'ordered=([\d])',     partial_dt)[0]
        if not value_type.startswith("pa."): value_type = f"pa.{value_type}()"
        if not index_type.startswith("pa."): index_type = f"pa.{index_type}()"
        partial_dt = f"pa.dictionary(value_type = {value_type}, index_type = {index_type}, ordered = {ordered})"
    elif partial_dt.startswith("list"):
        value_type = partial_dt[partial_dt.find(" ")+1 : -1]
        if not value_type.startswith("pa."): value_type = f"pa.{value_type}()"
        partial_dt = f"pa.list_({value_type})"
    elif partial_dt.startswith("struct"):
        struct_part = partial_dt[len("struct<"):-1]
        all_structs = struct_part.split(", ")
        converted = []
        for struct in all_structs:
            name, type = struct.split(": ")
            if not type.startswith("pa."): type = f"pa.{type}()"
            converted.append(f"('{name}', {type})")
        partial_dt = f"pa.struct(({', '.join(converted)}))"
    return partial_dt

The code just gets too cumbersome sadly - the above only supports struct, dictionary and list types.

The main issue is the infinite nesting of Arrow dtypes which overcomplicates the conversion process in my view.

danielhanchen avatar May 13 '23 14:05 danielhanchen

Actually, a simpler solution is to directly call .replace on the string, replacing list<element: with pa.list_(( and so on.

However, this doesn't work with struct data-types, since struct also keeps note of each field name.

This means a struct field could have dictionary as its name, which means using eval will fail.

This probably means string parsing won't work for structs, but works for everything else. I still believe a try/except is the simplest solution. Python 3.11 now has zero-cost exceptions, so the only downside is that when the conversion fails and reaches the except branch, it'll be slower. The alternative is refactoring the code to also pass in the Arrow data-type, and if the string data-type isn't in the registry, fall back to that Arrow data-type.

danielhanchen avatar May 13 '23 14:05 danielhanchen

Yeah, after some experimenting, I think we need to give up on parsing the type string:

These two:

pd.Series([{'a': 1, 'b': 1}], dtype=pd.ArrowDtype(pa.struct({'a': pa.int64(), 'b': pa.int64()})))
pd.Series([{'a: int64, b': 1}], dtype=pd.ArrowDtype(pa.struct({'a: int64, b': pa.int64()})))

both have the following type string: struct<a: int64, b: int64>[pyarrow].

But even if we disallow such cases, it is just too hard: I tried to write a recursive parser with some regexps, but I gave up. We need a balancing matcher or a recursive pattern to match the nested <> pairs properly, but neither is supported by the built-in re module. And I don't feel like we should write a full-blown recursive descent parser for this one.

The fundamental problem is we try to parse a string which was not meant to be easily parsable. The metadata should save the nested data types in a way that is easy to work with...

takacsd avatar May 15 '23 16:05 takacsd

I was bored:

import re
from typing import NamedTuple

import pyarrow as pa


class ParseFail(Exception):
    pass

class Parsed(NamedTuple):
    type: pa.DataType
    end: int


class TypeStringParser:

    BASIC_TYPE_MATCHER = re.compile(r'\w+(\[[^\]]+\])?')
    TIMESTAMP_MATCHER = re.compile(r'timestamp\[([^,]+), tz=([^\]]+)\]')
    NAME_MATCHER = re.compile(r'\w+')  # this can be r'[^:]' to support weird names in struct

    def __init__(self, type_str: str) -> None:
        self.type_str = type_str

    def parse(self) -> pa.DataType:
        try:
            parsed = self.type(0)
        except ParseFail:
            raise ValueError(f"Can't parse '{self.type_str}' as a type.")

        if parsed.end != len(self.type_str):
            raise ValueError(f"Can't parse '{self.type_str}' as a type.")

        return parsed.type

    def type(self, pos: int) -> Parsed:
        try:
            return self.basic_type(pos)
        except ParseFail:
            pass

        try:
            return self.timestamp(pos)
        except ParseFail:
            pass

        try:
            return self.list(pos)
        except ParseFail:
            pass

        try:
            return self.dictionary(pos)
        except ParseFail:
            pass

        try:
            return self.struct(pos)
        except ParseFail:
            pass

        raise ParseFail()

    def basic_type(self, pos: int) -> Parsed:
        match = self.BASIC_TYPE_MATCHER.match(self.type_str, pos)
        if match is None:
            raise ParseFail()
        try:
            return Parsed(pa.type_for_alias(match.group(0)), match.end(0))
        except ValueError:
            pass
        raise ParseFail()

    def timestamp(self, pos: int) -> Parsed:
        match = self.TIMESTAMP_MATCHER.match(self.type_str, pos)
        if match is None:
            raise ParseFail()
        try:
            return Parsed(pa.timestamp(match.group(1).strip(), tz=match.group(2).strip()), match.end(0))
        except ValueError:
            pass
        raise ParseFail()

    def list(self, pos: int) -> Parsed:
        pos = self.accept('list<', pos)
        match = self.NAME_MATCHER.match(self.type_str, pos)
        if match is None:
            raise ParseFail()
        pos = self.accept(': ', match.end(0))
        item = self.type(pos)
        pos = self.accept('>', item.end)
        return Parsed(pa.list_(item.type), pos)

    def dictionary(self, pos: int) -> Parsed:
        pos = self.accept('dictionary<values=', pos)
        values = self.type(pos)
        pos = self.accept(', indices=', values.end)
        indices = self.type(pos)
        pos = self.accept(', ordered=', indices.end)
        try:
            pos = self.accept('0', pos)
            ordered = False
        except ParseFail:
            pos = self.accept('1', pos)
            ordered = True
        pos = self.accept('>', pos)
        return Parsed(pa.dictionary(indices.type, values.type, ordered), pos)

    def struct(self, pos: int) -> Parsed:
        pos = self.accept('struct<', pos)
        elements = []
        while self.type_str[pos] != '>':
            match = self.NAME_MATCHER.match(self.type_str, pos)
            if match is None:
                raise ParseFail()
            element_name = match.group(0)
            pos = self.accept(': ', match.end(0))
            element_type = self.type(pos)
            pos = element_type.end
            if self.type_str[pos] != '>':
                pos = self.accept(', ', pos)
            elements.append((element_name, element_type.type))
        pos = self.accept('>', pos)
        return Parsed(pa.struct(elements), pos)

    def accept(self, term: str, pos: int) -> int:
        if self.type_str.startswith(term, pos):
            return pos + len(term)
        raise ParseFail()

Probably not the prettiest recursive descent parser in existence, but it does parse arbitrarily nested types. The only restriction that I know of is that the names in the structs need to be alphanumeric.
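
For instance, a quick sanity check of the sketch above (assuming the trailing [pyarrow] suffix is stripped off before parsing):

import pyarrow as pa

assert TypeStringParser("list<item: string>").parse() == pa.list_(pa.string())
assert TypeStringParser("struct<rank: uint8, caption: string>").parse() == pa.struct(
    {"rank": pa.uint8(), "caption": pa.string()}
)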

takacsd avatar May 15 '23 21:05 takacsd

@takacsd Nice work on the parser! :) Ye, struct is the biggest issue since its fields can have arbitrary names. It gets worse if something like struct<struct : uint8> exists - yikes, that'll be painful.

Also, I just noticed that https://github.com/apache/arrow/blob/8be70c137289adba92871555ce74055719172f56/python/pyarrow/pandas_compat.py#L870 actually does handle Arrow dtypes! The issue is that the code before it breaks first, so it never gets there.

    for field in table.schema:
        typ = field.type
        if isinstance(typ, pa.BaseExtensionType):
            try:
                pandas_dtype = typ.to_pandas_dtype()
            except NotImplementedError:
                pass
            else:
                ext_columns[field.name] = pandas_dtype

The issue is https://github.com/apache/arrow/blob/8be70c137289adba92871555ce74055719172f56/python/pyarrow/pandas_compat.py#LL854C1-L868C53:

    # infer the extension columns from the pandas metadata
    for col_meta in columns_metadata:
        try:
            name = col_meta['field_name']
        except KeyError:
            name = col_meta['name']
        dtype = col_meta['numpy_type']

        if dtype not in _pandas_supported_numpy_types:
            # pandas_dtype is expensive, so avoid doing this for types
            # that are certainly numpy dtypes
            pandas_dtype = _pandas_api.pandas_dtype(dtype)               >>>>>>>> BREAKS (A)
            if isinstance(pandas_dtype, _pandas_api.extension_dtype):
                if hasattr(pandas_dtype, "__from_arrow__"):
                    ext_columns[name] = pandas_dtype

I think I might have fixed it WITHOUT using try/except:

            # infer the extension columns from the pandas metadata
            schema = table.schema                                                      <<<<
            for col_meta, field in zip(columns_metadata, schema):                      <<<<
                try:
                    name = col_meta['field_name']
                except KeyError:
                    name = col_meta['name']
                dtype = col_meta['numpy_type']

                if dtype not in _pandas_supported_numpy_types:
                    # pandas_dtype is expensive, so avoid doing this for types
                    # that are certainly numpy dtypes
                    if dtype.endswith("[pyarrow]"): pandas_dtype = pd.ArrowDtype(field.type)    <<<< (1)
                    elif dtype == "string": pandas_dtype = pd.ArrowDtype(pa.string())           <<<< (2)
                    else: pandas_dtype = _pandas_api.pandas_dtype(dtype)                        <<<< (3)
                    if isinstance(pandas_dtype, _pandas_api.extension_dtype):
                        if hasattr(pandas_dtype, "__from_arrow__"):
                            ext_columns[name] = pandas_dtype

We push the original call (A) down to line (3); then, if a dtype string ends in [pyarrow], we use pd.ArrowDtype on the actual schema type instead. For string columns, we bypass the pandas parser and handle them directly on the fast path.

This also means

    for field in table.schema:
        typ = field.type
        if isinstance(typ, pa.BaseExtensionType):
            try:
                pandas_dtype = typ.to_pandas_dtype()
            except NotImplementedError:
                pass
            else:
                ext_columns[field.name] = pandas_dtype

can be deleted - it's redundant, since we folded that logic into the loop above.

danielhanchen avatar May 16 '23 03:05 danielhanchen

I just hit this today trying to read a parquet file made by someone else, where they had used the pyarrow backend.

Here is another minimal example to add to the mix that fails on reading df2.

import io

import numpy as np
import pandas as pd
import pyarrow as pa


def main():
    df0 = pd.DataFrame(
        [
            {"foo": {"bar": True, "baz": np.float32(1)}},
            {"foo": {"bar": True, "baz": None}},
        ],
    )
    schema = pa.schema(
        [
            pa.field(
                "foo",
                pa.struct(
                    [
                        pa.field("bar", pa.bool_(), nullable=False),
                        pa.field("baz", pa.float32(), nullable=True),
                    ],
                ),
            ),
        ],
    )
    print(schema)
    with io.BytesIO() as stream0, io.BytesIO() as stream1:
        kwargs = {
            "engine": "pyarrow",
            "compression": "zstd",
            "schema": schema,
            "row_group_size": 2_000,
        }
        print("Writing df0")
        df0.to_parquet(stream0, **kwargs)

        print("Reading df1")
        stream0.seek(0)
        df1 = pd.read_parquet(stream0, engine="pyarrow", dtype_backend="pyarrow")

        print("Writing df1")
        df1.to_parquet(stream1, **kwargs)

        print("Reading df2")
        stream1.seek(0)
        df2 = pd.read_parquet(stream1, engine="pyarrow", dtype_backend="pyarrow")


if __name__ == "__main__":
    main()

Using df2 = pq.read_table(stream1).to_pandas(ignore_metadata=True) works for all of the reasons mentioned in the thread.

bretttully avatar Oct 12 '23 04:10 bretttully

I'm running into this issue as well:

[Screenshot 2024-02-07: pd.read_parquet(..., dtype_backend="pyarrow") raising the same TypeError]

giftculture avatar Feb 08 '24 01:02 giftculture

INSTALLED VERSIONS

commit : f538741432edf55c6b9fb5d0d496d2dd1d7c2457 python : 3.11.7.final.0 python-bits : 64 OS : Linux OS-release : 4.13.9-300.fc27.x86_64 Version : #1 SMP Mon Oct 23 13:41:58 UTC 2017 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8

pandas : 2.2.0 numpy : 1.26.3 pytz : 2023.4 dateutil : 2.8.2 setuptools : 69.0.3 pip : 23.3.2 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : 2.9.9 jinja2 : 3.1.3 IPython : 8.20.0 pandas_datareader : None adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : 4.12.2 bottleneck : None dataframe-api-compat : None fastparquet : 2023.10.1 fsspec : 2023.12.2 gcsfs : None matplotlib : 3.8.2 numba : 0.58.1 numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : 15.0.0 pyreadstat : None python-calamine : None pyxlsb : None s3fs : None scipy : 1.12.0 sqlalchemy : 2.0.25 tables : None tabulate : None xarray : 2024.1.1 xlrd : None zstandard : None tzdata : 2023.4 qtpy : 2.4.1 pyqt5 : None

giftculture avatar Feb 08 '24 01:02 giftculture

The only "workaround" at the pandas-level I've found is to set df.to_parquet(..., store_schema=False) for a df containing complex/nested types like a categorical. Are there any plans to get successful dtype roundtripping going forward? Is there anything in the pyarrow library we can leverage here?

Versions:

INSTALLED VERSIONS
------------------
commit                : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python                : 3.11.6.final.0
python-bits           : 64
OS                    : Windows
OS-release            : 10
Version               : 10.0.19045
machine               : AMD64
processor             : Intel64 Family 6 Model 186 Stepping 2, GenuineIntel
byteorder             : little
LC_ALL                : None
LANG                  : None
LOCALE                : English_United States.1252

pandas                : 2.2.1
numpy                 : 1.26.4
pytz                  : 2024.1
dateutil              : 2.9.0.post0
setuptools            : 65.5.0
pip                   : 23.2.1
Cython                : None
pytest                : 8.1.1
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : None
lxml.etree            : None
html5lib              : None
pymysql               : None
psycopg2              : 2.9.9
jinja2                : None
IPython               : 8.23.0
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
gcsfs                 : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
pyarrow               : 15.0.2
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
zstandard             : None
tzdata                : 2024.1
qtpy                  : None
pyqt5                 : None

jborman-stonex avatar Apr 04 '24 12:04 jborman-stonex

I would recommend using dtype_backend="pyarrow"

phofl avatar Apr 04 '24 13:04 phofl

I would recommend using dtype_backend="pyarrow"

@phofl, not sure if you saw my screenshot, but I did apply dtype_backend="pyarrow" to the read_parquet call and it still fails, unless I am misunderstanding your suggestion

giftculture avatar Apr 04 '24 13:04 giftculture

The only workaround I have found so far is the following (which works in all cases I have thought of, except round-tripping an empty dataframe with a struct or list type, setting the schema, and not using dtype_backend="pyarrow" when reading back in).

Would def welcome suggested improvements to this workaround!

Obv you can write this differently if you don't want a byte string returned, but for us that's what we want.

    def serialize(self, data: pd.DataFrame, **kwargs) -> bytes:
        """see BytesWriter.serialize -- Dump pandas dataframe to parquet bytes"""
        with io.BytesIO() as stream:

            schema = kwargs.pop("schema", None)
            all_arrow_types = all(isinstance(t, pd.ArrowDtype) for t in data.dtypes.tolist())
            # An empty dataframe may use default dtypes that are incompatible with the schema.
            # In this case, first cast to object, as the schema can always convert that to the correct type.
            if len(data) == 0 and schema is not None and not all_arrow_types:
                data = data.astype("object").astype({n: pd.ArrowDtype(schema.field(n).type) for n in schema.names})
            table = pa.Table.from_pandas(data, schema=schema)

            # drop pandas from the schema metadata to work around the bug where you can't read struct columns with
            # pandas metadata
            # see https://github.com/pandas-dev/pandas/issues/53011
            metadata = table.schema.metadata
            if b"pandas" in metadata and b"list" in metadata[b"pandas"] or b"struct" in metadata[b"pandas"]:
                del metadata[b"pandas"]
                table = table.replace_schema_metadata(metadata)

            pq.write_table(table, stream, **kwargs)
            return stream.getvalue()
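
A hypothetical read-side counterpart (a sketch, assuming pq is pyarrow.parquet and that a matching method lives on the same class), leaning on the ignore_metadata route mentioned earlier in the thread:

    def deserialize(self, blob: bytes, **kwargs) -> pd.DataFrame:
        """Read parquet bytes back while skipping the stored pandas metadata."""
        with io.BytesIO(blob) as stream:
            table = pq.read_table(stream, **kwargs)
            return table.to_pandas(ignore_metadata=True, types_mapper=pd.ArrowDtype)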

bretttully avatar Apr 04 '24 22:04 bretttully