
s3.read_parquet(map_types) handles int and string inconsistently


Describe the bug

awswrangler.s3.read_parquet() has a keyword argument map_types. My understanding is that it controls whether awswrangler applies dtype mapping on top of the pandas schema already stored in the parquet header metadata, so map_types=False should return a DataFrame with exactly the schema recorded in that metadata. If this understanding is wrong, this bug report may be invalid. The problem is that map_types=False applies dtype mapping to some columns but not to others, specifically int64 and string columns.

How to Reproduce

The problem occurs with a DataFrame that has one int64 column and one string column [screenshot: df.dtypes shows some_integer as int64 and some_string as string]. If I save this DataFrame to S3 using s3.write_parquet, these types are correctly written to the parquet header metadata [screenshot: parquet header]. The problem is reading the same DataFrame back with s3.read_parquet: neither map_types=True nor map_types=False provides round-trip consistency. With True, the int64 column becomes Int64; with False, the string column becomes object [screenshot: read-back dtypes]. A minimal sketch of the reproduction follows.
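This is a sketch of the steps above (the bucket path is hypothetical; the column names are taken from the screenshots):

import pandas as pd
import awswrangler as wr

# Hypothetical bucket/key, for illustration only
path = "s3://my-bucket/map-types-repro.parquet"

df = pd.DataFrame({
    "some_integer": pd.Series([1, 2, 3], dtype="int64"),
    "some_string": pd.Series(["a", "b", "c"], dtype="string"),
})

wr.s3.write_parquet(df=df, path=path)

# map_types=True: some_integer comes back as Int64
print(wr.s3.read_parquet(path=path, map_types=True).dtypes)

# map_types=False: some_string comes back as object
print(wr.s3.read_parquet(path=path, map_types=False).dtypes)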

Expected behavior

I would expect awswrangler to provide round-trip consistency in some way, even if it is not enabled by default.
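Concretely, with df and path as in the reproduction above, I would expect a check like this to pass for at least one setting of map_types:

# Round-trip consistency: reading back should restore the original dtypes
df_back = wr.s3.read_parquet(path=path, map_types=False)  # or map_types=True
assert df_back.dtypes.equals(df.dtypes)  # currently fails for both settings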

Your project

No response

Screenshots

No response

OS

macOS

Python version

3.9.10

AWS SDK for pandas version

2.16.1

Additional context

Versions: awswrangler 2.16.1, pyarrow 7.0.0, pandas 1.4.3

marcelmindemann avatar Sep 09 '22 08:09 marcelmindemann

Thanks for raising this @marcelmindemann. I can confirm that this is an issue in the current awswrangler versions. It stems from the fact that we don't give the user enough flexibility when converting from an Arrow table to a pandas DataFrame. As you can see here, we enforce ignore_metadata=True, for instance, which leads to situations like the one you raised.
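For context, pyarrow's Table.to_pandas exposes both of the knobs involved here. A minimal sketch (in-memory, no S3 involved) of how they affect the result:

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({
    "some_integer": pd.Series([1, 2], dtype="int64"),
    "some_string": pd.Series(["a", "b"], dtype="string"),
})
table = pa.Table.from_pandas(df)  # the pandas schema is stored in the table metadata

# ignore_metadata=True discards the stored pandas schema,
# so the string column falls back to object:
print(table.to_pandas(ignore_metadata=True).dtypes)

# ignore_metadata=False honours the stored schema and
# restores the original dtypes:
print(table.to_pandas(ignore_metadata=False).dtypes)

# A types_mapper is how int64 can be promoted to the nullable
# Int64 extension dtype:
print(table.to_pandas(types_mapper={pa.int64(): pd.Int64Dtype()}.get).dtypes)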

The good news is that we intend to fix this in the next major release (3.0) by enabling the user to override the parameters passed to pyarrow's to_pandas method. You can already test it:

pip install awswrangler==3.0.0a2

then:

import awswrangler as wr

path = "s3://my-bucket/my-path"

# Override the defaults passed to pyarrow's to_pandas: honour the
# pandas metadata stored in the file and skip the dtype mapper
df = wr.s3.read_parquet(
    path=path,
    pyarrow_additional_kwargs={"ignore_metadata": False, "types_mapper": None},
)

df.dtypes
# some_integer     int64
# some_string     string
# dtype: object

FYI @kukushking

jaidisido avatar Sep 16 '22 10:09 jaidisido