aws-sdk-pandas
s3.read_parquet(map_types) handles int and string inconsistently
Describe the bug
awswrangler.s3.read_parquet() has the keyword argument map_types. In my understanding, it should control whether awswrangler applies dtype mapping on top of the pandas schema that is already stored in the Parquet file's metadata, so map_types=False should return a DataFrame with exactly the schema recorded in the metadata header. If this understanding is wrong, this bug report may be invalid. The problem is that map_types=False applies dtype mapping to some columns but not to others, specifically with int64 and string columns.
How to Reproduce
The problem occurs with a DataFrame that has columns of type int64 and string:
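A minimal DataFrame of this shape can be constructed as follows. The column names some_integer and some_string are taken from the dtype listing in the maintainer's reply; the values themselves are made up for illustration:

```python
import pandas as pd

# Illustrative values; the column names match the dtype listing
# shown later in this issue.
df = pd.DataFrame(
    {
        "some_integer": pd.Series([1, 2, 3], dtype="int64"),
        "some_string": pd.Series(["a", "b", "c"], dtype="string"),
    }
)
print(df.dtypes)
# some_integer     int64
# some_string     string
# dtype: object
```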
If I save this DataFrame to S3 using s3.write_parquet, it correctly writes these types to the parquet header:
The problem appears when reading the same DataFrame back with s3.read_parquet: neither map_types=True nor map_types=False provides round-trip consistency. With True, the int64 column becomes Int64; with False, the string column becomes object:

Expected behavior
I would expect awswrangler to provide round-trip consistency in some way, even if it is not enabled by default.
Your project
No response
Screenshots
No response
OS
macOS
Python version
3.9.10
AWS SDK for pandas version
2.16.1
Additional context
Versions:
awswrangler 2.16.1
pyarrow 7.0.0
pandas 1.4.3
Thanks for raising this @marcelmindemann. I can confirm that this is an issue in current awswrangler versions. It stems from the fact that we don't give the user enough flexibility when converting from an Arrow table to a pandas DataFrame: as you can see in the source, we enforce ignore_metadata=True, for instance, which leads to situations like the one you raised.
The good news is that we intend to fix this in the next major release (3.0) by allowing the user to override the parameters passed to the to_pandas method. You can already test it:
```shell
pip install awswrangler==3.0.0a2
```
then:
```python
import awswrangler as wr

path = "s3://my-bucket/my-path"
df = wr.s3.read_parquet(
    path=path,
    pyarrow_additional_kwargs={"ignore_metadata": False, "types_mapper": None},
)
df.dtypes
# some_integer     int64
# some_string     string
# dtype: object
```
FYI @kukushking