aws-sdk-pandas
s3.read_parquet(map_types) handles int and string inconsistently
Describe the bug
`awswrangler.s3.read_parquet()` has the keyword argument `map_types`. In my understanding, it should decide whether awswrangler applies dtype mapping on top of the pandas schema that already exists in the parquet header metadata. `map_types=False` should return a DataFrame with exactly the schema that is set in the metadata header. If this understanding is false, this bug report may be invalid. The problem is that setting `map_types=False` applies dtype mapping to some columns and not to others, specifically with `int64` and `string` columns.
How to Reproduce
The problem occurs with a DataFrame that has a column of type `int64` and one of type `string`:
If I save this DataFrame to S3 using `s3.write_parquet`, it correctly writes these types to the parquet header:
The problem is reading back the same DataFrame using `s3.read_parquet`: neither `True` nor `False` for the `map_types` argument provides round-trip consistency. If it is `True`, the `int64` column becomes `Int64`; if it is `False`, the `string` column becomes type `object`:
Expected behavior
I would expect awswrangler to provide round-trip consistency in some way, even if it is not enabled by default.
Your project
No response
Screenshots
No response
OS
macOS
Python version
3.9.10
AWS SDK for pandas version
2.16.1
Additional context
Versions:
awswrangler 2.16.1
pyarrow 7.0.0
pandas 1.4.3
Thanks for raising this @marcelmindemann. I can confirm that this is an issue in the current `awswrangler` versions. It stems from the fact that we don't give the user enough flexibility when converting from an Arrow table to a pandas DataFrame. As you can see here, we enforce `ignore_metadata=True`, for instance, which leads to situations like the one you raised.
The good news is that we intend to fix this in the next major release (3.0) by enabling the user to override the parameters passed to the `to_pandas` method. You can already test it:
```
pip install awswrangler==3.0.0a2
```

then:

```python
import awswrangler as wr

path = "s3://my-bucket/my-path"
df = wr.s3.read_parquet(
    path=path,
    pyarrow_additional_kwargs={"ignore_metadata": False, "types_mapper": None},
)
df.dtypes
```

```
some_integer     int64
some_string     string
dtype: object
```
FYI @kukushking