aws-sdk-pandas
wr.s3.to_parquet() fails to parse explicit map of decimals type
Description
This issue occurs when passing a dtype argument to wr.s3.to_parquet() to coerce one of the columns to a map containing decimals, e.g. map<int, decimal(12,2)>. See reproduction below.
The helper function used to split the map fields doesn't take parentheses into account and splits on every comma, resulting in the field types int, decimal(12 and 2) instead of the expected int and decimal(12,2).
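To illustrate the difference, here is a minimal sketch of a nesting-aware split next to the naive comma split. The function name split_top_level is hypothetical and is not part of awswrangler; this is not the library's actual fix, only one way such a split could be written.

```python
from typing import List


def split_top_level(s: str, sep: str = ",") -> List[str]:
    """Split s on sep, skipping separators nested inside (), <> or [].

    Hypothetical helper for illustration; not awswrangler code.
    """
    parts: List[str] = []
    current: List[str] = []
    depth = 0  # current bracket nesting level
    for ch in s:
        if ch in "(<[":
            depth += 1
        elif ch in ")>]":
            depth -= 1
        if ch == sep and depth == 0:
            # Top-level separator: close the current field.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


# Naive split breaks the decimal's precision/scale apart.
naive = [p.strip() for p in "int, decimal(12,2)".split(",")]
# Nesting-aware split keeps decimal(12,2) as one field.
aware = split_top_level("int, decimal(12,2)")
print(naive)  # ['int', 'decimal(12', '2)']
print(aware)  # ['int', 'decimal(12,2)']
```

Tracking nesting depth also covers nested types such as a map inside a struct, since commas inside angle brackets are skipped the same way.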
Environment
awswrangler 2.8.0
Reproduction
import awswrangler as wr
import pandas as pd
import decimal
df = pd.DataFrame({"map_col": [{"a": decimal.Decimal("1.23")}]})
wr.s3.to_parquet(
    df=df,
    dataset=True,
    path="dummy-location",
    database="dummy-db",
    table="dummy-table",
    dtype={"map_col": "map<int, decimal(12,2)>"},
)
Output:
Traceback (most recent call last):
  File "awswrangler_map_decimal.py", line 14, in <module>
    dtype={"map_col": "map<int, decimal(12,2)>"},
  File "/home/laspj/.local/lib/python3.6/site-packages/awswrangler/_config.py", line 417, in wrapper
    return function(**args)
  File "/home/laspj/.local/lib/python3.6/site-packages/awswrangler/s3/_write_parquet.py", line 537, in to_parquet
    df=df, index=index, ignore_cols=partition_cols, dtype=dtype
  File "/home/laspj/.local/lib/python3.6/site-packages/awswrangler/_data_types.py", line 581, in pyarrow_schema_from_pandas
    columns_types[k] = athena2pyarrow(dtype=v)
  File "/home/laspj/.local/lib/python3.6/site-packages/awswrangler/_data_types.py", line 291, in athena2pyarrow
    parts: List[str] = _split_map(s=orig_dtype[4:-1])
  File "/home/laspj/.local/lib/python3.6/site-packages/awswrangler/_data_types.py", line 250, in _split_map
    raise RuntimeError(f"Invalid map fields: {s}")
RuntimeError: Invalid map fields: int, decimal(12,2)
Thank you for raising this. I confirm that the issue is related to the split_fields method not handling decimals within parentheses. I will push a fix to address it.
Hi @jaidisido, I came across a similar issue with a struct that includes a decimal(6,3) field: { "Type": "struct<3h:decimal(6,3)>", "Name": "rain" }
Any updates on this?
Looks like this was already addressed in https://github.com/aws/aws-sdk-pandas/pull/1179