wr.s3.to_parquet() fails to parse explicit map of decimals type

Open johanLsp opened this issue 4 years ago • 3 comments

Description

This issue occurs when passing a dtype argument to wr.s3.to_parquet() to coerce one of the columns to a map containing decimals, e.g. map<int, decimal(12,2)>. See reproduction below.

The helper function used to split the map fields doesn't take the parentheses into account and splits on every comma, resulting in the field types int, decimal(12 and 2) instead of the expected int and decimal(12,2).
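To illustrate the difference, here is a minimal sketch of a splitter that only breaks on commas at nesting depth zero. The name split_top_level is hypothetical and this is not awswrangler's actual implementation or the fix that was eventually merged; it only shows the general technique of tracking bracket depth while scanning the type string.

```python
from typing import List


def split_top_level(s: str, sep: str = ",") -> List[str]:
    """Split s on sep, ignoring separators nested inside (), <> or [].

    Hypothetical helper for illustration only; not part of awswrangler.
    """
    parts: List[str] = []
    current: List[str] = []
    depth = 0
    for ch in s:
        if ch in "(<[":
            depth += 1
        elif ch in ")>]":
            depth -= 1
        if ch == sep and depth == 0:
            # Top-level separator: close the current field.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


# Naive split breaks the decimal type apart.
print("int, decimal(12,2)".split(","))        # ['int', ' decimal(12', '2)']
# Depth-aware split keeps it intact.
print(split_top_level("int, decimal(12,2)"))  # ['int', 'decimal(12,2)']
```

The same depth tracking would also keep nested types such as map<string,decimal(6,3)> together when they appear as a field of a larger type.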

Environment

awswrangler                   2.8.0

Reproduction

import awswrangler as wr
import pandas as pd
import decimal


df = pd.DataFrame({"map_col": [{"a": decimal.Decimal("1.23")}]})

wr.s3.to_parquet(
    df=df,
    dataset=True,
    path="dummy-location",
    database="dummy-db",
    table="dummy-table",
    dtype={"map_col": "map<int, decimal(12,2)>"},
)

Output:

Traceback (most recent call last):
  File "awswrangler_map_decimal.py", line 14, in <module>
    dtype={"map_col": "map<int, decimal(12,2)>"},
  File "/home/laspj/.local/lib/python3.6/site-packages/awswrangler/_config.py", line 417, in wrapper
    return function(**args)
  File "/home/laspj/.local/lib/python3.6/site-packages/awswrangler/s3/_write_parquet.py", line 537, in to_parquet
    df=df, index=index, ignore_cols=partition_cols, dtype=dtype
  File "/home/laspj/.local/lib/python3.6/site-packages/awswrangler/_data_types.py", line 581, in pyarrow_schema_from_pandas
    columns_types[k] = athena2pyarrow(dtype=v)
  File "/home/laspj/.local/lib/python3.6/site-packages/awswrangler/_data_types.py", line 291, in athena2pyarrow
    parts: List[str] = _split_map(s=orig_dtype[4:-1])
  File "/home/laspj/.local/lib/python3.6/site-packages/awswrangler/_data_types.py", line 250, in _split_map
    raise RuntimeError(f"Invalid map fields: {s}")
RuntimeError: Invalid map fields: int, decimal(12,2)

johanLsp · Jun 17 '21 13:06

Thank you for raising this. I confirm that the issue is related to the split_fields method not handling decimals within parentheses. I will push a fix to address it.

jaidisido · Jun 22 '21 21:06

Hi @jaidisido, I came across a similar issue with a struct that includes a decimal(6,3) field: { "Type": "struct<3h:decimal(6,3)>", "Name": "rain" }

Any updates on this?

adadouche · Jan 07 '22 19:01

Looks like this was already addressed in https://github.com/aws/aws-sdk-pandas/pull/1179

kukushking · Mar 02 '23 14:03