aws-sdk-pandas icon indicating copy to clipboard operation
aws-sdk-pandas copied to clipboard

wr.s3.to_parquet() incorrectly identifies empty columns

Open coldenol opened this issue 5 months ago • 1 comments

Describe the bug

When I try to write a dataframe using wr.s3.to_parquet() method it often throws the following error on some columns:

UndetectedType Impossible to infer the equivalent Athena data type for the billingaddress column. It is completely empty (only null values) and has a too generic data type (object). Please, cast this columns with a more deterministic data type (e.g. df['billingaddress'] = df['billingaddress'].astype('string')) or pass the column schema as argument(e.g. dtype={'billingaddress': 'string'}

How to Reproduce

First I tried to make a pre-validation before passing the dataframe to the method but it helped partially:

dtype = {}
  for column in data.columns:
      if data[column].isnull().all():
          dtype[column] = "string"
data = data.astype(dtype)

I then discovered that the dataframe column that the method defined as “completely empty” contains a few values, say 3 values per 500 dataframe records.

It seems that the method only analyzes a limited sample of rows to determine a completely empty column and if there are only a few rows with values in the column, then there is a high probability that the column will be treated as completely empty.

The following code is what I did as a working workaround, but with the downside that the data type is rigidly equated to a string.

def _validate_dtypes(self, data: pd.DataFrame) -> pd.DataFrame:
    # awswrangler can not infer the data type from an entire (or mostly) null
    # column, thus we need to manually set the data type for such columns

    dtype = {}

    for column in data.columns:
        non_null_count = data[column].notnull().sum()
        total_count = len(data[column])

        if non_null_count == 0:
            dtype[column] = "string"
        # awswrangler defines a column that has only a few values also as empty
        # this is "less than 10% non-null values" workaround
        elif (non_null_count / total_count < 0.1):
            dtype[column] = "string"

    data = data.astype(dtype)

    return data

I would appreciate a solving this problem. Thank you for the great tool!

Expected behavior

No response

Your project

No response

Screenshots

No response

OS

Linux

Python version

3.9

AWS SDK for pandas version

awswrangler-3.9.1-py3-none-any.whl

Additional context

No response

coldenol avatar Sep 27 '24 13:09 coldenol