
wr.s3.to_parquet() incorrectly identifies empty columns

Open coldenol opened this issue 1 year ago • 1 comments

Describe the bug

When I write a dataframe using the wr.s3.to_parquet() method, it often raises the following error for some columns:

UndetectedType Impossible to infer the equivalent Athena data type for the billingaddress column. It is completely empty (only null values) and has a too generic data type (object). Please, cast this columns with a more deterministic data type (e.g. df['billingaddress'] = df['billingaddress'].astype('string')) or pass the column schema as argument(e.g. dtype={'billingaddress': 'string'}
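For reference, the error message itself points at two workarounds. Here is a minimal sketch of both (the column name is taken from the message; the bucket path is a placeholder, and the S3 call is commented out since it needs live AWS credentials):

```python
import pandas as pd

df = pd.DataFrame({"billingaddress": [None, None, None]})

# Fix 1 suggested by the message: cast the column explicitly before writing.
df["billingaddress"] = df["billingaddress"].astype("string")

# Fix 2 suggested by the message: pass the Athena type for the column via
# the dtype argument (commented out, needs live credentials and a bucket):
# wr.s3.to_parquet(df, path="s3://my-bucket/prefix/", dataset=True,
#                  dtype={"billingaddress": "string"})
```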

How to Reproduce

At first I added a pre-validation step before passing the dataframe to the method, but it only helped partially:

dtype = {}
for column in data.columns:
    if data[column].isnull().all():
        dtype[column] = "string"
data = data.astype(dtype)

I then discovered that a column the method classified as “completely empty” actually contains a few values, say 3 values per 500 records.

It seems that the method only analyzes a limited sample of rows to detect completely empty columns, so if a column has only a few non-null rows, there is a high probability it will be treated as completely empty.

The following code is the workaround I ended up with, though it has the downside of rigidly forcing those columns to string.

def _validate_dtypes(self, data: pd.DataFrame) -> pd.DataFrame:
    # awswrangler cannot infer the data type of an entirely (or mostly) null
    # column, so we need to manually set the data type for such columns

    dtype = {}

    for column in data.columns:
        non_null_count = data[column].notnull().sum()
        total_count = len(data[column])

        # awswrangler also treats a column with only a few values as empty,
        # so this "less than 10% non-null values" check covers both the
        # fully null and the mostly null case (the total_count guard avoids
        # division by zero on an empty frame)
        if total_count == 0 or non_null_count / total_count < 0.1:
            dtype[column] = "string"

    data = data.astype(dtype)

    return data
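To illustrate the failure mode this guards against, here is a self-contained sketch (column name hypothetical) of a column with 3 non-null values in 500 rows, as described above: its non-null ratio is far below the 10% threshold, and it stays as the too-generic object dtype until it is cast.

```python
import pandas as pd

# A column with only 3 non-null values in 500 rows.
df = pd.DataFrame({"billingaddress": [None] * 497 + ["a", "b", "c"]})

ratio = df["billingaddress"].notnull().sum() / len(df)
print(ratio)                       # 0.006 -- well below the 10% threshold
print(df["billingaddress"].dtype)  # object -- too generic for Athena inference

# Applying the workaround's cast:
df["billingaddress"] = df["billingaddress"].astype("string")
print(df["billingaddress"].dtype)  # string
```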

I would appreciate a fix for this problem. Thank you for the great tool!

Expected behavior

No response

Your project

No response

Screenshots

No response

OS

Linux

Python version

3.9

AWS SDK for pandas version

awswrangler-3.9.1-py3-none-any.whl

Additional context

No response

coldenol avatar Sep 27 '24 13:09 coldenol

Hey Denis, sounds like a plan. I encountered a similar problem and was wondering if this logic could be incorporated into the _df_to_table function here.

However, it might not be necessary to build a dtype = {} mapping. Instead, it could handle only columns whose dtype is 'object', which would be more efficient:

for column in df.select_dtypes(include=['object']).columns:
    # If less than 10% of the column's values are non-null (which includes
    # fully null columns), treat it as empty and default it to 'string'
    non_null_ratio = df[column].notnull().sum() / len(df)
    if non_null_ratio < 0.1:
        df[column] = df[column].astype('string')
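As a quick sanity check of that loop (a sketch with hypothetical data): only object-dtype columns are inspected, so a numeric column passes through untouched while a mostly-null object column is defaulted to 'string'.

```python
import pandas as pd

df = pd.DataFrame({
    "billingaddress": [None] * 49 + ["somewhere"],  # object dtype, 2% non-null
    "amount": [1.0] * 50,                           # float64, left alone
})

for column in df.select_dtypes(include=["object"]).columns:
    # Columns with < 10% non-null values (including fully null ones)
    # are treated as effectively empty and cast to 'string'.
    if df[column].notnull().sum() / len(df) < 0.1:
        df[column] = df[column].astype("string")

print(df.dtypes)  # billingaddress: string, amount: float64
```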

Tian-2017 avatar Oct 05 '24 17:10 Tian-2017

Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.

github-actions[bot] avatar Dec 06 '24 09:12 github-actions[bot]