
Pandera DataFrameSchema Column definition with array

Open: wolfig opened this issue 8 months ago • 2 comments

Python version: 3.11
pandera version: 0.23.0

I am dealing with data in pandas DataFrames containing columns with array-like data whose base types are float, int, str, and bool. Now I want to create a DataFrameSchema that checks these columns, something like:

'columnName': Column(<array data type>, ..., default=[])

Currently, I use 'object' as the data type, but this causes problems in certain cases. What is the right data type for this kind of data?
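To make concrete what I mean by array-like data, here is a minimal sketch (column names are illustrative): pandas has no native list dtype, so list-valued cells land in an object-dtype column.

import pandas as pd

# minimal sketch of the data shape described above (names are illustrative)
df = pd.DataFrame({
    'intColumn': [1, 2],
    'ColumnWithArray1': [[1.0, 2.0], []],  # each cell is a list of floats
})
print(df['ColumnWithArray1'].dtype)  # object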

I tried what is suggested in the dtype validation documentation, but every variation led to an error:

First attempt:

schema = DataFrameSchema(
    {
        'col1': Column(int, default=-1),
        'col2': Column(float, nullable=True),
        'col3': Column(bool, nullable=True),
        # ... more scalar columns
        'ColumnWithArray1': Column(List[float]),  # <-------- !
        'ColumnWithArray2': Column(object, nullable=True),
        # ... more columns, scalar and array
    }
)

Leads to:

File ".../python3.11/site-packages/pandera/backends/pandas/container.py", line 122, in validate
    raise SchemaErrors(
pandera.errors.SchemaErrors: {
    "SCHEMA": {
        "WRONG_DATATYPE": [
            { ... all scalar-type columns are listed here, not the array/object-type columns ... }

Second attempt:

...
'ColumnWithArray1': Column(List[float], default=[]),
'ColumnWithArray2': Column(object, nullable=True),
...

Exception:

File ".../python3.11/site-packages/pandera/backends/pandas/container.py", line 554, in set_defaults
    if (
ValueError: The truth value of an empty array is ambiguous. Use `array.size > 0` to check that an array is not empty.
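Judging from the traceback, the empty-list default seems to be turned into an array-like internally and then evaluated in a boolean context, which is exactly the error recent NumPy raises for an empty array. A minimal sketch of the failure mode (my assumption about what happens inside set_defaults, not pandera's actual code):

import numpy as np

default = np.array([])  # what the empty-list default presumably becomes internally
if default:             # ValueError: The truth value of an empty array is ambiguous.
    pass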

Third attempt:

...
'ColumnWithArray1': Column(List[float], default=[]),
'ColumnWithArray2': Column(object, nullable=True),
...

Exception: same as first attempt

Fourth attempt:

...
'ColumnWithArray1': Column(List[float], default=[1.0]),
'ColumnWithArray2': Column(object, nullable=True),
...

Exception:

File ".../python3.11/site-packages/pandas/util/_validators.py", line 299, in validate_fillna_kwargs
    raise TypeError(
TypeError: "value" parameter must be a scalar or dict, but you passed a "list"
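This last one is reproducible with plain pandas: fillna, which pandera presumably uses to apply column defaults, rejects list values outright. A minimal sketch:

import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])
s.fillna([1.0])  # TypeError: "value" parameter must be a scalar or dict, but you passed a "list"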

wolfig · Mar 05 '25 12:03

Hi @wolfig, this looks like a bug. Do you mind sharing minimally reproducible code?

cosmicBboy · Mar 06 '25 01:03

Sure. I created a MinimalExample.py together with two test-data CSVs; in my tests, these exactly reproduce the errors. However, the problem is somewhat more complex than I described, so I need to elaborate:

  • The scenario I am working in is a REST service providing data as JSON; I parse the JSON and create a DataFrame from it.
  • The data from the REST service comes in chunks, as the service cannot provide gigabytes of data in one go.
  • I write the chunked data, DataFrame by DataFrame, to one parquet file by appending to it. This, by the way, is why I use a DataFrameSchema that creates missing columns on the fly during validation.
  • It can happen (and has happened to me) that the columns contained in the JSON vary from chunk to chunk: a column is present in one chunk and missing in the next, or the other way around.

This is what I try to simulate with the two test-data files: one without the array-like column, and one where the array-like column is "switched on" within the data table.

TestData1.csv TestData2.csv

Test code:

import pandas as pd
from pandera import Column, DataFrameSchema
import os
import traceback as tb


def get_schema_1():
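    # listColumn: declared nullable, no default value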
    schema = DataFrameSchema(
        {
            'intColumn': Column(int, default=-1),
            'floatColumn': Column(float, nullable=True),
            'boolColumn': Column(bool, nullable=True),
            'strColumn': Column(str, nullable=True),
            'listColumn': Column(list[float], nullable=True)
        }
        , name='my_schema'
        , drop_invalid_rows=False
        , coerce=True
        , add_missing_columns=True
        , unique_column_names=True
    )

    return schema


def get_schema_2():
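    # listColumn: default is an empty list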
    schema = DataFrameSchema(
        {
            'intColumn': Column(int, default=-1),
            'floatColumn': Column(float, nullable=True),
            'boolColumn': Column(bool, nullable=True),
            'strColumn': Column(str, nullable=True),
            'listColumn': Column(list[float], default=[])
        }
        , name='my_schema'
        , drop_invalid_rows=False
        , coerce=True
        , add_missing_columns=True
        , unique_column_names=True
    )

    return schema

def get_schema_3():
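    # listColumn: default is a non-empty list ([1.0])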
    schema = DataFrameSchema(
        {
            'intColumn': Column(int, default=-1),
            'floatColumn': Column(float, nullable=True),
            'boolColumn': Column(bool, nullable=True),
            'strColumn': Column(str, nullable=True),
            'listColumn': Column(list[float], default=[1.0])
        }
        , name='my_schema'
        , drop_invalid_rows=False
        , coerce=True
        , add_missing_columns=True
        , unique_column_names=True
    )

    return schema

# Validate each test file with each schema, appending the results to one parquet file per schema.
if __name__ == '__main__':
    try:
        test_data = pd.read_csv('TestData1.csv', delimiter=';')
        validated_data = get_schema_1().validate(test_data)
        validated_data.to_parquet('output1.parquet', engine='fastparquet', append=os.path.isfile('output1.parquet'))
    except Exception:
        print('[ERROR] Schema 1: Validating/Writing TestData1, creating output1, failed.')
        tb.print_exc()

    try:
        test_data = pd.read_csv('TestData2.csv', delimiter=';')
        validated_data = get_schema_1().validate(test_data)
        validated_data.to_parquet('output1.parquet', engine='fastparquet', append=os.path.isfile('output1.parquet'))
    except Exception:
        print('[ERROR] Schema 1: Validating/Writing TestData2, appending to output1, failed.')
        tb.print_exc()

    try:
        test_data = pd.read_csv('TestData1.csv', delimiter=';')
        validated_data = get_schema_2().validate(test_data)
        validated_data.to_parquet('output2.parquet', engine='fastparquet', append=os.path.isfile('output2.parquet'))
    except Exception:
        print('[ERROR] Schema 2: Validating/Writing TestData1, creating output2, failed.')
        tb.print_exc()

    try:
        test_data = pd.read_csv('TestData2.csv', delimiter=';')
        validated_data = get_schema_2().validate(test_data)
        validated_data.to_parquet('output2.parquet', engine='fastparquet', append=os.path.isfile('output2.parquet'))
    except Exception:
        print('[ERROR] Schema 2: Validating/Writing TestData2, appending to output2, failed.')
        tb.print_exc()

    try:
        test_data = pd.read_csv('TestData1.csv', delimiter=';')
        validated_data = get_schema_3().validate(test_data)
        validated_data.to_parquet('output3.parquet', engine='fastparquet', append=os.path.isfile('output3.parquet'))
    except Exception:
        print('[ERROR] Schema 3: Validating/Writing TestData1, creating output3, failed.')
        tb.print_exc()

    try:
        test_data = pd.read_csv('TestData2.csv', delimiter=';')
        validated_data = get_schema_3().validate(test_data)
        validated_data.to_parquet('output3.parquet', engine='fastparquet', append=os.path.isfile('output3.parquet'))
    except Exception:
        print('[ERROR] Schema 3: Validating/Writing TestData2, appending to output3, failed.')
        tb.print_exc()

wolfig · Mar 06 '25 14:03