pandera
Pandera DataFrameSchema Column definition with array
Python version: 3.11
pandera version: 0.23.0
I am dealing with pandas DataFrames containing columns with array-like data whose base types are float, int, str, or bool. Now I want to create a DataFrameSchema that checks these columns, something like
'columnName': Column(<array data type>, ..., default=[])
Currently I use object as the data type, but this causes problems in certain cases. What is the right data type for this kind of data?
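To illustrate the kind of data I mean, here is a minimal sketch in plain pandas (the column names and the helper are made up for illustration): the array columns currently live in object-dtype Series whose cells are Python lists, and any validation has to be done element-wise by hand.

```python
import pandas as pd

# Array-like cells stored in an object-dtype column; each cell is a Python
# list of floats (possibly empty).
df = pd.DataFrame({
    "scalarColumn": [1, 2, 3],
    "ColumnWithArray1": pd.Series([[1.0, 2.0], [], [3.5]], dtype="object"),
})

def is_float_list(cell):
    # Hand-rolled element-wise check, since pandas has no native list dtype.
    return isinstance(cell, list) and all(isinstance(x, float) for x in cell)

print(df["ColumnWithArray1"].map(is_float_list).all())  # True
```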
I tried what is suggested in the dtype validation documentation, but every variation led to an error:
First attempt:
schema = DataFrameSchema(
    {
        'col1': Column(int, default=-1),
        'col2': Column(float, nullable=True),
        'col3': Column(bool, nullable=True),
        # ... more scalar columns
        'ColumnWithArray1': Column(List[float]),  # <-------- !
        'ColumnWithArray2': Column(object, nullable=True),
        # ... more columns, scalar and array
    }
)
Leads to:
File ".../python3.11/site-packages/pandera/backends/pandas/container.py", line 122, in validate
    raise SchemaErrors(
pandera.errors.SchemaErrors: {
    "SCHEMA": {
        "WRONG_DATATYPE": [
            ...  # all scalar-type columns are listed here, not the array/object-type columns
Second attempt:
...
'ColumnWithArray1': Column(List[float], default=[]),
'ColumnWithArray2': Column(object, nullable=True),
...
Exception:
File ".../python3.11/site-packages/pandera/backends/pandas/container.py", line 554, in set_defaults
if (
ValueError: The truth value of an empty array is ambiguous. Use `array.size > 0` to check that an array is not empty.
Third attempt:
...
'ColumnWithArray1': Column(List[float], default=[]),
'ColumnWithArray2': Column(object, nullable=True),
...
Exception: same as first attempt
Fourth attempt:
...
'ColumnWithArray1': Column(List[float], default=[1.0]),
'ColumnWithArray2': Column(object, nullable=True),
...
Exception:
File ".../python3.11/site-packages/pandas/util/_validators.py", line 299, in validate_fillna_kwargs
raise TypeError(
TypeError: "value" parameter must be a scalar or dict, but you passed a "list"
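If I read that traceback correctly, this last error comes from pandas itself rather than pandera: applying the column default ends up in `Series.fillna`, which rejects list values. A minimal sketch reproducing just the pandas part:

```python
import pandas as pd

# An object-dtype column with a missing cell that a default should fill.
s = pd.Series([None, [1.0, 2.0]], dtype="object")

try:
    # fillna only accepts scalars/dicts/Series, so a list default blows up.
    s.fillna([1.0])
except TypeError as err:
    print(f"TypeError: {err}")
```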
Hi @wolfig, this looks like a bug. Do you mind sharing a minimal reproducible example?
Sure. I created a MinimalExample.py that reproduces these errors, together with two test-data CSVs. In my tests, these exactly reproduce what I see. However, the problem/example is somewhat more complex than I described, so I need to elaborate:
- The scenario I am working in is a REST service providing data as JSON. I parse the JSON and create a DataFrame from it.
- The data from the REST service comes in chunks, as the REST service cannot provide gigabytes of data in one go.
- I write the chunked data, DataFrame by DataFrame, to one parquet file by appending to it. This is, by the way, why I use a DataFrameSchema that creates missing columns on the fly when performing validation.
- It can happen (and has happened to me) that the columns contained in the JSON vary from chunk to chunk: available in one chunk, missing in the next, or the other way around. This is what I try to simulate by providing two test-data files, one without the array-like column, and one where the array-like column is "switched on" within the data table.
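The column drift between chunks can be simulated without the REST service. A minimal sketch in plain pandas (column names and the helper are made up): each chunk is reindexed to the expected column set before appending, which mirrors the behavior I rely on `add_missing_columns` for.

```python
import pandas as pd

EXPECTED_COLUMNS = ["intColumn", "floatColumn", "listColumn"]

def normalize_chunk(chunk):
    # Add any column missing from this chunk (filled with NaN), similar to
    # what pandera's add_missing_columns does during validation.
    return chunk.reindex(columns=EXPECTED_COLUMNS)

# Chunk 1 lacks the array column; chunk 2 has it.
chunk1 = pd.DataFrame({"intColumn": [1], "floatColumn": [0.5]})
chunk2 = pd.DataFrame({"intColumn": [2], "floatColumn": [1.5],
                       "listColumn": [[1.0, 2.0]]})

combined = pd.concat([normalize_chunk(chunk1), normalize_chunk(chunk2)],
                     ignore_index=True)
print(list(combined.columns))  # every chunk now has the same columns
```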
Test code:
import os
import traceback as tb

import pandas as pd
from pandera import Column, DataFrameSchema


def get_schema_1():
    return DataFrameSchema(
        {
            'intColumn': Column(int, default=-1),
            'floatColumn': Column(float, nullable=True),
            'boolColumn': Column(bool, nullable=True),
            'strColumn': Column(str, nullable=True),
            'listColumn': Column(list[float], nullable=True),
        },
        name='my_schema',
        drop_invalid_rows=False,
        coerce=True,
        add_missing_columns=True,
        unique_column_names=True,
    )


def get_schema_2():
    # Same as schema 1, but with default=[] on the list column.
    return DataFrameSchema(
        {
            'intColumn': Column(int, default=-1),
            'floatColumn': Column(float, nullable=True),
            'boolColumn': Column(bool, nullable=True),
            'strColumn': Column(str, nullable=True),
            'listColumn': Column(list[float], default=[]),
        },
        name='my_schema',
        drop_invalid_rows=False,
        coerce=True,
        add_missing_columns=True,
        unique_column_names=True,
    )


def get_schema_3():
    # Same as schema 1, but with a non-empty default on the list column.
    return DataFrameSchema(
        {
            'intColumn': Column(int, default=-1),
            'floatColumn': Column(float, nullable=True),
            'boolColumn': Column(bool, nullable=True),
            'strColumn': Column(str, nullable=True),
            'listColumn': Column(list[float], default=[1.0]),
        },
        name='my_schema',
        drop_invalid_rows=False,
        coerce=True,
        add_missing_columns=True,
        unique_column_names=True,
    )


if __name__ == '__main__':
    # For each schema: validate TestData1 (creates the output file), then
    # TestData2 (appends to it).
    schemas = [(1, get_schema_1), (2, get_schema_2), (3, get_schema_3)]
    for number, get_schema in schemas:
        output = f'output{number}.parquet'
        for csv_file in ('TestData1.csv', 'TestData2.csv'):
            try:
                test_data = pd.read_csv(csv_file, delimiter=';')
                validated_data = get_schema().validate(test_data)
                validated_data.to_parquet(output, engine='fastparquet',
                                          append=os.path.isfile(output))
            except Exception:
                print(f'[ERROR] Schema {number}: Validating/Writing '
                      f'{csv_file} to {output} failed.')
                tb.print_exc()