fix pandas pyarrow string validation

Open aaravind100 opened this issue 1 year ago • 2 comments

Fixes a bug where validating a column declared as a pyarrow string raised a SchemaError even though the column already had the pyarrow string dtype.

Snippet:

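import pandas as pd
import pandera as pa
import pyarrow

# One-column frame whose dtype is the Arrow-backed string type
df = pd.DataFrame([{"foo": "bar"}], dtype=pd.ArrowDtype(pyarrow.string()))
df.info()

# Declare the same column as a pyarrow string and validate the frame
Schema = pa.DataFrameSchema({"foo": pa.Column(pyarrow.string)})
Schema.validate(df).info()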

Before:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype          
---  ------  --------------  -----          
 0   foo     1 non-null      string[pyarrow]
dtypes: string[pyarrow](1)
memory usage: 139.0 bytes
Traceback (most recent call last):
  File "/home/jovyan/work/pandera/scraps.py", line 61, in <module>
    Schema.validate(df).info()
    ^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/work/pandera/pandera/api/pandas/container.py", line 125, in validate
    return self._validate(
           ^^^^^^^^^^^^^^^
  File "/home/jovyan/work/pandera/pandera/api/pandas/container.py", line 154, in _validate
    return self.get_backend(check_obj).validate(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/work/pandera/pandera/backends/pandas/container.py", line 104, in validate
    error_handler = self.run_checks_and_handle_errors(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/work/pandera/pandera/backends/pandas/container.py", line 179, in run_checks_and_handle_errors
    error_handler.collect_error(
  File "/home/jovyan/work/pandera/pandera/api/base/error_handler.py", line 54, in collect_error
    raise schema_error from original_exc
  File "/home/jovyan/work/pandera/pandera/backends/pandas/container.py", line 200, in run_schema_component_checks
    result = schema_component.validate(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/work/pandera/pandera/api/dataframe/components.py", line 163, in validate
    return self.get_backend(check_obj).validate(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/work/pandera/pandera/backends/pandas/components.py", line 132, in validate
    validate_column(check_obj, column_name)
  File "/home/jovyan/work/pandera/pandera/backends/pandas/components.py", line 92, in validate_column
    error_handler.collect_error(
  File "/home/jovyan/work/pandera/pandera/api/base/error_handler.py", line 54, in collect_error
    raise schema_error from original_exc
  File "/home/jovyan/work/pandera/pandera/backends/pandas/components.py", line 72, in validate_column
    validated_check_obj = super(ColumnBackend, self).validate(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/work/pandera/pandera/backends/pandas/array.py", line 81, in validate
    error_handler = self.run_checks_and_handle_errors(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/work/pandera/pandera/backends/pandas/array.py", line 145, in run_checks_and_handle_errors
    error_handler.collect_error(
  File "/home/jovyan/work/pandera/pandera/api/base/error_handler.py", line 54, in collect_error
    raise schema_error from original_exc
pandera.errors.SchemaError: expected series 'foo' to have type string[pyarrow], got string[pyarrow]

After:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype          
---  ------  --------------  -----          
 0   foo     1 non-null      string[pyarrow]
dtypes: string[pyarrow](1)
memory usage: 139.0 bytes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype          
---  ------  --------------  -----          
 0   foo     1 non-null      string[pyarrow]
dtypes: string[pyarrow](1)
memory usage: 139.0 bytes

aaravind100, May 11 '24 11:05

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 83.27%. Comparing base (4df61da) to head (954b6c5). Report is 91 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1636       +/-   ##
===========================================
- Coverage   94.29%   83.27%   -11.02%     
===========================================
  Files          91      116       +25     
  Lines        7024     8646     +1622     
===========================================
+ Hits         6623     7200      +577     
- Misses        401     1446     +1045     

codecov[bot], May 11 '24 11:05

@cosmicBboy could the uv cache be corrupted? I remember seeing something similar a few weeks back. We could try clearing it with uv cache clean.

aaravind100, May 12 '24 08:05