Pandas2 / pyarrow backend support
Describe the bug: I can't generate a schema from a pyarrow-backed dataframe.
Code Sample, a copy-pastable example:
import io
import pandas as pd
import pandera
data = 'id,date\n0e90a7243dbb433fbfb24e23f08b0684,08-05-2022\nb6242783029545f1ac86be6b950ed6d7,30-04-2023\n'
df = pd.read_csv(io.StringIO(data), engine='pyarrow', dtype_backend='pyarrow')
print(pd.__version__)
pandera.infer_schema(df)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[127], line 8
6 df = pd.read_csv(io.StringIO(data), engine='pyarrow', dtype_backend='pyarrow')
7 print(pd.__version__)
----> 8 pandera.infer_schema(df)
File ~/.envs/menv/lib/python3.10/site-packages/pandera/schema_inference/pandas.py:39, in infer_schema(pandas_obj)
32 """Infer schema for pandas DataFrame or Series object.
33
34 :param pandas_obj: DataFrame or Series object to infer.
35 :returns: DataFrameSchema or SeriesSchema
36 :raises: TypeError if pandas_obj is not expected type.
37 """
38 if isinstance(pandas_obj, pd.DataFrame):
---> 39 return infer_dataframe_schema(pandas_obj)
40 elif isinstance(pandas_obj, pd.Series):
41 return infer_series_schema(pandas_obj)
File ~/.envs/menv/lib/python3.10/site-packages/pandera/schema_inference/pandas.py:73, in infer_dataframe_schema(df)
67 def infer_dataframe_schema(df: pd.DataFrame) -> DataFrameSchema:
68 """Infer a DataFrameSchema from a pandas DataFrame.
69
70 :param df: DataFrame object to infer.
71 :returns: DataFrameSchema
72 """
---> 73 df_statistics = infer_dataframe_statistics(df)
74 schema = DataFrameSchema(
75 columns={
76 colname: Column(
(...)
84 coerce=True,
85 )
86 schema._is_inferred = True
File ~/.envs/menv/lib/python3.10/site-packages/pandera/schema_statistics/pandas.py:15, in infer_dataframe_statistics(df)
13 """Infer column and index statistics from a pandas DataFrame."""
14 nullable_columns = df.isna().any()
---> 15 inferred_column_dtypes = {col: _get_array_type(df[col]) for col in df}
16 column_statistics = {
17 col: {
18 "dtype": dtype,
(...)
22 for col, dtype in inferred_column_dtypes.items()
23 }
24 return {
25 "columns": column_statistics if column_statistics else None,
26 "index": infer_index_statistics(df.index),
27 }
File ~/.envs/menv/lib/python3.10/site-packages/pandera/schema_statistics/pandas.py:15, in <dictcomp>(.0)
13 """Infer column and index statistics from a pandas DataFrame."""
14 nullable_columns = df.isna().any()
---> 15 inferred_column_dtypes = {col: _get_array_type(df[col]) for col in df}
16 column_statistics = {
17 col: {
18 "dtype": dtype,
(...)
22 for col, dtype in inferred_column_dtypes.items()
23 }
24 return {
25 "columns": column_statistics if column_statistics else None,
26 "index": infer_index_statistics(df.index),
27 }
File ~/.envs/menv/lib/python3.10/site-packages/pandera/schema_statistics/pandas.py:184, in _get_array_type(x)
181 def _get_array_type(x):
182 # get most granular type possible
--> 184 data_type = pandas_engine.Engine.dtype(x.dtype)
185 # for object arrays, try to infer dtype
186 if data_type is pandas_engine.Engine.dtype("object"):
File ~/.envs/menv/lib/python3.10/site-packages/pandera/engines/pandas_engine.py:209, in Engine.dtype(cls, data_type)
206 common_np_dtype = np.dtype(np_or_pd_dtype.name)
207 np_or_pd_dtype = common_np_dtype.type
--> 209 return engine.Engine.dtype(cls, np_or_pd_dtype)
File ~/.envs/menv/lib/python3.10/site-packages/pandera/engines/engine.py:265, in Engine.dtype(cls, data_type)
263 return registry.dispatch(data_type)
264 except (KeyError, ValueError):
--> 265 raise TypeError(
266 f"Data type '{data_type}' not understood by {cls.__name__}."
267 ) from None
TypeError: Data type 'string[pyarrow]' not understood by Engine.
Expected behavior: I want to be able to use Pandera with pyarrow-backed dataframes.
Versions:
- Pandas: 2.0.2
- Pandera: 0.15.2
@mattharrison I think this would be a feature request: pandera's current scope doesn't yet include support for pyarrow datatypes/backend. Gonna close https://github.com/unionai-oss/pandera/issues/1162 and merge that with this issue.
Is there a workaround to make the validation work with pyarrow types? Or do you have any idea when this will be implemented?
I would also support the request to support arrow datatypes, which I guess will become the new normal in Pandas 2.
My current workaround is to convert the arrow dtypes to nullable numpy before running pandera.
df.convert_dtypes(infer_objects=False, dtype_backend='numpy_nullable')
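Expanded into a runnable sketch (assuming pandas >= 2.0; the pyarrow backend additionally requires pyarrow to be installed), the workaround looks like:

```python
import io
import pandas as pd

data = "id,value\nabc,1\ndef,2\n"

# Read with the pyarrow backend where available; fall back to the default
# numpy backend if pyarrow is not installed.
try:
    df = pd.read_csv(io.StringIO(data), dtype_backend="pyarrow")
except ImportError:
    df = pd.read_csv(io.StringIO(data))

# Workaround: cast arrow dtypes to numpy-nullable dtypes, which pandera's
# engine understands, before calling pandera.infer_schema / validate.
df_nullable = df.convert_dtypes(infer_objects=False, dtype_backend="numpy_nullable")
print(df_nullable.dtypes)
```

After the conversion, `pandera.infer_schema(df_nullable)` proceeds normally; the trade-off is a copy of the data and loss of the arrow-specific dtypes.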
I just want to point out that pyarrow will become a required dependency in pandas 3.0, and the arrow string datatype will become the default string datatype (although numeric types will continue to default to numpy types, IIUC):
https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html
Anyone who wants to create PR for this has my blessing!
A good place to start would be:
- Dtype docs: https://pandera.readthedocs.io/en/stable/dtypes.html
- Pandas engine implementation for datatypes: https://github.com/unionai-oss/pandera/blob/main/pandera/engines/pandas_engine.py
@cosmicBboy I took a quick stab at it, adding this to pandas_engine.py:
@Engine.register_dtype(equivalents=["int", pd.ArrowDtype(pyarrow.int64())])
@immutable
class ArrowINT64(DataType, dtypes.Int):
type = pd.ArrowDtype(pyarrow.int64())
bit_width: int = 64
@Engine.register_dtype(equivalents=["string", pd.ArrowDtype(pyarrow.string())])
@immutable
class ArrowString(DataType, dtypes.String):
type = pd.ArrowDtype(pyarrow.string())
This gets validated:
import pandas as pd
import pandera as pa
df = pd.DataFrame(
[
{"foo": 123, "bar": "abc"},
],
)
class Schema(pa.DataFrameModel):
foo: int
bar: str
print("pandas:")
print(df.dtypes)
print()
print(Schema.validate(df))
print()
df = df.convert_dtypes(dtype_backend="pyarrow")
print("pandas[pyarrow]:")
print(df.dtypes)
print()
print(Schema.validate(df))
output:
pandas:
foo int64
bar object
dtype: object
foo bar
0 123 abc
pandas[pyarrow]:
foo int64[pyarrow]
bar string[pyarrow]
dtype: object
foo bar
0 123 abc
Would you like me to continue in this direction?
@aaravind100 the overall approach makes sense! Thanks for taking the initiative on this.
@Engine.register_dtype(equivalents=["int", pd.ArrowDtype(pyarrow.int64())])
Let's avoid overloading "int" here since it's already taken by the numpy int type: https://github.com/unionai-oss/pandera/blob/main/pandera/engines/numpy_engine.py#L163-L165
For the equivalents, pandera has taken the philosophy of "accept whatever pandas (or the underlying dataframe library) accepts as dtypes". So this means:
- the string alias, e.g. "int64[pyarrow]"
- the ArrowDtype instance, e.g. pd.ArrowDtype(pyarrow.int64())
Another thought: instead of requiring users to wrap the pyarrow dtype in pd.ArrowDtype up front, we could potentially do away with the need to wrap the type in pd.ArrowDtype(...) when specifying a pandera schema and just do it in the background (would be curious about your thoughts here).
import pandera as pa
import pyarrow
pa.DataFrameSchema({
"foo": pa.Column(pyarrow.int64()),
"bar": pa.Column(pyarrow.timestamp(unit="s")),
})
The benefit is that it makes for more concise schemas. As mentioned in the docs, we'll need to make sure to wrap these in pd.ArrowDtype under the hood for parameterized types like pyarrow.timestamp. This is necessary to support DataFrameModel-style schemas:
class Model(pa.DataFrameModel):
foo: pyarrow.int64 # these need to be types, so pyarrow.int64() is invalid
bar: pyarrow.timestamp = pa.Field(dtype_kwargs={"unit": "s"})
# or using typing.Annotated
bar: Annotated[pyarrow.timestamp, "s"]
So something like:
@Engine.register_dtype(equivalents=["int64[pyarrow]", pyarrow.int64, pyarrow.int64()])  # this makes sure plain pyarrow.int64 is accepted as a dtype in the schema definition
@immutable
class ArrowInt64(DataType, dtypes.Int):
type = pd.ArrowDtype(pyarrow.int64()) # we wrap this here
bit_width: int = 64
For parameterized dtypes it'll be slightly more complicated
@Engine.register_dtype(equivalents=[pyarrow.timestamp])  # pyarrow.timestamp requires a unit, so only the bare type is registered; instances are handled by from_parametrized_dtype below
@immutable
class ArrowTimestamp(DataType, dtypes.Timestamp):
type: Optional[pd.ArrowDtype] = dataclasses.field(default=None, init=False) # we'll set this in __post_init__
bit_width: int = 64
unit: Optional[str] = None
tz: Optional[datetime.tzinfo] = None
def __post_init__(self):
type_ = pd.ArrowDtype(pyarrow.timestamp(self.unit, self.tz))
object.__setattr__(self, "type", type_)
# this handles creating an instance of ArrowTimestamp in the DataFrameModel
# schema definition
@classmethod
def from_parametrized_dtype(cls, pyarrow_dtype: pyarrow.TimestampType):
return cls(unit=pyarrow_dtype.unit, tz=pyarrow_dtype.tz) # type: ignore
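The __post_init__ trick above is the standard way to derive a field on a frozen (immutable) dataclass. A minimal, generic sketch of that pattern, using hypothetical names (TimestampSpec, alias) rather than pandera's actual classes:

```python
import dataclasses
from typing import Optional

@dataclasses.dataclass(frozen=True)
class TimestampSpec:
    """Hypothetical stand-in for a parameterized dtype wrapper."""
    unit: str = "s"
    tz: Optional[str] = None
    # Derived field: excluded from __init__, filled in by __post_init__
    alias: Optional[str] = dataclasses.field(default=None, init=False)

    def __post_init__(self):
        # frozen dataclasses forbid normal assignment, so bypass __setattr__
        alias = f"timestamp[{self.unit}" + (f", tz={self.tz}]" if self.tz else "]")
        object.__setattr__(self, "alias", alias)

    @classmethod
    def from_parametrized(cls, unit: str, tz: Optional[str] = None):
        # mirrors from_parametrized_dtype: build an instance from parameters
        return cls(unit=unit, tz=tz)

spec = TimestampSpec(unit="ms")
print(spec.alias)  # timestamp[ms]
```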
pandera has taken the philosophy of "accept whatever pandas (or underlying dataframe library) accepts as dtypes"
thank you, that clears some confusion :)
The suggestion to use pyarrow.<type> does indeed make more sense to me. It also opens up schema/model interoperability with other dataframe libraries that use pyarrow types.
@mattharrison you'll be pleased to learn that #1628 has been merged :) the 0.20.0 release will have these changes. will probably cut a beta release in the next week or so if you wanted to play around with it
👍