Pandas2 / pyarrow backend support
Describe the bug: I can't generate a schema from a pyarrow-backed dataframe.
Code Sample, a copy-pastable example:
import io
import pandas as pd
import pandera
data = 'id,date\n0e90a7243dbb433fbfb24e23f08b0684,08-05-2022\nb6242783029545f1ac86be6b950ed6d7,30-04-2023\n'
df = pd.read_csv(io.StringIO(data), engine='pyarrow', dtype_backend='pyarrow')
print(pd.__version__)
pandera.infer_schema(df)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[127], line 8
6 df = pd.read_csv(io.StringIO(data), engine='pyarrow', dtype_backend='pyarrow')
7 print(pd.__version__)
----> 8 pandera.infer_schema(df)
File ~/.envs/menv/lib/python3.10/site-packages/pandera/schema_inference/pandas.py:39, in infer_schema(pandas_obj)
32 """Infer schema for pandas DataFrame or Series object.
33
34 :param pandas_obj: DataFrame or Series object to infer.
35 :returns: DataFrameSchema or SeriesSchema
36 :raises: TypeError if pandas_obj is not expected type.
37 """
38 if isinstance(pandas_obj, pd.DataFrame):
---> 39 return infer_dataframe_schema(pandas_obj)
40 elif isinstance(pandas_obj, pd.Series):
41 return infer_series_schema(pandas_obj)
File ~/.envs/menv/lib/python3.10/site-packages/pandera/schema_inference/pandas.py:73, in infer_dataframe_schema(df)
67 def infer_dataframe_schema(df: pd.DataFrame) -> DataFrameSchema:
68 """Infer a DataFrameSchema from a pandas DataFrame.
69
70 :param df: DataFrame object to infer.
71 :returns: DataFrameSchema
72 """
---> 73 df_statistics = infer_dataframe_statistics(df)
74 schema = DataFrameSchema(
75 columns={
76 colname: Column(
(...)
84 coerce=True,
85 )
86 schema._is_inferred = True
File ~/.envs/menv/lib/python3.10/site-packages/pandera/schema_statistics/pandas.py:15, in infer_dataframe_statistics(df)
13 """Infer column and index statistics from a pandas DataFrame."""
14 nullable_columns = df.isna().any()
---> 15 inferred_column_dtypes = {col: _get_array_type(df[col]) for col in df}
16 column_statistics = {
17 col: {
18 "dtype": dtype,
(...)
22 for col, dtype in inferred_column_dtypes.items()
23 }
24 return {
25 "columns": column_statistics if column_statistics else None,
26 "index": infer_index_statistics(df.index),
27 }
File ~/.envs/menv/lib/python3.10/site-packages/pandera/schema_statistics/pandas.py:15, in <dictcomp>(.0)
13 """Infer column and index statistics from a pandas DataFrame."""
14 nullable_columns = df.isna().any()
---> 15 inferred_column_dtypes = {col: _get_array_type(df[col]) for col in df}
16 column_statistics = {
17 col: {
18 "dtype": dtype,
(...)
22 for col, dtype in inferred_column_dtypes.items()
23 }
24 return {
25 "columns": column_statistics if column_statistics else None,
26 "index": infer_index_statistics(df.index),
27 }
File ~/.envs/menv/lib/python3.10/site-packages/pandera/schema_statistics/pandas.py:184, in _get_array_type(x)
181 def _get_array_type(x):
182 # get most granular type possible
--> 184 data_type = pandas_engine.Engine.dtype(x.dtype)
185 # for object arrays, try to infer dtype
186 if data_type is pandas_engine.Engine.dtype("object"):
File ~/.envs/menv/lib/python3.10/site-packages/pandera/engines/pandas_engine.py:209, in Engine.dtype(cls, data_type)
206 common_np_dtype = np.dtype(np_or_pd_dtype.name)
207 np_or_pd_dtype = common_np_dtype.type
--> 209 return engine.Engine.dtype(cls, np_or_pd_dtype)
File ~/.envs/menv/lib/python3.10/site-packages/pandera/engines/engine.py:265, in Engine.dtype(cls, data_type)
263 return registry.dispatch(data_type)
264 except (KeyError, ValueError):
--> 265 raise TypeError(
266 f"Data type '{data_type}' not understood by {cls.__name__}."
267 ) from None
TypeError: Data type 'string[pyarrow]' not understood by Engine.
Expected behavior: I want to be able to use Pandera with pyarrow-backed dataframes.
Versions:
- Pandas: 2.0.2
- Pandera: 0.15.2
@mattharrison I think this would be a feature request: pandera's current scope doesn't yet include support for pyarrow datatypes/backend. Gonna close https://github.com/unionai-oss/pandera/issues/1162 and merge that with this issue.
Is there a workaround to make the validation work with pyarrow types? Or do you have any idea when this will be implemented?
I would also support the request to support arrow datatypes, which I guess will become the new normal in Pandas 2.
My current workaround is to convert the arrow dtypes to nullable numpy before running pandera.
df.convert_dtypes(infer_objects=False, dtype_backend='numpy_nullable')
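Expanded into a runnable sketch (assuming pandas >= 2.0; the pyarrow backend additionally requires pyarrow to be installed), the workaround looks like:

```python
import io
import pandas as pd

data = "id,value\nabc,1\ndef,2\n"

# Read with the pyarrow backend where available; fall back to the default
# numpy backend if pyarrow is not installed.
try:
    df = pd.read_csv(io.StringIO(data), dtype_backend="pyarrow")
except ImportError:
    df = pd.read_csv(io.StringIO(data))

# Workaround: cast arrow dtypes to numpy-nullable dtypes, which pandera's
# engine understands, before calling pandera.infer_schema / validate.
df_nullable = df.convert_dtypes(infer_objects=False, dtype_backend="numpy_nullable")
print(df_nullable.dtypes)
```

After the conversion, `pandera.infer_schema(df_nullable)` proceeds normally; the trade-off is a copy of the data and loss of the arrow-specific dtypes.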
I just want to point out that pyarrow will become a required dependency in pandas 3.0, and the arrow string datatype will become the default string datatype (although numeric types will continue to default to numpy types, IIUC):
https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html
Anyone who wants to create PR for this has my blessing!
A good place to start would be:
- Dtype docs: https://pandera.readthedocs.io/en/stable/dtypes.html
- Pandas engine implementation for datatypes: https://github.com/unionai-oss/pandera/blob/main/pandera/engines/pandas_engine.py
@cosmicBboy I took a quick stab at it, adding this to pandas_engine.py:
@Engine.register_dtype(equivalents=["int", pd.ArrowDtype(pyarrow.int64())])
@immutable
class ArrowINT64(DataType, dtypes.Int):
type = pd.ArrowDtype(pyarrow.int64())
bit_width: int = 64
@Engine.register_dtype(equivalents=["string", pd.ArrowDtype(pyarrow.string())])
@immutable
class ArrowString(DataType, dtypes.String):
type = pd.ArrowDtype(pyarrow.string())
This gets validated:
import pandas as pd
import pandera as pa
df = pd.DataFrame(
[
{"foo": 123, "bar": "abc"},
],
)
class Schema(pa.DataFrameModel):
foo: int
bar: str
print("pandas:")
print(df.dtypes)
print()
print(Schema.validate(df))
print()
df = df.convert_dtypes(dtype_backend="pyarrow")
print("pandas[pyarrow]:")
print(df.dtypes)
print()
print(Schema.validate(df))
output:
pandas:
foo int64
bar object
dtype: object
foo bar
0 123 abc
pandas[pyarrow]:
foo int64[pyarrow]
bar string[pyarrow]
dtype: object
foo bar
0 123 abc
Would you like me to continue in this direction?
@aaravind100 the overall approach makes sense! Thanks for taking the initiative on this.
@Engine.register_dtype(equivalents=["int", pd.ArrowDtype(pyarrow.int64())])
Let's avoid overloading "int" here since it's already taken by the numpy int type: https://github.com/unionai-oss/pandera/blob/main/pandera/engines/numpy_engine.py#L163-L165
For the equivalents, pandera has taken the philosophy of "accept whatever pandas (or the underlying dataframe library) accepts as dtypes". So this means:
- the string alias, e.g. "int64[pyarrow]"
- the ArrowDtype instance, e.g. pd.ArrowDtype(pyarrow.int64())
Another thought: instead of requiring users to wrap the pyarrow dtype in pd.ArrowDtype up front, we could potentially do away with the need to wrap the type in pd.ArrowDtype(...) when specifying a pandera schema and just do it in the background (would be curious about your thoughts here).
import pandera as pa
import pyarrow
pa.DataFrameSchema({
"foo": pa.Column(pyarrow.int64()),
"bar": pa.Column(pyarrow.timestamp(unit="s")),
})
The benefit is that it makes for more concise schemas. As mentioned in the docs, we'll need to make sure to wrap these in pd.ArrowDtype under the hood for parameterized types like pyarrow.timestamp. This is necessary to support DataFrameModel-style schemas:
class Model(pa.DataFrameModel):
foo: pyarrow.int64 # these need to be types, so pyarrow.int64() is invalid
bar: pyarrow.timestamp = pa.Field(dtype_kwargs={"unit": "s"})
# or using typing.Annotated
bar: Annotated[pyarrow.timestamp, "s"]
So something like:
@Engine.register_dtype(equivalents=["int64[pyarrow]", pyarrow.int64, pyarrow.int64()])  # this makes sure plain pyarrow.int64 is accepted as a dtype in the schema definition
@immutable
class ArrowInt64(DataType, dtypes.Int):
type = pd.ArrowDtype(pyarrow.int64()) # we wrap this here
bit_width: int = 64
For parameterized dtypes it'll be slightly more complicated
@Engine.register_dtype(equivalents=[pyarrow.timestamp])  # pyarrow.timestamp requires a unit, so only the bare type is registered; instances are handled by from_parametrized_dtype below
@immutable
class ArrowTimestamp(DataType, dtypes.Timestamp):
type: Optional[pd.ArrowDtype] = dataclasses.field(default=None, init=False) # we'll set this in __post_init__
bit_width: int = 64
unit: Optional[str] = None
tz: Optional[datetime.tzinfo] = None
def __post_init__(self):
type_ = pd.ArrowDtype(pyarrow.timestamp(self.unit, self.tz))
object.__setattr__(self, "type", type_)
# this handles creating an instance of ArrowTimestamp in the DataFrameModel
# schema definition
@classmethod
def from_parametrized_dtype(cls, pyarrow_dtype: pyarrow.TimestampType):
return cls(unit=pyarrow_dtype.unit, tz=pyarrow_dtype.tz) # type: ignore
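The __post_init__ trick above is the standard way to derive a field on a frozen (immutable) dataclass. A minimal, generic sketch of that pattern, using hypothetical names (TimestampSpec, alias) rather than pandera's actual classes:

```python
import dataclasses
from typing import Optional

@dataclasses.dataclass(frozen=True)
class TimestampSpec:
    """Hypothetical stand-in for a parameterized dtype wrapper."""
    unit: str = "s"
    tz: Optional[str] = None
    # Derived field: excluded from __init__, filled in by __post_init__
    alias: Optional[str] = dataclasses.field(default=None, init=False)

    def __post_init__(self):
        # frozen dataclasses forbid normal assignment, so bypass __setattr__
        alias = f"timestamp[{self.unit}" + (f", tz={self.tz}]" if self.tz else "]")
        object.__setattr__(self, "alias", alias)

    @classmethod
    def from_parametrized(cls, unit: str, tz: Optional[str] = None):
        # mirrors from_parametrized_dtype: build an instance from parameters
        return cls(unit=unit, tz=tz)

spec = TimestampSpec(unit="ms")
print(spec.alias)  # timestamp[ms]
```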
pandera has taken the philosophy of "accept whatever pandas (or underlying dataframe library) accepts as dtypes"
thank you, that clears some confusion :)
The suggestion to use pyarrow.<type> does indeed make more sense to me. It also opens up schema/model interoperability with other dataframe libraries that use pyarrow types.
@mattharrison you'll be pleased to learn that #1628 has been merged :) the 0.20.0 release will have these changes. will probably cut a beta release in the next week or so if you wanted to play around with it
👍