
Add support for pyarrow-backed arrays in Pandas

Open n-splv opened this issue 8 months ago • 1 comment

My data comes from a Spark query: transform(split(features, ' '), x -> cast(x as int)) AS features_array

With Spark, I can validate it like this:

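# Imports are assumed; the original snippet did not show them.
import pyspark.sql.types as T
from pandera.pyspark import Column, DataFrameSchema

schema = DataFrameSchema(
    columns={
        "features": Column(T.ArrayType(T.IntegerType())),
    }
)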

Once I save this Spark DataFrame to parquet and read it back with pandas, I get this dtype:

>df_pd.dtypes
feature   list<element: int32>[pyarrow]

Unfortunately, there's no way (that I'm aware of) to validate such a dataframe. Both of these attempts fail:

{
    "feature": pa.Column("list<element: int32>[pyarrow]"),  # TypeError: data type 'list<element: int32>[pyarrow]' not understood
    "feature": pa.Column(list[int]),  # schema error: expected list[int], got list<element: int32>[pyarrow]
}

n-splv avatar Mar 12 '25 18:03 n-splv

Interesting! Are you aware whether pyarrow (or pandas, or some other library) can parse those strings ("list<element: int32>[pyarrow]") into the actual pyarrow-native types?

cosmicBboy avatar Mar 13 '25 02:03 cosmicBboy
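
To make the parsing question concrete, here is a rough sketch of what splitting such a string apart could look like. It uses only the standard library. The function names are hypothetical, and it handles only the `list<name: inner>` shape from this issue plus plain primitive names. It is not an existing pandas or pyarrow API, and whether either library already offers a parser for the parameterized form is left open.

```python
import re

# Matches "list<field_name: inner_type>"; the inner type may itself be a list.
_LIST_RE = re.compile(r"^list<(?P<name>\w+): (?P<inner>.+)>$")


def _parse_type(text: str):
    """Recursively parse a type string into a nested tuple or a plain name."""
    match = _LIST_RE.match(text)
    if match is None:
        # Anything that is not a list is treated as a primitive type name.
        return text
    return ("list", match.group("name"), _parse_type(match.group("inner")))


def parse_arrow_dtype_string(text: str):
    """Parse e.g. 'list<element: int32>[pyarrow]' into a nested structure."""
    suffix = "[pyarrow]"
    if text.endswith(suffix):
        text = text[: -len(suffix)]
    return _parse_type(text.strip())


print(parse_arrow_dtype_string("list<element: int32>[pyarrow]"))
# ('list', 'element', 'int32')
```

From the nested result, the primitive names could be resolved with `pyarrow.type_for_alias` and the list levels rebuilt with `pyarrow.list_`. Structs, maps, and other parameterized types would need extra handling.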