pandera
Add support for pyarrow-backed arrays in Pandas
My data comes from a spark query:
transform(split(features, ' '), x -> cast(x as int)) AS features_array
With spark I can validate it like this:
schema = DataFrameSchema(
    columns={
        "features": Column(T.ArrayType(T.IntegerType())),
    }
)
Once I save this spark df to parquet and read it with pandas, I get this dtype:
>df_pd.dtypes
feature list<element: int32>[pyarrow]
Unfortunately, there's no way (that I'm aware of) to validate such a dataframe. Both attempts below fail:
{
    "feature": pa.Column("list<element: int32>[pyarrow]"),  # TypeError: data type 'list<element: int32>[pyarrow]' not understood
    "feature": pa.Column(list[int]),  # schema error: expected list[int], got list<element: int32>[pyarrow]
}
Interesting! Are you aware whether pyarrow (or pandas, or some other library) can parse those strings ("list<element: int32>[pyarrow]") into the actual pyarrow-native types?