pandera
feat: add `pandera.io.to_pyarrow_schema`
closes #689
Introduces pandera.io.to_pyarrow_schema
@cosmicBboy, one thing I was unsure about was this type hint. mypy correctly identifies that it conflicts with this type hint. However, I'm not sure whether there are any circumstances in which a key in DataFrameSchema.columns is not a string. I'm making the assumption in this function that it is always a string. Tests pass, but perhaps there is a situation in which this would be a problem that isn't covered in the unit tests?
Other assumptions:
- pyarrow.date64() type is used when the pandera date data type cannot be inferred by pyarrow
- This does not support a DataFrameSchema with field(s) that are not typed. I suppose we could potentially force those to something, say pyarrow.string(), but I don't like the feel of doing something like that.
- No support for the following types:
  - geopandas Geometries
  - Float128. We could potentially implement this, would just have to make an assumption about the precision and roll with it
  - Complex64, Complex128, Complex256
- Argument preserve_index to pandera.io.to_pyarrow_schema functions similarly to the preserve_index argument to pyarrow.Schema.from_pandas
- As mentioned in the issue discussion, there is no support for complex types like pyarrow.list_(pyarrow.float64()).
Let me know if you feel like I missed any use cases in the unit tests.
I didn't run those mypy unit tests locally, I'll have to see what's going on there.
The other thing to consider is that this may all be moot. I see the PR for DataFrameSchema.empty(), and this whole thing could potentially be simply refactored to:
import pyarrow
pyarrow.Schema.from_pandas(dataframe_schema.empty())
What's the status of this PR? I have a use-case that requires this. Is there a different supported way or are we still waiting on this?
will need to resolve the merge conflicts and probably rebase this onto the current main branch.
@the-matt-morris not sure if you want to pick this up again. I do think leveraging an empty method would make sense to fulfill this use case.
However, the PR that implements the empty method hasn't seen much movement, the issue is still open.
I do think a workaround for this would be:
import pandas as pd
import pandera as pa
import pyarrow

schema = pa.DataFrameSchema(..., coerce=True)
empty_df = schema.coerce_dtype(pd.DataFrame(columns=[*schema.columns]))
pyarrow.Schema.from_pandas(empty_df)
@cosmicBboy this doesn't work for me. The schema infers all types as type null
class TodoList(pa.DataFrameModel):
    int16: Series[pdt.Int16] = pa.Field()
    int_list: Series[list[int]] = pa.Field()
    str_list: Series[list[str]] = pa.Field()
    int16_list: Series[list[pdt.Int16]] = pa.Field()
    int16_List: Series[List[pdt.Int16]] = pa.Field()

def test_to_arrow():
    import pandas as pd
    import pyarrow

    schema = TodoList.to_schema()
    empty_df = schema.coerce_dtype(pd.DataFrame(columns=[*schema.columns]))
    schema = pyarrow.Schema.from_pandas(empty_df)
    logger.info(schema)
Output:
int16: null
int_list: null
str_list: null
int16_list: null
int16_List: null
yeah, tried this out and I think the approach in this PR (i.e. a dedicated pandera schema -> pyarrow schema translation layer) is the way to go. This is because for any non-scalar type (struct, list, dictionary, etc) I don't think pyarrow.Schema.from_pandas will be able to infer the dtype from an object column. Any pandera column with generics like List[TYPE] will be represented as an object dtype in pandas.
happy to review this or another PR that takes a crack at this, not sure if you want to continue tackling this @the-matt-morris
FYI - I have a local copy of this where I am modifying it to work for my use-case. I probably need some guidance though as I had to do some custom reflection to handle the typing library and am relatively new to python. Gist here: https://gist.github.com/sam-goodwin/85c44d0241f6848e4a183a39c1abfb58
Happy to contribute this back if @the-matt-morris isn't available to finish this PR.
It wasn't clear to me if I am supposed to use = pa.Field() in NamedTuple or TypedDict:
class TodoItem(NamedTuple):
    name: str
    priority: int
    pd_uint8: pdt.UInt8
I instead am using reflection and mapping based on the python types.
I see this in the original PR:
pandas_types = {
    pd.BooleanDtype(): pa.bool_(),
    pd.Int8Dtype(): pa.int8(),
    pd.Int16Dtype(): pa.int16(),
    pd.Int32Dtype(): pa.int32(),
    pd.Int64Dtype(): pa.int64(),
    pd.UInt8Dtype(): pa.uint8(),
    pd.UInt16Dtype(): pa.uint16(),
    pd.UInt32Dtype(): pa.uint32(),
    pd.UInt64Dtype(): pa.uint64(),
    pd.Float32Dtype(): pa.float32(),  # type: ignore[attr-defined]
    pd.Float64Dtype(): pa.float64(),  # type: ignore[attr-defined]
    pd.StringDtype(): pa.string(),
}
I am just doing this:
elif python_type is pdt.UInt8:
    return pa.uint8()
elif python_type is pdt.UInt16:
    return pa.uint16()
elif python_type is pdt.UInt32:
    return pa.uint32()
elif python_type is pdt.UInt64:
    return pa.uint64()
elif python_type is pdt.Int8:
    return pa.int8()
elif python_type is pdt.Int16:
    return pa.int16()
elif python_type is pdt.Int32:
    return pa.int32()
elif python_type is pdt.Int64:
    return pa.int64()
elif python_type is pdt.Float32:
    return pa.float32()
elif python_type is pdt.Float64:
    return pa.float64()
elif python_type is pdt.String:
    return pa.string()
elif python_type is pdt.Bool:
Not sure what the trade-offs are.
The mapping approach is faster and simpler (it's O(1) since it's a lookup table). This would probably work for most of the simple types. For things like lists and namedtuple types you'll have to use the if statements.
In any case, feel free to create a new PR and we can iterate there.
Hey @cosmicBboy, I am sorry about this one. It has been a long time and I have a new GitHub account (yes, the name is nearly exactly the same :). Anyways, I can take a look at this one again, rebase, and get the tests to pass. I must have gotten distracted, but it looks like there is at least some interest in getting this working.
@the-matt-morris i recently forked and continued this work and have tested in production. Just didn't find the time to contribute it back. I'd be happy to open a PR or share a gist here with where I landed
Oh that's great, thanks for picking it up! If you're nearly there I will stay out of your way, but let me know if you want me to contribute to it at all or have any questions on the approach I was taking.
@sam-goodwin Any pointers you could share on your approach?