feat: add `pandera.io.to_pyarrow_schema`

Open the-matt-morris opened this issue 2 years ago • 14 comments

closes #689

Introduces pandera.io.to_pyarrow_schema

@cosmicBboy, one thing I was unsure about was this type hint. mypy correctly identifies that it conflicts with this type hint. However, I'm not sure under what circumstances a key in DataFrameSchema.columns would not be a string. I'm making the assumption in this function that it is always a string. Tests pass, but perhaps there is a situation where this would be a problem that isn't covered in the unit tests?

Other assumptions:

  • pyarrow.date64() type is used when the pandera date data type cannot be inferred by pyarrow
  • This does not support a DataFrameSchema with field(s) that are not typed. I suppose we could potentially force those to something, say pyarrow.string(), but I don't like the feel of doing something like that.
  • No support for following types:
    • geopandas Geometries
    • Float128. We could potentially implement this, would just have to make an assumption about the precision and roll with it
    • Complex64, Complex128, Complex256
  • Argument preserve_index to pandera.io.to_pyarrow_schema functions similarly to preserve_index argument to pyarrow.Schema.from_pandas
  • As mentioned in the issue discussion, there is no support for complex types like pyarrow.list_(pyarrow.float64()).

Let me know if you feel like I missed any use cases in the unit tests.

the-matt-morris avatar Dec 08 '22 22:12 the-matt-morris

I didn't run those mypy unit tests locally, I'll have to see what's going on there.

The other thing to consider is that this may all be moot. I see the PR for DataFrameSchema.empty(), and this whole thing could potentially be simply refactored to:

import pyarrow
pyarrow.Schema.from_pandas(dataframe_schema.empty())

the-matt-morris avatar Dec 08 '22 23:12 the-matt-morris

What's the status of this PR? I have a use-case that requires this. Is there a different supported way or are we still waiting on this?

sam-goodwin avatar Mar 29 '24 22:03 sam-goodwin

will need to resolve the merge conflicts and probably rebase this onto the current main branch.

@the-matt-morris not sure if you want to pick this up again. I do think leveraging an empty method would make sense to fulfill this use case.

However, the PR that implements the empty method hasn't seen much movement, the issue is still open.

I do think a workaround for this would be:

import pandas as pd
import pandera as pa
import pyarrow

schema = pa.DataFrameSchema(..., coerce=True)
empty_df = schema.coerce_dtype(pd.DataFrame(columns=[*schema.columns]))
pyarrow.Schema.from_pandas(empty_df)

cosmicBboy avatar Apr 01 '24 21:04 cosmicBboy

@cosmicBboy this doesn't work for me. The schema infers all types as type null

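# Imports assumed for this snippet: here pa is pandera and pdt is pandera.typing.
import logging
from typing import List

import pandera as pa
import pandera.typing as pdt
from pandera.typing import Series

logger = logging.getLogger(__name__)

class TodoList(pa.DataFrameModel):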
    int16: Series[pdt.Int16] = pa.Field()
    int_list: Series[list[int]] = pa.Field()
    str_list: Series[list[str]] = pa.Field()
    int16_list: Series[list[pdt.Int16]] = pa.Field()
    int16_List: Series[List[pdt.Int16]] = pa.Field()

def test_to_arrow():
    import pandas as pd
    import pyarrow

    schema = TodoList.to_schema()
    empty_df = schema.coerce_dtype(pd.DataFrame(columns=[*schema.columns]))
    schema = pyarrow.Schema.from_pandas(empty_df)

    logger.info(schema)

Output:

int16: null
int_list: null
str_list: null
int16_list: null
int16_List: null

sam-goodwin avatar Apr 02 '24 00:04 sam-goodwin

yeah, tried this out and I think the approach in this PR (i.e. a dedicated pandera schema -> pyarrow schema translation layer) is the way to go. This is because for any non-scalar type (struct, list, dictionary, etc) I don't think pyarrow.Schema.from_pandas will be able to infer the dtype from an object column. Any pandera column with generics like List[TYPE] will be represented as an object dtype in pandas.

happy to review this or another PR that takes a crack at this, not sure if you want to continue tackling this @the-matt-morris

cosmicBboy avatar Apr 02 '24 01:04 cosmicBboy

FYI - I have a local copy of this where I am modifying it to work for my use-case. I probably need some guidance though as I had to do some custom reflection to handle the typing library and am relatively new to python. Gist here: https://gist.github.com/sam-goodwin/85c44d0241f6848e4a183a39c1abfb58

Happy to contribute this back if @the-matt-morris isn't available to finish this PR.

sam-goodwin avatar Apr 02 '24 01:04 sam-goodwin

It wasn't clear to me if I am supposed to use = pa.Field() in NamedTuple or TypedDict:

class TodoItem(NamedTuple):
    name: str
    priority: int
    pd_uint8: pdt.UInt8

I instead am using reflection and mapping based on the python types.

I see this in the original PR:

pandas_types = {
    pd.BooleanDtype(): pa.bool_(),
    pd.Int8Dtype(): pa.int8(),
    pd.Int16Dtype(): pa.int16(),
    pd.Int32Dtype(): pa.int32(),
    pd.Int64Dtype(): pa.int64(),
    pd.UInt8Dtype(): pa.uint8(),
    pd.UInt16Dtype(): pa.uint16(),
    pd.UInt32Dtype(): pa.uint32(),
    pd.UInt64Dtype(): pa.uint64(),
    pd.Float32Dtype(): pa.float32(),  # type: ignore[attr-defined]
    pd.Float64Dtype(): pa.float64(),  # type: ignore[attr-defined]
    pd.StringDtype(): pa.string(),
}

I am just doing this:

    elif python_type is pdt.UInt8:
        return pa.uint8()
    elif python_type is pdt.UInt16:
        return pa.uint16()
    elif python_type is pdt.UInt32:
        return pa.uint32()
    elif python_type is pdt.UInt64:
        return pa.uint64()
    elif python_type is pdt.Int8:
        return pa.int8()
    elif python_type is pdt.Int16:
        return pa.int16()
    elif python_type is pdt.Int32:
        return pa.int32()
    elif python_type is pdt.Int64:
        return pa.int64()
    elif python_type is pdt.Float32:
        return pa.float32()
    elif python_type is pdt.Float64:
        return pa.float64()
    elif python_type is pdt.String:
        return pa.string()
    elif python_type is pdt.Bool:

Not sure what the trade-offs are.

sam-goodwin avatar Apr 02 '24 01:04 sam-goodwin

The mapping approach is faster and simpler (it's O(1) since it's a lookup table). This would probably work for most of the simple types. For things like lists and namedtuple types you'll have to use the if statements.

In any case, feel free to create a new PR and we can iterate there.

cosmicBboy avatar Apr 02 '24 01:04 cosmicBboy

Hey @cosmicBboy, I am sorry about this one. It has been a long time and I have a new GitHub account (yes, the name is nearly exactly the same :)). Anyway, I can take a look at this one again, rebase, and get the tests to pass. I must have gotten distracted, but it looks like there is at least some interest in getting this working.

themattmorris avatar Apr 19 '24 02:04 themattmorris

@the-matt-morris I recently forked and continued this work and have tested it in production. I just didn't find the time to contribute it back. I'd be happy to open a PR or share a gist here with where I landed.

sam-goodwin avatar Apr 19 '24 02:04 sam-goodwin

> @the-matt-morris i recently forked and continued this work and have tested in production. Just didn't find the time to contribute it back. I'd be happy to open a PR or share a gist here with where I landed

Oh that's great, thanks for picking it up! If you're nearly there I will stay out of your way, but let me know if you want me to contribute to it at all or have any questions on the approach I was taking.

themattmorris avatar Apr 19 '24 02:04 themattmorris

@sam-goodwin Any pointers you could share on your approach?

pumpikano avatar Sep 06 '24 00:09 pumpikano