pandera Generate pyarrow schema from pandera schema

Is your feature request related to a problem? Please describe.
We currently need to maintain the schema twice: once for the pandas DataFrame and again for the pyarrow table. One case where both are needed is writing partitioned Parquet datasets.

Describe the solution you'd like
Generate a pyarrow schema from a pandera schema.

I plan to implement this over the Christmas holidays.

Nov 25 '21 13:11 cristianmatache

@cristianmatache Any chance you made any headway on this?

Jun 27 '22 15:06 justinlboyer

@justinlboyer not really. I recently changed jobs, so I currently have a lot on my plate. Happy to guide you though, if you would be up for implementing it.

Jun 28 '22 00:06 cristianmatache

@justinlboyer, did you ever take a look at this? This would be useful, though I'm assuming it would be limited in its implementation, i.e. something like pyarrow.list_(pyarrow.float64()) would not be supported, as there's no implementation of complex types like this in pandera (that I'm aware of).

If a basic implementation is satisfactory (i.e. not able to handle complex types like the list example above), I'd be up for collaborating on this.

Nov 28 '22 15:11 the-matt-morris

@the-matt-morris I did not; we don't need it much anymore, but I'm happy to help out. Feel free to ping me.

Nov 29 '22 13:11 justinlboyer

Hey @the-matt-morris, the basic implementation would be a good first step! (i.e. support for primitive/scalar data types)

This is related to #260; support for things like pyarrow.list_(pyarrow.float64()) would be blocked by that.

Nov 29 '22 19:11 cosmicBboy

@cosmicBboy , cool! Well I can take a stab at a PR on this...thinking would be a DataFrameSchema method that returns the pyarrow schema. Obviously will need to create data type mapping to pyarrow types somewhere. Am I oversimplifying this?

Nov 29 '22 19:11 the-matt-morris

> thinking would be a DataFrameSchema method that returns the pyarrow schema

I'd consider this part of the pandera[io] extra, with the additional pyarrow library dependency.

My recommendation would be to implement a to_pyarrow_schema function in the io module. For now I'd hesitate to add it as a DataFrameSchema method, so that the API surface of the class stays (relatively) small. I imagine more of these to/from_{schema_format} functions will be implemented in the future, and a reasonable UX for them would be pandera.io.to/from_{schema_format}(dataframe_schema).

> Obviously will need to create data type mapping to pyarrow types somewhere. Am I oversimplifying this?

Seems about right!

Nov 29 '22 21:11 cosmicBboy

Is this PR close to being merged? This is an excellent feature I would be keen to leverage!

Jan 19 '23 16:01 louis-vines

Hi @louis-vines, all current PRs are blocked by https://github.com/unionai-oss/pandera/pull/913, which involves a significant rewrite of the pandera internals. Once that's merged (hopefully within the next 2 weeks) we'll circle back to incorporate all the recent PRs, including this one.

Jan 19 '23 19:01 cosmicBboy

Excited for #913!

Even once that is merged, I will need to go back and make a few updates to the PR anyway. I'd like to try out DataFrameSchema.empty() in conjunction with pyarrow.Schema.from_pandas, as it might be more robust than hardcoding all the mappings of dtypes to pyarrow types, as I did initially.

Jan 19 '23 19:01 the-matt-morris

I see #913 is now merged (🥳). Any news on this one? Anything I could do to help?

Feb 18 '23 15:02 louis-vines

Also checking in on the status of this, please.