Generate pyarrow schema from pandera schema

Open cristianmatache opened this issue 3 years ago • 16 comments

Is your feature request related to a problem? Please describe.

We need to maintain the schema twice: once for the pandas dataframe and again for the pyarrow table. An example where we need both is writing partitioned parquet datasets.

Describe the solution you'd like

Generate a pyarrow schema from a pandera schema.

I plan to implement this over the Christmas holidays.

cristianmatache avatar Nov 25 '21 13:11 cristianmatache

@cristianmatache Any chance you made any headway on this?

justinlboyer avatar Jun 27 '22 15:06 justinlboyer

@justinlboyer not really, I recently changed jobs so I currently have a lot on my plate. Happy to guide you though, if you would be up for implementing it.

cristianmatache avatar Jun 28 '22 00:06 cristianmatache

@justinlboyer, did you ever take a look at this? This would be useful, though I'm assuming it would be limited in its implementation, i.e. using pyarrow.list_(pyarrow.float64()) would not be supported, as there's no implementation of complex types like this in pandera (that I'm aware of).

If a basic implementation is satisfactory (i.e. not able to handle complex types like the list example above), I'd be up for collaborating on this.

the-matt-morris avatar Nov 28 '22 15:11 the-matt-morris

@the-matt-morris I did not, we don't need it much anymore, but I'm happy to help out, feel free to ping me.

justinlboyer avatar Nov 29 '22 13:11 justinlboyer

hey @the-matt-morris the basic implementation would be a good first step! (i.e. support for primitive/scalar data types)

This is related to #260; support for things like pyarrow.list_(pyarrow.float64()) would be blocked by that.

cosmicBboy avatar Nov 29 '22 19:11 cosmicBboy

@cosmicBboy, cool! Well, I can take a stab at a PR on this... thinking would be a DataFrameSchema method that returns the pyarrow schema. Obviously will need to create data type mapping to pyarrow types somewhere. Am I oversimplifying this?

the-matt-morris avatar Nov 29 '22 19:11 the-matt-morris

> thinking would be a DataFrameSchema method that returns the pyarrow schema

I'd consider this part of the pandera[io] extra, with the additional pyarrow library dependency.

My recommendation would be to implement a to_pyarrow_schema function in the io module. For now I'd hesitate to add it as a DataFrameSchema method so the API surface of the class stays (relatively) small -- I imagine more of these to/from_{schema_format} functions will be implemented in the future, and a reasonable UX for it would be pandera.io.to/from_{schema_format}(dataframe_schema).

> Obviously will need to create data type mapping to pyarrow types somewhere. Am I oversimplifying this?

Seems about right!

cosmicBboy avatar Nov 29 '22 21:11 cosmicBboy

Is this PR close to being merged? This is an excellent feature I would be keen to leverage!

louis-vines avatar Jan 19 '23 16:01 louis-vines

hi @louis-vines all current PRs are being blocked by https://github.com/unionai-oss/pandera/pull/913, which involves a significant re-write of the pandera internals. Once that's merged (hopefully within the next 2 weeks) we'll circle back to incorporate all the recent PRs, including this one.

cosmicBboy avatar Jan 19 '23 19:01 cosmicBboy

Excited for #913!

Even once that is merged, I will need to go back and make a few updates to the PR anyways. I'd like to try out DataFrameSchema.empty() in conjunction with pyarrow.Schema.from_pandas, as it might be more robust than hardcoding all the mappings of dtypes to pyarrow types that I did initially.

the-matt-morris avatar Jan 19 '23 19:01 the-matt-morris

I see #913 is now merged (🥳). Any news on this one? Anything I could do to help?

louis-vines avatar Feb 18 '23 15:02 louis-vines

Also checking in on the status of this please.

novemberkilo avatar Nov 06 '23 01:11 novemberkilo

Hi @the-matt-morris

I'd also be happy to use this feature. Any chance you can update #1047 now that #913 is merged?

Thanks!

Cakell avatar Jan 11 '24 10:01 Cakell

Checking in on the status. How can we further this along?

sam-goodwin avatar Mar 29 '24 22:03 sam-goodwin