pandera
Generate pyarrow schema from pandera schema
Is your feature request related to a problem? Please describe.
We need to maintain the schema twice, once for the pandas dataframe and again for the pyarrow table. One case where both are needed is writing partitioned parquet datasets.
Describe the solution you'd like
Generate a pyarrow schema from the pandera schema.
I plan to implement this over the Christmas holidays.
@cristianmatache Any chance you made any headway on this?
@justinlboyer not really, i recently changed jobs so i currently have a lot on my plate. Happy to guide you though, if you would be up for implementing it.
@justinlboyer, did you ever take a look at this? This would be useful, though I'm assuming it would be limited in its implementation, i.e. using pyarrow.list_(pyarrow.float64()) would not be supported, as there's no implementation of complex types like this in pandera (that I'm aware of?)
If a basic implementation is satisfactory (i.e. not able to handle complex types like the list example above), I'd be up for collaborating on this.
@the-matt-morris I did not, we don't need it much anymore, but I'm happy to help out, feel free to ping me.
hey @the-matt-morris the basic implementation would be a good first step! (i.e. support for primitive/scalar data types)
This is related to #260: support for things like pyarrow.list_(pyarrow.float64()) would be blocked by that.
@cosmicBboy, cool! Well, I can take a stab at a PR on this... I'm thinking it would be a DataFrameSchema method that returns the pyarrow schema. Obviously I will need to create a data type mapping to pyarrow types somewhere. Am I oversimplifying this?
> thinking would be a DataFrameSchema method that returns the pyarrow schema
I'd consider this part of the pandera[io] extra, with the additional pyarrow library dependency.
My recommendation would be to implement a to_pyarrow_schema function in the io module. For now I'd hesitate to add it as a DataFrameSchema method so the API surface of the class stays (relatively) small -- I imagine more of these to/from_{schema_format} functions will be implemented in the future, and a reasonable UX for it would be pandera.io.to/from_{schema_format}(dataframe_schema)
> Obviously will need to create data type mapping to pyarrow types somewhere. Am I oversimplifying this?
Seems about right!
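For readers following along, here is a rough sketch of what a module-level function with a dtype mapping might look like. It is not the code from the PR and not pandera's API: the mapping table, the helper pyarrow_factory_names, and the plain dict of column name to pandas dtype string (standing in for a DataFrameSchema) are all assumptions made for illustration. Only primitive/scalar types are covered, and pyarrow is imported lazily to mirror the idea of an optional extra.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from pandas dtype strings to the *names* of pyarrow
# type factory functions. Only primitive/scalar types are listed.
_PANDAS_TO_PYARROW_FACTORY: Dict[str, str] = {
    "bool": "bool_",
    "int32": "int32",
    "int64": "int64",
    "float32": "float32",
    "float64": "float64",
    "object": "string",  # assumption: object columns hold strings
}


def pyarrow_factory_names(dtypes: Dict[str, str]) -> List[Tuple[str, str]]:
    """Resolve each column's pandas dtype string to a pyarrow factory name."""
    fields = []
    for column, dtype in dtypes.items():
        if dtype not in _PANDAS_TO_PYARROW_FACTORY:
            # Complex types (lists, structs, ...) are deliberately unsupported.
            raise NotImplementedError(
                f"no pyarrow mapping for dtype {dtype!r} (column {column!r})"
            )
        fields.append((column, _PANDAS_TO_PYARROW_FACTORY[dtype]))
    return fields


def to_pyarrow_schema(dtypes: Dict[str, str]):
    """Build a pyarrow.Schema from a column -> dtype-string mapping."""
    # Imported here so the module can be loaded without the optional dependency.
    import pyarrow

    return pyarrow.schema(
        [
            (name, getattr(pyarrow, factory)())
            for name, factory in pyarrow_factory_names(dtypes)
        ]
    )
```

With pyarrow installed, to_pyarrow_schema({"id": "int64", "name": "object"}) would build a two-field schema. How the dtypes are read off an actual DataFrameSchema is left out here on purpose.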
Is this PR close to being merged? This is an excellent feature I would be keen to leverage!
hi @louis-vines all current PRs are being blocked by https://github.com/unionai-oss/pandera/pull/913, which involves a significant re-write of the pandera internals. Once that's merged (hopefully within the next 2 weeks) we'll circle back to incorporate all the recent PRs, including this one.
Excited for #913 !
Even once that is merged, I will need to go back and make a few updates to the PR anyways. I'd like to try out DataFrameSchema.empty() in conjunction with pyarrow.Schema.from_pandas, as it might be more robust than hardcoding all the mappings of dtypes to pyarrow types that I did initially.
I see #913 is now merged (🥳). Any news on this one? Anything I could do to help?
Also checking in on the status of this please.
Hi @the-matt-morris
I'd also be happy to use this feature. Any chance you can update #1047 now that #913 is merged?
Thanks!
Checking in on the status. How can we further this along?