
[FEA] Remove usages of `pyorc` where not necessary

Open galipremsagar opened this issue 3 years ago • 5 comments

Is your feature request related to a problem? Please describe. Writing a pandas DataFrame to an ORC file with pyorc is a fairly complex operation. Until now we have used pyorc as the reference writer because we had no other choice. With the introduction of pyarrow's ORC writer, we should switch away from pyorc, which should remove a lot of the complex handling currently needed for nested dtypes.

Describe the solution you'd like Reduce pyorc usage to almost none. We will probably keep it for a few basic dtype tests to validate compatibility, but fuzz testing and the rest of the pytests should make the switch.

Describe alternatives you've considered The FEA itself is a better alternative to pyorc 😉

Additional context https://arrow.apache.org/docs/python/generated/pyarrow.orc.write_table.html

galipremsagar avatar Aug 16 '22 15:08 galipremsagar

cc: @GregoryKimball @vuule

galipremsagar avatar Aug 16 '22 15:08 galipremsagar

Is the main difference in the PyORC's requirement to pass in a schema? Would it be possible to try this out in fuzz tests to verify that pyarrow is robust?

vuule avatar Aug 16 '22 22:08 vuule

> Is the main difference in the PyORC's requirement to pass in a schema?

That, plus the fact that while pyarrow's ORC writer was an internal-only API it lacked `stripe_size` support, which was a limiting factor in using that internal version.

> Would it be possible to try this out in fuzz tests to verify that pyarrow is robust?

Yup
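To make the schema point concrete: an ORC schema is typically written as a type string such as `struct<a:int,b:string>`, and a writer that requires one needs that string built for every nested column. The sketch below is a hypothetical helper, not code from cudf or pyorc. It assumes a simple input convention (a `str` is a primitive ORC type name, a `dict` is a struct, a one-element `list` is an array) and shows the kind of recursive handling that nested dtypes call for.

```python
def orc_schema(spec) -> str:
    """Render a nested Python description as an ORC type string.

    Hypothetical convention for this sketch:
      - str: a primitive ORC type name, used as-is
      - dict: a struct, mapping field names to child specs
      - one-element list: an array of the given element spec
    """
    if isinstance(spec, str):
        return spec
    if isinstance(spec, dict):
        fields = ",".join(
            f"{name}:{orc_schema(child)}" for name, child in spec.items()
        )
        return f"struct<{fields}>"
    if isinstance(spec, list) and len(spec) == 1:
        return f"array<{orc_schema(spec[0])}>"
    raise TypeError(f"unsupported schema spec: {spec!r}")


schema = orc_schema(
    {"id": "bigint", "tags": ["string"], "pos": {"x": "double", "y": "double"}}
)
```

With pyarrow's writer this step goes away, because the ORC schema is derived from the Arrow table's own schema.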

galipremsagar avatar Aug 16 '22 22:08 galipremsagar

I definitely like the suggestion; the pyarrow API looks very clean and... comprehensive (more so than ours 😬).

vuule avatar Aug 16 '22 23:08 vuule

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Sep 16 '22 00:09 github-actions[bot]

https://github.com/rapidsai/cudf/pull/12103 solves part of the problem. However, we will need to wait until pyarrow can write complex nested data types to an ORC file.

galipremsagar avatar Nov 14 '22 21:11 galipremsagar

This was completed in #14323

vyasr avatar May 17 '24 15:05 vyasr