cudf
cudf copied to clipboard
Refactor Parquet writer options and builders
Description
Adding options to the Parquet writer is made somewhat tedious by the duplication of code between the two current sets of options/builder classes; one each for the chunked and non-chunked Parquet writers. This PR pulls common options into a parent options class, and common setters into a parent builder class. The builder parent uses CRTP to allow chaining of options.
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
This pull request requires additional validation before any workflows can run on NVIDIA's runners.
Pull request vetters can view their responsibilities here.
Contributors can view more details about this message here.
cc @vuule @mhaseeb123
Seems closely related to https://github.com/rapidsai/cudf/issues/15825, just for a different format, right? Just wondering if it's worth rolling that change in too; no worries if not.
/ok to test
Seems closely related to #15825, just for a different format, right? Just wondering if it's worth rolling that change in too; no worries if not.
Yeah, I noticed that issue after submitting this. I can take a look tomorrow.
Wow, it seems like there are rooms for Parquet reader and ORC reader/writer too :+1:
Thanks for the effort, as always @etseidl. I quickly went over the long-overdue changes now but will properly review them tomorrow or Friday. I was wondering if we need to update corresponding python bindings in python.pxd and perhaps python.pyx as well?
I was wondering if we need to update corresponding python bindings in
python.pxdand perhapspython.pyxas well?
@mhaseeb123 I was worried about that when I started (due to an offline discussion with @vuule), but since the API didn't change, the cython bindings didn't have to change at all 🤯. It all just worked ("worked" == "all tests passed"...just ignore all those CI failures 🤣).
Wow, it seems like there are rooms for Parquet reader and ORC reader/writer too 👍
Oh dear 😂
I was wondering if we need to update corresponding python bindings in
python.pxdand perhapspython.pyxas well?@mhaseeb123 I was worried about that when I started (due to an offline discussion with @vuule), but since the API didn't change, the cython bindings didn't have to change at all 🤯. It all just worked ("worked" == "all tests passed"...just ignore all those CI failures 🤣).
While this is true, in the future if you make changes to add to the API of the base class you'll have a similar issue to what this PR is trying to address: you'll have to add those methods to both the chunked and non-chunked versions in Cython because they don't have the concept of the inheritance structure. If it's not too much trouble I would recommend making the changes in this PR to save yourself the trouble. Here's an example of how that inheritance looks in Cython.
To give a bit of context, the way that exposing C++ classes to Cython works is that pxd files are basically a promise to Cython that if it generates C++ code calling a certain function, that code will exist. So if you add something to a pxd file but never call that function in any pyx files, the generated C++ code will remain unchanged.
While this is true, in the future if you make changes to add to the API of the base class you'll have a similar issue to what this PR is trying to address: you'll have to add those methods to both the chunked and non-chunked versions in Cython because they don't have the concept of the inheritance structure. If it's not too much trouble I would recommend making the changes in this PR to save yourself the trouble.
Drat, I thought I'd get away with no python changes :sob:. Thanks for pointing this out.
Wow, it seems like there are rooms for Parquet reader and ORC reader/writer too 👍
Oh dear 😂
Yeah! I hope I can count on you to refactor one of these @ttnghia :)
@vyasr I've made the python changes, please check them out. I hope I did the CRTP in cython correctly. Also, I just guessed at the formatting since pre-commit would only flag problems but not fix them. It seems happy now, but that might be a low bar.
/ok to test
/ok to test
/ok to test
Sorry for the bit of churn there, had to convince GH that everything was up to date as I thought it was after all the merges of upstream branches.
I've never seen CRTP in Cython before. I'm glad it seemingly works fine! Approving, pending seeing builds and tests all pass; I'm trusting them to verify that Cython won't crash here unexpectedly.
@vyasr I swear I built this before...but just did a clean build that failed. Not entirely sure what the problem is, but redefining the child methods to return parquet_writer_options_builder_base.BuilderT& compiles and tests pass, but it makes me nervous. The one thing that is different from C++ to python is I omit the second template param to parquet_writer_options_builder_base since it's only needed for the return type of build(), which in cython I keep in the child classes. I'll test and see if that makes a difference. I'm at a loss as to why the code as is fails with 'parquet_writer_options_builder' is not a type identifier.
Edit: I cobbled together a forward declaration and a ctypedef and that too compiled. I'm still nervous though.
Edit 2: I'm dumb. The typedef is wrong, but near as I can tell, the two methods in parquet_writer_options_builder that were failing are never called, so CI passed. I'll try again tomorrow.
/ok to test
I've pushed the troublesome methods back into the parent class for now and left a FIXME. The two methods in question are never actually called, so perhaps they can just be removed. It would be good to run to ground why having the extra setters in the child class where they belong won't compile. @vyasr
BTW, I did a test where I modified the code where the builders are used and added the two setters to the chain. Neither of the earlier hacks (ctypedef or using <parent>.BuilderT as the return type) actually worked. The current code does, but if these methods are called from a chunked_parquet_writer_options_builder an error is thrown at compile time, so at least non-functioning code won't make it into a release (assuming users cannot access the builders...maybe a bad assumption?).
I've tried to reproduce the cython error by modifying the CRTP test from the cython repo, and have gotten as far as:
cdef extern from "curiously_recurring_template_pattern_GH1458_suport.h" namespace "foo" nogil:
cdef cppclass shape_t:
square_t()
void set_x(int)
cdef cppclass square_t(shape_t):
square_t()
cdef cppclass cube_t(shape_t):
cube_t()
@staticmethod
Cube builder()
@staticmethod
Cube builder(int)
cdef cppclass Base[T, Derived]:
Base()
T& build()
int calculate()
Derived& chain1()
...
Derived& chain13()
cdef cppclass Square(Base[square_t, Square]):
Square()
cdef cppclass Cube(
Base[cube_t,
Cube]):
Cube()
Cube(int)
Cube& chain20(int)
def test_derived(int x):
"""
>>> test_derived(5)
(8, 125)
"""
cube_int = cube_t.builder()
cube_int5 = cube_t.builder(x)
cube_int.chain1().chain2().chain20(10).build().set_x(2)
return (cube_int.calculate(), cube_int5.calculate())
and this compiles and passes. chain20() is the analog of the function that fails to compile. Haven't yet tried with a pxd file.
So the builder pattern w/ CRTP should be working. Maybe different eyes will see what I'm missing :frowning:.
Yikes I'm sorry @etseidl I meant to respond earlier and then completely forgot about it. I appreciate how hard you've tried to get the Cython working! I am (unfortunately) not that surprised that it's proving tricky. If the example from the Cython test suite is working for you and seems that similar, my next guess would be some sort of circular dependency issue in our Cython due to how many things we're importing? It's probably not the best use of your time to try and track this down though, so I'm fine with you pushing the code back up to the base class in order to get this merged. Once this PR is merged I can test out the Cython in a follow-up and see what I come up with. Thank you for getting it this far!
/ok to test
/ok to test
/merge
Yikes I'm sorry @etseidl I meant to respond earlier and then completely forgot about it. I appreciate how hard you've tried to get the Cython working! I am (unfortunately) not that surprised that it's proving tricky. If the example from the Cython test suite is working for you and seems that similar, my next guess would be some sort of circular dependency issue in our Cython due to how many things we're importing? It's probably not the best use of your time to try and track this down though, so I'm fine with you pushing the code back up to the base class in order to get this merged. Once this PR is merged I can test out the Cython in a follow-up and see what I come up with. Thank you for getting it this far!
See https://github.com/rapidsai/cudf/pull/15978