cudf icon indicating copy to clipboard operation
cudf copied to clipboard

Refactor Parquet writer options and builders

Open etseidl opened this issue 1 year ago • 13 comments

Description

Adding options to the Parquet writer is made somewhat tedious by the duplication of code between the two current sets of options/builder classes; one each for the chunked and non-chunked Parquet writers. This PR pulls common options into a parent options class, and common setters into a parent builder class. The builder parent uses CRTP to allow chaining of options.

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

etseidl avatar May 22 '24 23:05 etseidl

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot[bot] avatar May 22 '24 23:05 copy-pr-bot[bot]

cc @vuule @mhaseeb123

etseidl avatar May 22 '24 23:05 etseidl

Seems closely related to https://github.com/rapidsai/cudf/issues/15825, just for a different format, right? Just wondering if it's worth rolling that change in too; no worries if not.

vyasr avatar May 23 '24 02:05 vyasr

/ok to test

vyasr avatar May 23 '24 02:05 vyasr

Seems closely related to #15825, just for a different format, right? Just wondering if it's worth rolling that change in too; no worries if not.

Yeah, I noticed that issue after submitting this. I can take a look tomorrow.

etseidl avatar May 23 '24 03:05 etseidl

Wow, it seems like there are rooms for Parquet reader and ORC reader/writer too :+1:

ttnghia avatar May 23 '24 05:05 ttnghia

Thanks for the effort, as always @etseidl. I quickly went over the long-overdue changes now but will properly review them tomorrow or Friday. I was wondering if we need to update corresponding python bindings in python.pxd and perhaps python.pyx as well?

mhaseeb123 avatar May 23 '24 06:05 mhaseeb123

I was wondering if we need to update corresponding python bindings in python.pxd and perhaps python.pyx as well?

@mhaseeb123 I was worried about that when I started (due to an offline discussion with @vuule), but since the API didn't change, the cython bindings didn't have to change at all 🤯. It all just worked ("worked" == "all tests passed"...just ignore all those CI failures 🤣).

etseidl avatar May 23 '24 07:05 etseidl

Wow, it seems like there are rooms for Parquet reader and ORC reader/writer too 👍

Oh dear 😂

etseidl avatar May 23 '24 07:05 etseidl

I was wondering if we need to update corresponding python bindings in python.pxd and perhaps python.pyx as well?

@mhaseeb123 I was worried about that when I started (due to an offline discussion with @vuule), but since the API didn't change, the cython bindings didn't have to change at all 🤯. It all just worked ("worked" == "all tests passed"...just ignore all those CI failures 🤣).

While this is true, in the future if you make changes to add to the API of the base class you'll have a similar issue to what this PR is trying to address: you'll have to add those methods to both the chunked and non-chunked versions in Cython because they don't have the concept of the inheritance structure. If it's not too much trouble I would recommend making the changes in this PR to save yourself the trouble. Here's an example of how that inheritance looks in Cython.

To give a bit of context, the way that exposing C++ classes to Cython works is that pxd files are basically a promise to Cython that if it generates C++ code calling a certain function, that code will exist. So if you add something to a pxd file but never call that function in any pyx files, the generated C++ code will remain unchanged.

vyasr avatar May 23 '24 13:05 vyasr

While this is true, in the future if you make changes to add to the API of the base class you'll have a similar issue to what this PR is trying to address: you'll have to add those methods to both the chunked and non-chunked versions in Cython because they don't have the concept of the inheritance structure. If it's not too much trouble I would recommend making the changes in this PR to save yourself the trouble.

Drat, I thought I'd get away with no python changes :sob:. Thanks for pointing this out.

etseidl avatar May 23 '24 15:05 etseidl

Wow, it seems like there are rooms for Parquet reader and ORC reader/writer too 👍

Oh dear 😂

Yeah! I hope I can count on you to refactor one of these @ttnghia :)

vuule avatar May 23 '24 16:05 vuule

@vyasr I've made the python changes, please check them out. I hope I did the CRTP in cython correctly. Also, I just guessed at the formatting since pre-commit would only flag problems but not fix them. It seems happy now, but that might be a low bar.

etseidl avatar May 23 '24 18:05 etseidl

/ok to test

vuule avatar May 23 '24 20:05 vuule

/ok to test

vyasr avatar May 23 '24 23:05 vyasr

/ok to test

vyasr avatar May 23 '24 23:05 vyasr

Sorry for the bit of churn there, had to convince GH that everything was up to date as I thought it was after all the merges of upstream branches.

vyasr avatar May 23 '24 23:05 vyasr

I've never seen CRTP in Cython before. I'm glad it seemingly works fine! Approving, pending seeing builds and tests all pass; I'm trusting them to verify that Cython won't crash here unexpectedly.

@vyasr I swear I built this before...but just did a clean build that failed. Not entirely sure what the problem is, but redefining the child methods to return parquet_writer_options_builder_base.BuilderT& compiles and tests pass, but it makes me nervous. The one thing that is different from C++ to python is I omit the second template param to parquet_writer_options_builder_base since it's only needed for the return type of build(), which in cython I keep in the child classes. I'll test and see if that makes a difference. I'm at a loss as to why the code as is fails with 'parquet_writer_options_builder' is not a type identifier.

Edit: I cobbled together a forward declaration and a ctypedef and that too compiled. I'm still nervous though.

Edit 2: I'm dumb. The typedef is wrong, but near as I can tell, the two methods in parquet_writer_options_builder that were failing are never called, so CI passed. I'll try again tomorrow.

etseidl avatar May 23 '24 23:05 etseidl

/ok to test

hyperbolic2346 avatar May 24 '24 01:05 hyperbolic2346

I've pushed the troublesome methods back into the parent class for now and left a FIXME. The two methods in question are never actually called, so perhaps they can just be removed. It would be good to run to ground why having the extra setters in the child class where they belong won't compile. @vyasr

BTW, I did a test where I modified the code where the builders are used and added the two setters to the chain. Neither of the earlier hacks (ctypedef or using <parent>.BuilderT as the return type) actually worked. The current code does, but if these methods are called from a chunked_parquet_writer_options_builder an error is thrown at compile time, so at least non-functioning code won't make it into a release (assuming users cannot access the builders...maybe a bad assumption?).

etseidl avatar May 24 '24 16:05 etseidl

I've tried to reproduce the cython error by modifying the CRTP test from the cython repo, and have gotten as far as:

cdef extern from "curiously_recurring_template_pattern_GH1458_suport.h" namespace "foo" nogil:
    cdef cppclass shape_t:
        square_t()
        void set_x(int)

    cdef cppclass square_t(shape_t):
        square_t()

    cdef cppclass cube_t(shape_t):
        cube_t()
        @staticmethod
        Cube builder()
        @staticmethod
        Cube builder(int)

    cdef cppclass Base[T, Derived]:
        Base()
        T& build()
        int calculate()
        Derived& chain1()
        ...
        Derived& chain13()

    cdef cppclass Square(Base[square_t, Square]):
        Square()

    cdef cppclass Cube(
            Base[cube_t,
                 Cube]):
        Cube()
        Cube(int)
        Cube& chain20(int)

def test_derived(int x):
    """
    >>> test_derived(5)
    (8, 125)
    """
    cube_int = cube_t.builder()
    cube_int5 = cube_t.builder(x)
    cube_int.chain1().chain2().chain20(10).build().set_x(2)
    return (cube_int.calculate(), cube_int5.calculate())

and this compiles and passes. chain20() is the analog of the function that fails to compile. Haven't yet tried with a pxd file.

So the builder pattern w/ CRTP should be working. Maybe different eyes will see what I'm missing :frowning:.

etseidl avatar May 24 '24 20:05 etseidl

Yikes I'm sorry @etseidl I meant to respond earlier and then completely forgot about it. I appreciate how hard you've tried to get the Cython working! I am (unfortunately) not that surprised that it's proving tricky. If the example from the Cython test suite is working for you and seems that similar, my next guess would be some sort of circular dependency issue in our Cython due to how many things we're importing? It's probably not the best use of your time to try and track this down though, so I'm fine with you pushing the code back up to the base class in order to get this merged. Once this PR is merged I can test out the Cython in a follow-up and see what I come up with. Thank you for getting it this far!

vyasr avatar May 30 '24 22:05 vyasr

/ok to test

vuule avatar Jun 04 '24 19:06 vuule

/ok to test

hyperbolic2346 avatar Jun 07 '24 23:06 hyperbolic2346

/merge

vuule avatar Jun 10 '24 05:06 vuule

Yikes I'm sorry @etseidl I meant to respond earlier and then completely forgot about it. I appreciate how hard you've tried to get the Cython working! I am (unfortunately) not that surprised that it's proving tricky. If the example from the Cython test suite is working for you and seems that similar, my next guess would be some sort of circular dependency issue in our Cython due to how many things we're importing? It's probably not the best use of your time to try and track this down though, so I'm fine with you pushing the code back up to the base class in order to get this merged. Once this PR is merged I can test out the Cython in a follow-up and see what I come up with. Thank you for getting it this far!

See https://github.com/rapidsai/cudf/pull/15978

vyasr avatar Jun 11 '24 18:06 vyasr