Vukasin Milovanovic

Results 139 comments of Vukasin Milovanovic

> This seems to be larger than just a benchmark. I think 22.10 needs to be merged so https://github.com/rapidsai/cudf/pull/11652 changes are excluded. @upsj do you want to keep this targeted...

Is the main difference PyORC's requirement to pass in a schema? Would it be possible to try this out in the fuzz tests to verify that pyarrow is robust?

I definitely like the suggestion, pyarrow API looks very clean and... comprehensive (more so than ours 😬).

High level design question (realized this after seeing the benchmark code): can we separate compression type from the data source (file, host, device)? With the current implementation we can only...
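A minimal sketch of what that separation could look like, assuming the benchmark is parameterized over two independent axes. The names `DataSource`, `Compression`, and `benchmark_configs` are hypothetical and are not taken from the cudf benchmark code; the compression values listed are only placeholders.

```python
from enum import Enum
from itertools import product


class DataSource(Enum):
    # The three source kinds mentioned in the comment.
    FILE = "file"
    HOST = "host"
    DEVICE = "device"


class Compression(Enum):
    # Placeholder values; the real set of codecs is an assumption.
    NONE = "none"
    SNAPPY = "snappy"


def benchmark_configs():
    """Return every (source, compression) pair.

    Because the two axes are independent, any compression type
    can be combined with any data source.
    """
    return list(product(DataSource, Compression))


if __name__ == "__main__":
    for source, compression in benchmark_configs():
        print(f"{source.value:>6} + {compression.value}")
```

Keeping the axes as separate types means that adding a new source or a new codec extends the benchmark matrix without touching the other axis.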

Additional info related to the pyorc part of the issue: Spark is able to read ORC string column statistics, and uses them for predicate-based filtering.

Do you mean that we should add an axis that only has the default value?

Thank you for digging into this, @davidwendt. This one is definitely for me to fix :) I should be able to work on this some time next week.

Opened https://github.com/rapidsai/cudf/pull/13011 with a fix. Targeted the PR to 23.06 as we are already in burndown, but could be convinced to merge into 23.04 :)

There's also the Python side to be changed. The Python writer has a bool `use_dictionary`, which translates to either ALWAYS or NEVER. We probably need to change this part of the...
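To illustrate the limitation, here is a rough sketch of how a bool flag collapses to two policy values, and one way a writer option could accept a policy directly while still honoring the old bool. This is not the cudf implementation: `DictionaryPolicy`, `policy_from_bool`, and `resolve_policy` are hypothetical names, and the `ADAPTIVE` member is an assumed third option included only to show what a bool cannot express.

```python
from enum import Enum
from typing import Union


class DictionaryPolicy(Enum):
    NEVER = 0
    ADAPTIVE = 1  # assumed extra value; not reachable from a bool
    ALWAYS = 2


def policy_from_bool(use_dictionary: bool) -> DictionaryPolicy:
    """Mapping described in the comment: a bool yields ALWAYS or NEVER."""
    return DictionaryPolicy.ALWAYS if use_dictionary else DictionaryPolicy.NEVER


def resolve_policy(value: Union[bool, str, DictionaryPolicy]) -> DictionaryPolicy:
    """Accept the legacy bool, a policy name, or a policy value."""
    if isinstance(value, DictionaryPolicy):
        return value
    if isinstance(value, bool):
        return policy_from_bool(value)
    if isinstance(value, str):
        try:
            return DictionaryPolicy[value.upper()]
        except KeyError:
            raise ValueError(f"unknown dictionary policy: {value!r}") from None
    raise TypeError(f"unsupported type for dictionary policy: {type(value).__name__}")
```

With this shape, existing callers passing `True` or `False` keep their current behavior, while new callers can request a value that the bool could never select.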