Vukasin Milovanovic comments

Results: 139 comments of Vukasin Milovanovic

Add BGZIP multibyte_split benchmark

> This seems to be larger than just a benchmark. I think 22.10 needs to be merged so https://github.com/rapidsai/cudf/pull/11652 changes are excluded. @upsj do you want to keep this targeted...

[FEA] Remove usages of `pyorc` where not necessary

Is the main difference PyORC's requirement to pass in a schema? Would it be possible to try this out in the fuzz tests to verify that pyarrow is robust?

[FEA] Remove usages of `pyorc` where not necessary

I definitely like the suggestion, pyarrow API looks very clean and... comprehensive (more so than ours 😬).

Add BGZIP `data_chunk_reader`

High level design question (realized this after seeing the benchmark code): can we separate compression type from the data source (file, host, device)? With the current implementation we can only...
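As an illustration of the separation asked about above, here is a minimal Python sketch (not libcudf code) in which the data source and the compression type are chosen independently. All names (`host_source`, `file_source`, `gzip_decompressor`, `read_chunks`) are hypothetical, and plain gzip from the standard library stands in for BGZIP.

```python
import gzip
import io
from typing import BinaryIO, Callable, Iterator

# A decompressor wraps any binary stream and returns a decompressed stream.
Decompressor = Callable[[BinaryIO], BinaryIO]


def host_source(data: bytes) -> BinaryIO:
    """Hypothetical source: bytes already in host memory."""
    return io.BytesIO(data)


def file_source(path: str) -> BinaryIO:
    """Hypothetical source: a file on disk, opened in binary mode."""
    return open(path, "rb")


def no_compression(stream: BinaryIO) -> BinaryIO:
    """Pass the stream through unchanged."""
    return stream


def gzip_decompressor(stream: BinaryIO) -> BinaryIO:
    """Decompress gzip data (concatenated members are read in sequence)."""
    return gzip.GzipFile(fileobj=stream, mode="rb")


def read_chunks(
    source: BinaryIO,
    decompress: Decompressor = no_compression,
    chunk_size: int = 1 << 16,
) -> Iterator[bytes]:
    """Yield fixed-size chunks; source and compression are orthogonal choices."""
    stream = decompress(source)
    while chunk := stream.read(chunk_size):
        yield chunk
```

With this shape, any source can be paired with any compression type, for example `read_chunks(host_source(buf), gzip_decompressor)` or `read_chunks(file_source(path))`, without a dedicated reader per combination.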

[BUG] ORC string sum statistics are wrong

Additional info related to the pyorc part of the issue: Spark can read ORC string column statistics and uses them for predicate-based filtering.

[FEA] Expand ORC and Parquet benchmarks to cover different stripe/rowgroup sizes

Do you mean that we should add an axis that only has the default value?

[BUG] fix memory errors in cudf pytest

Thank you for digging into this, @davidwendt. This one is definitely for me to fix :) I should be able to work on this sometime next week.

[BUG] fix memory errors in cudf pytest

Opened https://github.com/rapidsai/cudf/pull/13011 with a fix. Targeted the PR to 23.06 as we are already in burndown, but could be convinced to merge into 23.04 :)

Change the default dictionary policy in Parquet writer from `ALWAYS` to `ADAPTIVE`

There's also the Python side to be changed. The Python writer has a bool `use_dictionary`, which translates to either `ALWAYS` or `NEVER`. We probably need to change this part of the...
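The bool-to-policy translation mentioned above could be sketched as follows. This is a hypothetical illustration, not the cudf implementation: the `DictionaryPolicy` enum and `resolve_dictionary_policy` function are invented names, and treating an unset flag as `ADAPTIVE` is an assumption about how a new default might be exposed.

```python
from enum import Enum
from typing import Optional


class DictionaryPolicy(Enum):
    # Hypothetical Python mirror of the three policy names discussed here.
    NEVER = 0
    ADAPTIVE = 1
    ALWAYS = 2


def resolve_dictionary_policy(
    use_dictionary: Optional[bool] = None,
) -> DictionaryPolicy:
    """Map the writer's bool flag to a policy.

    Assumption: None means "caller did not choose", which falls back to
    ADAPTIVE. An explicit True/False keeps the ALWAYS/NEVER behavior.
    """
    if use_dictionary is None:
        return DictionaryPolicy.ADAPTIVE
    return DictionaryPolicy.ALWAYS if use_dictionary else DictionaryPolicy.NEVER
```

Keeping explicit `True` and `False` mapped to `ALWAYS` and `NEVER` would preserve existing caller behavior, while only the unset case picks up the new default.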

Refactor joins for conditional semis and antis

/ok to test