cudf Configurable blocksize mode for streaming executor in unit tests

This adds a new test option --blocksize-mode to the test runner, which lets us easily exercise the multi-partition code paths of the streaming executor.

When --blocksize-mode=small and --executor="streaming" the default behavior will be to create a GPUEngine with max_rows_per_partition=5 (which controls the partition size from in-memory dataframes) and target_partition_size=10 (which controls the partition size from parquet sources).

Towards https://github.com/rapidsai/cudf/issues/18928

Jun 12 '25 19:06 TomAugspurger

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Jun 12 '25 19:06 copy-pr-bot[bot]

The bad news: 196 tests fail with this. I'm starting to work through those in the issues linked from https://github.com/rapidsai/cudf/issues/18928. Fixes for those will probably go in separate PRs.

But I wanted to put this up a little early to discuss the general approach. For now, I've added a separate run of the full cudf_polars/tests suite with --blocksize-mode="small". This is probably sufficient to get the coverage we want, but is a bit of a blunt tool.

In particular, I don't the interaction between that second test run and the tests that want specific control over the GPU engine / configuration. Right now those tests will be run twice with the exact same configuration (--blocksize-mode, like --scheduler only has an effect when the engine passed to assert_gpu_result_equal is None.

As an alternative, we could make engine a required argument of assert_gpu_result_equal and force the test to think a bit about what it wants. The majority of our tests don't care, and so can just use an engine fixture, which we can parametrize over the blocksize-modes to get good coverage. That's a bunch of busywork to add an engine fixture to each test that doesn't already have it and will result in a bigger diff, but is a bit cleaner. Anyone have thoughts on whether that's worth doing?

Jun 12 '25 19:06 TomAugspurger

Quick status update here: We have two more PRs in the works that fix the last two correctness issues:

https://github.com/rapidsai/cudf/pull/19196
https://github.com/rapidsai/cudf/pull/19187

And then I'll push an update here that slightly modifies some of the tests (e.g. changing some assert parameters like check_order=False, skipping some tests) where the chunked streaming executor is expected to behave differently.

Jun 18 '25 22:06 TomAugspurger

I noticed that there's a assert_sink_result_equal, so I've added the same blocksize_mode argument in https://github.com/rapidsai/cudf/pull/19146/commits/38fb2cbdb7043deea73f35d72b3663d38266bbe5. That required adjusting the assert function a bit (scan_csv won't automatically scan a folder named foo.csv, you need a globstring in there somewhere).

I had to skip the test that uses a multi-line newline character \n\n, since we end up with an extra row of nulls per partition.

Jun 27 '25 12:06 TomAugspurger

I noticed that there's a assert_sink_result_equal, so I've added the same blocksize_mode argument in 38fb2cb. That required adjusting the assert function a bit (scan_csv won't automatically scan a folder named foo.csv, you need a globstring in there somewhere).

I had to skip the test that uses a multi-line newline character \n\n, since we end up with an extra row of nulls per partition.

Cool I think this PR is ready, just waiting on green CI.

Jun 27 '25 12:06 Matt711

/merge

Jun 30 '25 16:06 TomAugspurger