Configurable blocksize mode for streaming executor in unit tests
This adds a new test option --blocksize-mode to the test runner, which lets us easily exercise the multi-partition code paths of the streaming executor.
When --blocksize-mode=small and --executor="streaming" the default behavior will be to create a GPUEngine with max_rows_per_partition=5 (which controls the partition size from in-memory dataframes) and target_partition_size=10 (which controls the partition size from parquet sources).
Towards https://github.com/rapidsai/cudf/issues/18928
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Contributors can view more details about this message here.
The bad news: 196 tests fail with this. I'm starting to work through those in the issues linked from https://github.com/rapidsai/cudf/issues/18928. Fixes for those will probably go in separate PRs.
But I wanted to put this up a little early to discuss the general approach. For now, I've added a separate run of the full cudf_polars/tests suite with --blocksize-mode="small". This is probably sufficient to get the coverage we want, but is a bit of a blunt tool.
In particular, I don't the interaction between that second test run and the tests that want specific control over the GPU engine / configuration. Right now those tests will be run twice with the exact same configuration (--blocksize-mode, like --scheduler only has an effect when the engine passed to assert_gpu_result_equal is None.
As an alternative, we could make engine a required argument of assert_gpu_result_equal and force the test to think a bit about what it wants. The majority of our tests don't care, and so can just use an engine fixture, which we can parametrize over the blocksize-modes to get good coverage. That's a bunch of busywork to add an engine fixture to each test that doesn't already have it and will result in a bigger diff, but is a bit cleaner. Anyone have thoughts on whether that's worth doing?
Quick status update here: We have two more PRs in the works that fix the last two correctness issues:
- https://github.com/rapidsai/cudf/pull/19196
- https://github.com/rapidsai/cudf/pull/19187
And then I'll push an update here that slightly modifies some of the tests (e.g. changing some assert parameters like check_order=False, skipping some tests) where the chunked streaming executor is expected to behave differently.
I noticed that there's a assert_sink_result_equal, so I've added the same blocksize_mode argument in https://github.com/rapidsai/cudf/pull/19146/commits/38fb2cbdb7043deea73f35d72b3663d38266bbe5. That required adjusting the assert function a bit (scan_csv won't automatically scan a folder named foo.csv, you need a globstring in there somewhere).
I had to skip the test that uses a multi-line newline character \n\n, since we end up with an extra row of nulls per partition.
I noticed that there's a
assert_sink_result_equal, so I've added the sameblocksize_modeargument in 38fb2cb. That required adjusting the assert function a bit (scan_csvwon't automatically scan a folder namedfoo.csv, you need a globstring in there somewhere).I had to skip the test that uses a multi-line newline character
\n\n, since we end up with an extra row of nulls per partition.
Cool I think this PR is ready, just waiting on green CI.
/merge