[BUG]: Incorrect result in non-coalesce join with experimental streaming executor and multiple partitions
Describe the bug
The test python/cudf_polars/tests/test_join.py::test_non_coalesce_join[left-nulls_not_equal-join_expr0]
fails when using a small blocksize / multiple partitions.
Steps/Code to reproduce bug
Here's a simplified example:
import polars as pl
from cudf_polars.testing.asserts import assert_gpu_result_equal
left = pl.LazyFrame(
    {
        "a": [1, 2, 3, 1, None],
        "b": [1, 2, 3, 4, 5],
        "c": [2, 3, 4, 5, 6],
    }
)
right = pl.LazyFrame(
    {
        "a": [1, 4, 3, 7, None, None, 1],
        "c": [2, 3, 4, 5, 6, 7, 8],
        "d": [6, None, 7, 8, -1, 2, 4],
    }
)
q = left.join(right, on=pl.col("a"), how="inner", nulls_equal=False, coalesce=False)
assert_gpu_result_equal(
    q,
    engine=pl.GPUEngine(executor="streaming", executor_options={"max_rows_per_partition": 3}),
)
which fails with:
AssertionError: DataFrames are different (value mismatch for column 'a')
[left]: [1, 1, 3, 1, 1]
[right]: [1, 3, 1, 1, 1]
Expected behavior
Match polars / no error.
Oh, this may just be a row-order issue. It passes with check_row_order=False. We need to confirm whether that's expected (and if so, document that behavior).
Newer polars versions don't provide any ordering guarantees on left joins (and maybe we only preserve order for backwards compat). I think this is just a "row order is not guaranteed and that's fine" problem.
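For reference, here's the repro compared without row-order sensitivity. This is only a sketch, assuming assert_gpu_result_equal forwards check_row_order through to polars' assert_frame_equal (the passing run mentioned above suggests it does):

import polars as pl
from cudf_polars.testing.asserts import assert_gpu_result_equal

# Same query as the repro above, but the comparison ignores row order.
# Assumption: assert_gpu_result_equal passes check_row_order through to
# polars.testing.assert_frame_equal.
left = pl.LazyFrame({"a": [1, 2, 3, 1, None], "b": [1, 2, 3, 4, 5], "c": [2, 3, 4, 5, 6]})
right = pl.LazyFrame({"a": [1, 4, 3, 7, None, None, 1], "c": [2, 3, 4, 5, 6, 7, 8], "d": [6, None, 7, 8, -1, 2, 4]})
q = left.join(right, on=pl.col("a"), how="inner", nulls_equal=False, coalesce=False)
assert_gpu_result_equal(
    q,
    engine=pl.GPUEngine(executor="streaming", executor_options={"max_rows_per_partition": 3}),
    check_row_order=False,  # compare the set of rows, not their order
)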
I missed one set of errors previously that might be a distinct issue. https://github.com/rapidsai/cudf/blob/4466e55ba153ef13fa2327c3f0e00447aeb09b97/python/cudf_polars/tests/test_join.py#L91-L97 is a set of tests that do a join followed by a slice. It makes sense that those fail without a maintain_order (which we don't support) or a sort.
But I was surprised by the error for pytest --executor=streaming --blocksize-mode=small python/cudf_polars/tests/test_join.py::test_left_join_with_slice[nulls_equal-zlice3]:
AssertionError: DataFrames are different (number of rows does not match)
[left]: 2
[right]: 4
I would expect the same number of rows, but potentially different ones depending on which were selected. Maybe we're slicing each partition and then concatenating the results? Here's an isolated example:
import polars as pl
from cudf_polars.testing.asserts import assert_gpu_result_equal
left = pl.LazyFrame(
    {
        "a": [1, 2, 3, 1, None],
        "b": [1, 2, 3, 4, 5],
        "c": [2, 3, 4, 5, 6],
    }
)
right = pl.LazyFrame(
    {
        "a": [1, 4, 3, 7, None, None, 1],
        "c": [2, 3, 4, 5, 6, 7, 8],
        "d": [6, None, 7, 8, -1, 2, 4],
    }
)
q = left.join(right, on="a", how="inner").slice(0, 2)
engine = pl.GPUEngine(
    executor="streaming",
    raise_on_fail=True,
    executor_options={"max_rows_per_partition": 3},
)
assert_gpu_result_equal(q, engine=engine)
And the failure:
Traceback (most recent call last):
File "/home/nfs/toaugspurger/gh/rapidsai/cudf/debug.py", line 21, in <module>
assert_gpu_result_equal(q, engine=engine)
File "/home/nfs/toaugspurger/gh/rapidsai/cudf/python/cudf_polars/cudf_polars/testing/asserts.py", line 134, in assert_gpu_result_equal
assert_frame_equal(
File "/raid/toaugspurger/envs/rapidsai/cudf/25.08/lib/python3.12/site-packages/polars/_utils/deprecation.py", line 128, in wrapper
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/raid/toaugspurger/envs/rapidsai/cudf/25.08/lib/python3.12/site-packages/polars/testing/asserts/frame.py", line 101, in assert_frame_equal
raise_assertion_error(
File "/raid/toaugspurger/envs/rapidsai/cudf/25.08/lib/python3.12/site-packages/polars/testing/asserts/utils.py", line 12, in raise_assertion_error
raise AssertionError(msg) from cause
AssertionError: DataFrames are different (number of rows does not match)
[left]: 2
[right]: 5
Aha, good catch. It makes sense that we can't expect the same rows as polars for a slice after a Join, since row order isn't guaranteed and the slice may select different rows. However, we definitely shouldn't get the wrong row count. It looks like we need to register lower_ir_node logic for Slice. We also need to update the lower_ir_node logic for Join to separate the fused slice operation into a distinct IR node (and we probably need to be careful with every IR node that can carry a "fused" slice operation).
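To make the suspected failure mode concrete, here's a minimal sketch (plain polars, no cudf_polars internals) of why slicing each partition and then concatenating inflates the row count relative to slicing the concatenated result:

import polars as pl

# Pretend the join output arrived in two partitions.
part1 = pl.DataFrame({"a": [1, 1, 3]})
part2 = pl.DataFrame({"a": [1, 1]})

# Correct semantics: concatenate the partitions, then take slice(0, 2).
global_slice = pl.concat([part1, part2]).slice(0, 2)

# Suspected buggy lowering: apply slice(0, 2) to each partition, then concatenate.
per_partition = pl.concat([part1.slice(0, 2), part2.slice(0, 2)])

print(global_slice.height)   # 2
print(per_partition.height)  # 4 -- too many rows, the kind of mismatch seen above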
Starting to work on a fix for the row-count problem: https://github.com/rapidsai/cudf/pull/19187
Resolved in #19187