[BUG]: NA values incorrectly filled with `False` in String ops with streaming executor and multiple partitions
Describe the bug
In string ops like .str.starts_with we incorrectly fill missing values with False instead of propagating the NA when using cudf-polars' streaming executor with multiple partitions.
Steps/Code to reproduce bug
import polars as pl
ldf = pl.LazyFrame({"a": ["a", 'b', None, 'b']})
q = ldf.select(pl.col("a").str.starts_with("a"))
q.collect(engine=pl.GPUEngine(executor="streaming", executor_options={"max_rows_per_partition": 2}))
outputs
shape: (4, 1)
┌───────┐
│ a │
│ --- │
│ bool │
╞═══════╡
│ true │
│ false │
│ false │
│ false │
└───────┘
Expected behavior
In [4]: q.collect()
Out[4]:
shape: (4, 1)
┌───────┐
│ a │
│ --- │
│ bool │
╞═══════╡
│ true │
│ false │
│ null │
│ false │
└───────┘
Additional context
Note that the dtype seems to be determined by the first partition. If we have missing values in the first / all partitions then we're fine.
q = pl.LazyFrame({"a": ["a", None, "a", None]}).select(pl.col("a").str.starts_with("a"))
q.collect(engine=pl.GPUEngine(executor="streaming", executor_options={"max_rows_per_partition": 2}))
shape: (4, 1)
┌──────┐
│ a │
│ --- │
│ bool │
╞══════╡
│ true │
│ null │
│ true │
│ null │
└──────┘
Here are some failing tests.
diff --git a/python/cudf_polars/tests/test_scan.py b/python/cudf_polars/tests/test_scan.py
index 922321830b..d27f8f96dd 100644
--- a/python/cudf_polars/tests/test_scan.py
+++ b/python/cudf_polars/tests/test_scan.py
@@ -462,3 +462,21 @@ def test_scan_csv_without_header_and_new_column_names_raises(df, tmp_path):
make_partitioned_source(df, path, "csv", write_kwargs={"include_header": False})
q = pl.scan_csv(path, has_header=False)
assert_ir_translation_raises(q, NotImplementedError)
+
+
+
+def test_string_na_na():
+ a = pl.LazyFrame({"a": ["a", "a", None, "a", "b", None, "b", "b"]})
+ engine = pl.GPUEngine(
+ executor="streaming", executor_options={"max_rows_per_partition": 4}
+ )
+ assert_gpu_result_equal(a, engine=engine)
+
+
+def test_string_ok_na():
+ a = pl.LazyFrame({"a": ["a", "a", "a", "a", "b", None, "b", "b"]})
+ engine = pl.GPUEngine(
+ executor="streaming", executor_options={"max_rows_per_partition": 4}
+ )
+ assert_gpu_result_equal(a, engine=engine)
+
The failing bits
E AssertionError: DataFrames are different (value mismatch for column 'a')
E [left]: ['a', 'a', None, 'a', 'b', None, 'b', 'b']
E [right]: ['a', 'a', None, 'a', 'b', '', None, 'b']
...
E AssertionError: DataFrames are different (value mismatch for column 'a')
E [left]: ['a', 'a', 'a', 'a', 'b', None, 'b', 'b']
E [right]: ['a', 'a', 'a', 'a', 'b', '', 'b', 'b']
I kind of expected a failure on the second one (where the first partition is all valid and the second partition is has some NA). The first failure is surprising, and possibly unrelated...
I don't see a failure for either combination with integers rather than strings. So how polars / pylibcudf represent strings might matter...
Crystal ball: the ingest of the sliced dataframe with strings has a bug. If you set TO_ARROW_COMPAT_LEVEL from cudf_polars/utils/dtypes.py to pl.CompatLevel.oldest() do things work?
That still fails:
================================================================================================================================================================= FAILURES ==================================================================================================================================================================
_____________________________________________________________________________________________________________________________________________________________ test_string_na_na _____________________________________________________________________________________________________________________________________________________________
E AssertionError: Series are different (exact value mismatch)
[left]: ['a', 'a', None, 'a', 'b', None, 'b', 'b']
[right]: ['a', 'a', None, 'a', 'b', '', None, 'b']
All traceback entries are hidden. Pass `--full-trace` to see hidden and internal frames.
The above exception was the direct cause of the following exception:
python/cudf_polars/tests/test_scan.py:473: in test_string_na_na
assert_gpu_result_equal(a, engine=engine)
python/cudf_polars/cudf_polars/testing/asserts.py:134: in assert_gpu_result_equal
assert_frame_equal(
/raid/toaugspurger/envs/rapidsai/cudf/25.08/lib/python3.12/site-packages/polars/_utils/deprecation.py:128: in wrapper
return function(*args, **kwargs)
E AssertionError: DataFrames are different (value mismatch for column 'a')
E [left]: ['a', 'a', None, 'a', 'b', None, 'b', 'b']
E [right]: ['a', 'a', None, 'a', 'b', '', None, 'b']
_____________________________________________________________________________________________________________________________________________________________ test_string_ok_na _____________________________________________________________________________________________________________________________________________________________
E AssertionError: Series are different (exact value mismatch)
[left]: ['a', 'a', 'a', 'a', 'b', None, 'b', 'b']
[right]: ['a', 'a', 'a', 'a', 'b', '', 'b', 'b']
All traceback entries are hidden. Pass `--full-trace` to see hidden and internal frames.
The above exception was the direct cause of the following exception:
python/cudf_polars/tests/test_scan.py:481: in test_string_ok_na
assert_gpu_result_equal(a, engine=engine)
python/cudf_polars/cudf_polars/testing/asserts.py:134: in assert_gpu_result_equal
assert_frame_equal(
/raid/toaugspurger/envs/rapidsai/cudf/25.08/lib/python3.12/site-packages/polars/_utils/deprecation.py:128: in wrapper
return function(*args, **kwargs)
E AssertionError: DataFrames are different (value mismatch for column 'a')
E [left]: ['a', 'a', 'a', 'a', 'b', None, 'b', 'b']
E [right]: ['a', 'a', 'a', 'a', 'b', '', 'b', 'b']
Here are the raw buffers for expect and got with
(Pdb) pp expect.to_arrow().columns[0].chunks[0].buffers()[0].to_pybytes()
b'\xdb'
(Pdb) pp got.to_arrow().columns[0].chunks[0].buffers()[0].to_pybytes()
b'\xbb'
(Pdb) pp expect.to_arrow().columns[0].chunks[0].buffers()[1].to_pybytes()
(b'\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00'
b'\x02\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00'
b'\x03\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00'
b'\x04\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00'
b'\x06\x00\x00\x00\x00\x00\x00\x00')
(Pdb) pp got.to_arrow().columns[0].chunks[0].buffers()[1].to_pybytes()
(b'\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00'
b'\x02\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00'
b'\x03\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00'
b'\x04\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00'
b'\x05\x00\x00\x00\x00\x00\x00\x00')
(Pdb) pp expect.to_arrow().columns[0].chunks[0].buffers()[2].to_pybytes()
b'aaabbb'
(Pdb) pp got.to_arrow().columns[0].chunks[0].buffers()[2].to_pybytes()
b'aaabb'
They do differ from eachother.
And here are the values using compat_level=1 for the to_arrow call
here in pdb (but note the value in cudf_polars/utils/dtypes.py was set
to what's on main, which for me is pl.CompatLevel.newest()):
(Pdb) pp expect.to_arrow(compat_level=1).columns[0].chunks[0].buffers()[0].to_pybytes()
b'\xdb'
(Pdb) pp got.to_arrow(compat_level=1).columns[0].chunks[0].buffers()[0].to_pybytes()
b'\xbb'
(Pdb) pp expect.to_arrow(compat_level=1).columns[0].chunks[0].buffers()[1].to_pybytes()
(b'\x01\x00\x00\x00a\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
b'\x01\x00\x00\x00a\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
b'\x01\x00\x00\x00a\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
b'\x01\x00\x00\x00b\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
b'\x01\x00\x00\x00b\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
b'\x01\x00\x00\x00b\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')
(Pdb) pp got.to_arrow(compat_level=1).columns[0].chunks[0].buffers()[1].to_pybytes()
(b'\x01\x00\x00\x00a\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
b'\x01\x00\x00\x00a\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
b'\x01\x00\x00\x00a\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
b'\x01\x00\x00\x00b\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
b'\x01\x00\x00\x00b\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')
Simplifying the reproducer a bit:
import cudf_polars.containers
import polars as pl
import polars.testing.asserts
df = pl.DataFrame({"a": ["a", None]})
# this is buggy
result = cudf_polars.containers.DataFrame.from_polars(df.tail(1)).to_polars()
expected = df.tail(1)
polars.testing.asserts.assert_frame_equal(result, expected)
fails with
File /raid/toaugspurger/envs/rapidsai/cudf/25.08/lib/python3.12/site-packages/polars/testing/asserts/utils.py:12, in raise_assertion_error(objects, detail, left, right, cause)
10 __tracebackhide__ = True
11 msg = f"{objects} are different ({detail})\n[left]: {left}\n[right]: {right}"
---> 12 raise AssertionError(msg) from cause
AssertionError: DataFrames are different (value mismatch for column 'a')
[left]: ['']
[right]: [None]
You're probably right about this being related to the compat level though. If we manually do the conversion and print what we get for various compat levels:
import cudf_polars.containers
import polars as pl
import polars.testing.asserts
import pylibcudf as plc
df = pl.DataFrame({"a": ["a", None]})
# this is buggy
result = cudf_polars.containers.DataFrame.from_polars(df.tail(1)).to_polars()
expected = df.tail(1)
table = plc.Table.from_arrow(df.tail(1)) # this is waht DataFrame.from_polars does
print("no to_arrow", plc.interop.to_arrow(table).columns[0][0].as_py()) # ""
table = plc.Table.from_arrow(df.tail(1).to_arrow())
print("to_arrow-default", plc.interop.to_arrow(table).columns[0][0].as_py()) # None
table = plc.Table.from_arrow(df.tail(1).to_arrow(compat_level=0))
print("to_arrow-compat_level=0", plc.interop.to_arrow(table).columns[0][0].as_py()) # None
table = plc.Table.from_arrow(df.tail(1).to_arrow(compat_level=1))
print("to_arrow-compat_level=1", plc.interop.to_arrow(table).columns[0][0].as_py()) # ""
that prints
no to_arrow
to_arrow-default None
to_arrow-compat_level=0 None
to_arrow-compat_level=1
Note that there are empty strings for the first and last case.
The fact that plc.Table.from_arrow(dataframe) and plc.Table.from_arrow(dataframe.to_arrow()) are different is maybe a bit concerning. IIUC, plc.Table.from_arrow(polars.DataFrame) will use polars' __arrow_c_stream__ method.
Phew, this probably does end up in pylibcudf's handling of validity maps on sliced columns like Lawrence guessed. https://github.com/rapidsai/cudf/issues/19159 is a clean issue with a reproducer. I'm not yet sure if that's the root cause of this original error, but if I had to guess I'd say they're the same. I'll check once that's fixed.
@TomAugspurger Can you retry this with the fix in https://github.com/rapidsai/cudf/pull/19174 ?
Thanks @davidwendt, I can confirm that https://github.com/rapidsai/cudf/pull/19174 fixes both https://github.com/rapidsai/cudf/issues/19159 and this issue.
@TomAugspurger Can this be closed now that #19174 is merged?
Yes, thanks. We don't yet have explicit test coverage for this in cudf-polars, but that'll come as part of https://github.com/rapidsai/cudf/pull/19146 so I think we're good here.