cudf [BUG]: NA values incorrectly filled with `False` in String ops with streaming executor and multiple partitions

Describe the bug

With cudf-polars' streaming executor and multiple partitions, string ops like .str.starts_with incorrectly fill missing values with False instead of propagating the NA.

Steps/Code to reproduce bug

import polars as pl

ldf = pl.LazyFrame({"a": ["a", "b", None, "b"]})
q = ldf.select(pl.col("a").str.starts_with("a"))

q.collect(engine=pl.GPUEngine(executor="streaming", executor_options={"max_rows_per_partition": 2}))

outputs

shape: (4, 1)
┌───────┐
│ a     │
│ ---   │
│ bool  │
╞═══════╡
│ true  │
│ false │
│ false │
│ false │
└───────┘

Expected behavior

In [4]: q.collect()
Out[4]: 
shape: (4, 1)
┌───────┐
│ a     │
│ ---   │
│ bool  │
╞═══════╡
│ true  │
│ false │
│ null  │
│ false │
└───────┘

Additional context

Note that the dtype seems to be determined by the first partition. If we have missing values in the first / all partitions then we're fine.


q = pl.LazyFrame({"a": ["a", None, "a", None]}).select(pl.col("a").str.starts_with("a"))
q.collect(engine=pl.GPUEngine(executor="streaming", executor_options={"max_rows_per_partition": 2}))
shape: (4, 1)
┌──────┐
│ a    │
│ ---  │
│ bool │
╞══════╡
│ true │
│ null │
│ true │
│ null │
└──────┘

Jun 12 '25 20:06 TomAugspurger

Here are some failing tests.

diff --git a/python/cudf_polars/tests/test_scan.py b/python/cudf_polars/tests/test_scan.py
index 922321830b..d27f8f96dd 100644
--- a/python/cudf_polars/tests/test_scan.py
+++ b/python/cudf_polars/tests/test_scan.py
@@ -462,3 +462,21 @@ def test_scan_csv_without_header_and_new_column_names_raises(df, tmp_path):
     make_partitioned_source(df, path, "csv", write_kwargs={"include_header": False})
     q = pl.scan_csv(path, has_header=False)
     assert_ir_translation_raises(q, NotImplementedError)
+
+
+
+def test_string_na_na():
+    a = pl.LazyFrame({"a": ["a", "a", None, "a", "b", None, "b", "b"]})
+    engine = pl.GPUEngine(
+        executor="streaming", executor_options={"max_rows_per_partition": 4}
+    )
+    assert_gpu_result_equal(a, engine=engine)
+
+
+def test_string_ok_na():
+    a = pl.LazyFrame({"a": ["a", "a", "a", "a", "b", None, "b", "b"]})
+    engine = pl.GPUEngine(
+        executor="streaming", executor_options={"max_rows_per_partition": 4}
+    )
+    assert_gpu_result_equal(a, engine=engine)
+

The failing bits

E   AssertionError: DataFrames are different (value mismatch for column 'a')
E   [left]:  ['a', 'a', None, 'a', 'b', None, 'b', 'b']
E   [right]: ['a', 'a', None, 'a', 'b', '', None, 'b']
...
E   AssertionError: DataFrames are different (value mismatch for column 'a')
E   [left]:  ['a', 'a', 'a', 'a', 'b', None, 'b', 'b']
E   [right]: ['a', 'a', 'a', 'a', 'b', '', 'b', 'b']

I kind of expected a failure on the second one (where the first partition is all valid and the second partition has some NA). The first failure is surprising, and possibly unrelated...

I don't see a failure for either combination with integers rather than strings. So how polars / pylibcudf represent strings might matter...

Jun 13 '25 15:06 TomAugspurger

Crystal ball: the ingest of the sliced dataframe with strings has a bug. If you set TO_ARROW_COMPAT_LEVEL from cudf_polars/utils/dtypes.py to pl.CompatLevel.oldest() do things work?

Jun 13 '25 15:06 wence-

That still fails:

================================== FAILURES ===================================
______________________________ test_string_na_na ______________________________
E   AssertionError: Series are different (exact value mismatch)
    [left]:  ['a', 'a', None, 'a', 'b', None, 'b', 'b']
    [right]: ['a', 'a', None, 'a', 'b', '', None, 'b']
All traceback entries are hidden. Pass `--full-trace` to see hidden and internal frames.

The above exception was the direct cause of the following exception:
python/cudf_polars/tests/test_scan.py:473: in test_string_na_na
    assert_gpu_result_equal(a, engine=engine)
python/cudf_polars/cudf_polars/testing/asserts.py:134: in assert_gpu_result_equal
    assert_frame_equal(
/raid/toaugspurger/envs/rapidsai/cudf/25.08/lib/python3.12/site-packages/polars/_utils/deprecation.py:128: in wrapper
    return function(*args, **kwargs)
E   AssertionError: DataFrames are different (value mismatch for column 'a')
E   [left]:  ['a', 'a', None, 'a', 'b', None, 'b', 'b']
E   [right]: ['a', 'a', None, 'a', 'b', '', None, 'b']
______________________________ test_string_ok_na ______________________________
E   AssertionError: Series are different (exact value mismatch)
    [left]:  ['a', 'a', 'a', 'a', 'b', None, 'b', 'b']
    [right]: ['a', 'a', 'a', 'a', 'b', '', 'b', 'b']
All traceback entries are hidden. Pass `--full-trace` to see hidden and internal frames.

The above exception was the direct cause of the following exception:
python/cudf_polars/tests/test_scan.py:481: in test_string_ok_na
    assert_gpu_result_equal(a, engine=engine)
python/cudf_polars/cudf_polars/testing/asserts.py:134: in assert_gpu_result_equal
    assert_frame_equal(
/raid/toaugspurger/envs/rapidsai/cudf/25.08/lib/python3.12/site-packages/polars/_utils/deprecation.py:128: in wrapper
    return function(*args, **kwargs)
E   AssertionError: DataFrames are different (value mismatch for column 'a')
E   [left]:  ['a', 'a', 'a', 'a', 'b', None, 'b', 'b']
E   [right]: ['a', 'a', 'a', 'a', 'b', '', 'b', 'b']

Here are the raw buffers for expect and got (validity bitmap, offsets, and data, in that order):

(Pdb) pp expect.to_arrow().columns[0].chunks[0].buffers()[0].to_pybytes()
b'\xdb'
(Pdb) pp got.to_arrow().columns[0].chunks[0].buffers()[0].to_pybytes()
b'\xbb'
(Pdb) pp expect.to_arrow().columns[0].chunks[0].buffers()[1].to_pybytes()
(b'\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00'
 b'\x02\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00'
 b'\x03\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00'
 b'\x04\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00'
 b'\x06\x00\x00\x00\x00\x00\x00\x00')
(Pdb) pp got.to_arrow().columns[0].chunks[0].buffers()[1].to_pybytes()
(b'\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00'
 b'\x02\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00'
 b'\x03\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00'
 b'\x04\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00'
 b'\x05\x00\x00\x00\x00\x00\x00\x00')
(Pdb) pp expect.to_arrow().columns[0].chunks[0].buffers()[2].to_pybytes()
b'aaabbb'
(Pdb) pp got.to_arrow().columns[0].chunks[0].buffers()[2].to_pybytes()
b'aaabb'

They do differ from each other.
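To make the difference easier to read, here is a small stdlib-only sketch that decodes the validity bitmaps and offsets printed above. The helper names (`validity_bits`, `string_lengths`) are made up for illustration and are not part of polars, pyarrow, or pylibcudf. The offsets are rebuilt with `struct.pack` rather than pasted, but they are the same nine little-endian int64 values shown in the pdb output.

```python
import struct


def validity_bits(bitmap: bytes, length: int) -> list[bool]:
    """Arrow validity bitmaps are LSB-first: bit i covers row i (1 = valid)."""
    return [bool((bitmap[i // 8] >> (i % 8)) & 1) for i in range(length)]


def string_lengths(offsets: bytes) -> list[int]:
    """Decode little-endian int64 offsets into per-row byte lengths."""
    values = struct.unpack(f"<{len(offsets) // 8}q", offsets)
    return [end - start for start, end in zip(values, values[1:])]


# Validity buffers copied from the pdb session above.
expect_valid = validity_bits(b"\xdb", 8)
got_valid = validity_bits(b"\xbb", 8)

# Offsets buffers from the pdb session above, rebuilt from their int64 values.
expect_offsets = struct.pack("<9q", 0, 1, 2, 2, 3, 4, 4, 5, 6)
got_offsets = struct.pack("<9q", 0, 1, 2, 2, 3, 4, 4, 4, 5)

expect_nulls = [i for i, valid in enumerate(expect_valid) if not valid]
got_nulls = [i for i, valid in enumerate(got_valid) if not valid]

print("expect nulls:", expect_nulls)  # [2, 5]
print("got nulls:   ", got_nulls)     # [2, 6]
print("expect lengths:", string_lengths(expect_offsets))
print("got lengths:   ", string_lengths(got_offsets))
```

So in got, row 5 is marked valid with length 0 (hence the ''), and row 6 is marked null. That lines up with the [right] list in the assertion message.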

And here are the values using compat_level=1 for the to_arrow call here in pdb (but note the value in cudf_polars/utils/dtypes.py was set to what's on main, which for me is pl.CompatLevel.newest()):

(Pdb) pp expect.to_arrow(compat_level=1).columns[0].chunks[0].buffers()[0].to_pybytes()
b'\xdb'
(Pdb) pp got.to_arrow(compat_level=1).columns[0].chunks[0].buffers()[0].to_pybytes()
b'\xbb'
(Pdb) pp expect.to_arrow(compat_level=1).columns[0].chunks[0].buffers()[1].to_pybytes()
(b'\x01\x00\x00\x00a\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
 b'\x01\x00\x00\x00a\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
 b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
 b'\x01\x00\x00\x00a\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
 b'\x01\x00\x00\x00b\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
 b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
 b'\x01\x00\x00\x00b\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
 b'\x01\x00\x00\x00b\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')
(Pdb) pp got.to_arrow(compat_level=1).columns[0].chunks[0].buffers()[1].to_pybytes()
(b'\x01\x00\x00\x00a\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
 b'\x01\x00\x00\x00a\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
 b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
 b'\x01\x00\x00\x00a\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
 b'\x01\x00\x00\x00b\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
 b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
 b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
 b'\x01\x00\x00\x00b\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')
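For reference, the compat_level=1 buffers above are Arrow string views: 16 bytes per row, starting with an int32 length, with strings of up to 12 bytes stored inline right after the length. A minimal stdlib-only sketch for decoding them follows. `decode_inline_views` and `make_view` are hypothetical helpers written for this comment, and the byte strings are rebuilt from the rows shown above rather than pasted.

```python
import struct


def decode_inline_views(views: bytes) -> list[tuple[int, bytes]]:
    """Decode 16-byte string-view entries whose data is stored inline."""
    rows = []
    for start in range(0, len(views), 16):
        (length,) = struct.unpack_from("<i", views, start)
        if length > 12:
            # Longer strings reference a separate data buffer; not needed here.
            raise ValueError("only inline views are handled in this sketch")
        rows.append((length, views[start + 4 : start + 4 + length]))
    return rows


def make_view(value: bytes) -> bytes:
    """Build one inline view: int32 length followed by 12 zero-padded bytes."""
    return struct.pack("<i", len(value)) + value.ljust(12, b"\x00")


# Same contents as buffers()[1] for expect and got in the pdb session above.
expect_views = b"".join(
    make_view(v) for v in [b"a", b"a", b"", b"a", b"b", b"", b"b", b"b"]
)
got_views = b"".join(
    make_view(v) for v in [b"a", b"a", b"", b"a", b"b", b"", b"", b"b"]
)

expect_rows = decode_inline_views(expect_views)
got_rows = decode_inline_views(got_views)
differing = [i for i, (e, g) in enumerate(zip(expect_rows, got_rows)) if e != g]
print("rows whose views differ:", differing)  # [6]
```

The views only differ at row 6. Row 5 has a zero-length view on both sides, so the None vs '' difference at row 5 comes from the validity bitmap alone.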

Jun 13 '25 16:06 TomAugspurger

Simplifying the reproducer a bit:

import cudf_polars.containers
import polars as pl
import polars.testing.asserts

df = pl.DataFrame({"a": ["a", None]})
# this is buggy
result = cudf_polars.containers.DataFrame.from_polars(df.tail(1)).to_polars()
expected = df.tail(1)
polars.testing.asserts.assert_frame_equal(result, expected)

fails with

File /raid/toaugspurger/envs/rapidsai/cudf/25.08/lib/python3.12/site-packages/polars/testing/asserts/utils.py:12, in raise_assertion_error(objects, detail, left, right, cause)
     10 __tracebackhide__ = True
     11 msg = f"{objects} are different ({detail})\n[left]:  {left}\n[right]: {right}"
---> 12 raise AssertionError(msg) from cause

AssertionError: DataFrames are different (value mismatch for column 'a')
[left]:  ['']
[right]: [None]

Jun 13 '25 16:06 TomAugspurger

You're probably right about this being related to the compat level though. If we manually do the conversion and print what we get for various compat levels:

import cudf_polars.containers
import polars as pl
import polars.testing.asserts
import pylibcudf as plc

df = pl.DataFrame({"a": ["a", None]})
# this is buggy
result = cudf_polars.containers.DataFrame.from_polars(df.tail(1)).to_polars()
expected = df.tail(1)


table = plc.Table.from_arrow(df.tail(1))  # this is what DataFrame.from_polars does
print("no to_arrow", plc.interop.to_arrow(table).columns[0][0].as_py())  # ""

table = plc.Table.from_arrow(df.tail(1).to_arrow())
print("to_arrow-default", plc.interop.to_arrow(table).columns[0][0].as_py())  # None

table = plc.Table.from_arrow(df.tail(1).to_arrow(compat_level=0))
print("to_arrow-compat_level=0", plc.interop.to_arrow(table).columns[0][0].as_py())  # None

table = plc.Table.from_arrow(df.tail(1).to_arrow(compat_level=1))
print("to_arrow-compat_level=1", plc.interop.to_arrow(table).columns[0][0].as_py())  # ""

that prints

no to_arrow 
to_arrow-default None
to_arrow-compat_level=0 None
to_arrow-compat_level=1

Note that there are empty strings for the first and last case.

The fact that plc.Table.from_arrow(dataframe) and plc.Table.from_arrow(dataframe.to_arrow()) are different is maybe a bit concerning. IIUC, plc.Table.from_arrow(polars.DataFrame) will use polars' __arrow_c_stream__ method.

Jun 13 '25 16:06 TomAugspurger

Phew, this probably does end up in pylibcudf's handling of validity maps on sliced columns like Lawrence guessed. https://github.com/rapidsai/cudf/issues/19159 is a clean issue with a reproducer. I'm not yet sure if that's the root cause of this original error, but if I had to guess I'd say they're the same. I'll check once that's fixed.
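As a toy model of what "mishandling the validity map on a sliced column" could look like: a sliced Arrow array shares its parent's validity bitmap and carries an offset, so a reader that ignores the offset picks up the wrong bits. The sketch below is pure Python and entirely hypothetical (`read_validity` and `pack_bits` are made-up helpers, and this is not pylibcudf's code). It only shows that an ignored offset would be consistent with the outputs earlier in this thread; it does not establish the actual root cause.

```python
def read_validity(bitmap: bytes, offset: int, length: int) -> list[bool]:
    """Read `length` LSB-first validity bits starting at bit `offset`."""
    return [
        bool((bitmap[(offset + i) // 8] >> ((offset + i) % 8)) & 1)
        for i in range(length)
    ]


def pack_bits(bits: list[bool]) -> bytes:
    """Pack booleans back into an LSB-first bitmap."""
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


# Parent bitmap for ["a", "a", None, "a", "b", None, "b", "b"], split in two
# partitions of 4 rows (max_rows_per_partition=4).
parent = b"\xdb"
correct = read_validity(parent, 0, 4) + read_validity(parent, 4, 4)
# Hypothetical bug: the second partition is read as if its offset were 0.
ignored_offset = read_validity(parent, 0, 4) + read_validity(parent, 0, 4)

print(pack_bits(correct))         # b'\xdb', same as expect
print(pack_bits(ignored_offset))  # b'\xbb', same as got

# Simplified reproducer: ["a", None].tail(1) is offset=1, length=1.
small_parent = b"\x01"
print(read_validity(small_parent, 1, 1))  # [False] -> None (expected)
print(read_validity(small_parent, 0, 1))  # [True]  -> '' (what we observe)
```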

Jun 13 '25 17:06 TomAugspurger

@TomAugspurger Can you retry this with the fix in https://github.com/rapidsai/cudf/pull/19174 ?

Jun 16 '25 20:06 davidwendt

Thanks @davidwendt, I can confirm that https://github.com/rapidsai/cudf/pull/19174 fixes both https://github.com/rapidsai/cudf/issues/19159 and this issue.

Jun 16 '25 21:06 TomAugspurger

@TomAugspurger Can this be closed now that #19174 is merged?

Jun 18 '25 14:06 davidwendt

Yes, thanks. We don't yet have explicit test coverage for this in cudf-polars, but that'll come as part of https://github.com/rapidsai/cudf/pull/19146 so I think we're good here.

Jun 18 '25 14:06 TomAugspurger