[Bug]: fill_null bug with pyarrow 21.0.0
Tracking
- [x] https://github.com/apache/arrow/issues/47234
Describe the bug
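With pyarrow 21.0.0, fill_null returns an unexpected result: a non-null value is overwritten with the fill value, while the actual null is left unfilled.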
Steps or code to reproduce the bug
The bug appeared when using the duckdb backend, but it turns out I can reproduce it without duckdb.
import io
import pandas as pd
import narwhals as nw
import pyarrow
print(nw.__version__)
print(pyarrow.__version__)
# Columns are tab-separated; the last row has no some_value, so it is read as missing.
check_pd_df = pd.read_csv(io.StringIO("""some_index\tsome_value
0\t10
1\t20
2\t30
3\t40
4\t50
5\t
"""), sep="\t")
print(
nw.from_native(nw.from_native(check_pd_df).to_arrow())
.with_columns([nw.col("some_value").fill_null(999)])
.to_native()
)
Expected results
With pyarrow 20.0.0, the results (correct) are
2.2.0
20.0.0
pyarrow.Table
some_index: int64
some_value: double
----
some_index: [[0,1,2,3,4,5]]
some_value: [[10,20,30,40,50,999]]
Actual results
With pyarrow 21.0.0, the results (wrong) are
2.2.0
21.0.0
pyarrow.Table
some_index: int64
some_value: double
----
some_index: [[0,1,2,3,4,5]]
some_value: [[10,999,30,40,50,nan]]
Please run narwhals.show_versions() and enter the output below.
System:
python: 3.12.9 (main, Feb 12 2025, 14:52:31) [MSC v.1942 64 bit (AMD64)]
executable: <path-to-project>\.venv\Scripts\python.exe
machine: Windows-10-10.0.19045-SP0
Python dependencies:
narwhals: 2.2.0
numpy: 2.3.2
pandas: 2.3.2
modin:
cudf:
pyarrow: 21.0.0
pyspark:
polars: 1.32.3
dask:
duckdb: 1.3.2
ibis:
sqlframe:
Hey @yuuuxt thanks for reporting the issue.
I am not able to replicate on MacOS 🤔
import io
import pandas as pd
import narwhals as nw
import pyarrow
print(nw.__version__)
print(pyarrow.__version__)
# Columns are tab-separated; the last row has no some_value, so it is read as missing.
check_pd_df = pd.read_csv(io.StringIO("""some_index\tsome_value
0\t10
1\t20
2\t30
3\t40
4\t50
5\t
"""), sep="\t")
print(
nw.from_native(nw.from_native(check_pd_df).to_arrow())
.with_columns(some_value_filled = nw.col("some_value").fill_null(999))
.to_native()
)
Outputs:
2.2.0
21.0.0
pyarrow.Table
some_index: int64
some_value: double
some_value_filled: double
----
some_index: [[0,1,2,3,4,5]]
some_value: [[10,20,30,40,50,null]]
some_value_filled: [[10,20,30,40,50,999]]
I will wait for someone with easy access to a Windows machine.
OK, I will try to check it on linux as well when I'm back home.
thanks @yuuuxt for the report
I also can't reproduce this issue on Linux
what happens if you just do the operation directly in pyarrow without going through narwhals?
> I will wait for someone with easy access to windows machine
That would be me, sadly 😉
Can confirm the repro:
import io
import pandas as pd
import narwhals as nw
import pyarrow
# Columns are tab-separated; the last row has no some_value, so it is read as missing.
check_pd_df = pd.read_csv(io.StringIO("""some_index\tsome_value
0\t10
1\t20
2\t30
3\t40
4\t50
5\t
"""), sep="\t")
print(
nw.from_native(nw.from_native(check_pd_df).to_arrow())
.with_columns([nw.col("some_value").fill_null(999)])
.to_native()
)
pyarrow.Table
some_index: int64
some_value: double
----
some_index: [[0,1,2,3,4,5]]
some_value: [[10,999,30,40,50,nan]]
Show versions
System:
python: 3.13.5 (main, Jun 12 2025, 12:42:35) [MSC v.1943 64 bit (AMD64)]
executable: c:\Users\danie\Documents\GitHub\narwhals\.venv\Scripts\python.exe
machine: Windows-10-10.0.19045-SP0
Python dependencies:
narwhals: 2.2.0
numpy: 2.3.2
pandas: 2.3.2
modin: 0.35.0
cudf:
pyarrow: 21.0.0
pyspark:
polars: 1.32.2
dask: 2025.7.0
duckdb: 1.3.0
ibis: 10.8.0
sqlframe: 3.39.2
There's another pyarrow>=21 fill_null bug on windows, ~~but I can't reproduce that one~~ (https://github.com/apache/arrow/issues/47234)
Edit: jfc yes I can 🤦♂️ - I didn't realise they added markup manually (thanks @yuuuxt (https://github.com/narwhals-dev/narwhals/issues/3048#issuecomment-3232910258))
[
true,
**true**,
false,
false,
false,
**false**
]
@dangotbanned Thanks for the confirmation!
@MarcoGorelli I assume I don't need to follow up now as @dangotbanned is investigating on Windows, right?
for https://github.com/apache/arrow/issues/47234 - seems I reproduced it:
# /// script
# dependencies = [
# "pyarrow==21.0.0",
# ]
# ///
import pyarrow as pa
z = pa.array([True, False, False, False, False, None])
print(pa.__version__)
print(z.fill_null(True))
the result is
21.0.0
[
true,
true,
false,
false,
false,
false
]
while for pyarrow 20.0.0 it's like
20.0.0
[
true,
false,
false,
false,
false,
true
]
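To make the difference between the two printed outputs explicit, here is a small pure-Python comparison. It only uses the values shown above (nothing is recomputed with pyarrow), and the variable names are made up for illustration.

```python
# Input and the two results, copied from the outputs above.
input_values = [True, False, False, False, False, None]
result_20 = [True, False, False, False, False, True]   # pyarrow 20.0.0
result_21 = [True, True, False, False, False, False]   # pyarrow 21.0.0

# Indices where the two versions disagree.
differing = [i for i, (a, b) in enumerate(zip(result_20, result_21)) if a != b]
print(differing)  # [1, 5]

for i in differing:
    print(i, input_values[i], result_20[i], result_21[i])
```

Index 1 is a non-null input that gets overwritten, and index 5 is the null that does not receive the fill value. Those are the same two positions that are wrong in the some_value output from the original report.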
> Thanks for the confirmation!
No worries!
> as @dangotbanned is investigating on Windows, right?
Just a confirmation from me 😅
> I assume I don't need to follow up now
Actually, it would be helpful to try to minimise this repro first.
Can you reduce this to:
- not include pd.read_csv?
- not include pandas?
- not include narwhals?
I realised I was a bit quick to label this as upstream (https://github.com/narwhals-dev/narwhals/issues/3048#event-19388414681), since there are a few unknowns still
To me what's also concerning is that our test suite didn't pick that up. We run pyarrow on windows: https://github.com/narwhals-dev/narwhals/blob/bfb61bf93eeaf18784d91baf43354a3ca018ffc2/.github/workflows/pytest.yml#L37
tbh i'm more concerned about pyarrow than about our test suite 😄
we don't do anything fancy here, we literally just call pc.fill_null
https://github.com/narwhals-dev/narwhals/blob/bfb61bf93eeaf18784d91baf43354a3ca018ffc2/narwhals/_arrow/series.py#L688-L690
@MarcoGorelli sure, that's fair. They are aware and marked https://github.com/apache/arrow/issues/47234 as a critical fix and also proposed a backport.
Still, I would expect one of our tests to fail 🤔
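For illustration, a regression check for this could be sketched as below. This is a hypothetical test, not taken from the narwhals test suite: the helper name fill_null_mismatches is invented here, and it calls pc.fill_null directly rather than going through narwhals.

```python
import pyarrow as pa
import pyarrow.compute as pc


def fill_null_mismatches(values, fill_value):
    """Return indices where pc.fill_null disagrees with a pure-Python fill."""
    expected = [fill_value if v is None else v for v in values]
    actual = pc.fill_null(pa.array(values), fill_value).to_pylist()
    # A leftover None or NaN in `actual` also compares unequal, so it is reported.
    return [i for i, (e, a) in enumerate(zip(expected, actual)) if e != a]


def test_fill_null_float():
    # Values from the original report.
    assert fill_null_mismatches([10.0, 20.0, 30.0, 40.0, 50.0, None], 999.0) == []


def test_fill_null_bool():
    # Values from the upstream boolean repro.
    assert fill_null_mismatches([True, False, False, False, False, None], True) == []
```

Both test functions use the six-element inputs from this thread, on the assumption that the array shape matters for hitting the bug. Shorter inputs might not exercise the same code path.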
As an update: the upstream issue has been closed/fixed and they mentioned that the next release (pyarrow 22.0) will be in October
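Until a fixed release is available, a project could flag possibly affected environments with a check like the one below. This is only a heuristic sketch: the function name is made up, and the affected range (any pyarrow 21.x on Windows) is an assumption based on this thread. If a 21.x patch release ships the backport, the check would need narrowing.

```python
import sys


def may_be_affected(pyarrow_version: str, platform: str = sys.platform) -> bool:
    """Heuristic: True for pyarrow 21.x on Windows (assumed affected range)."""
    major = int(pyarrow_version.split(".")[0])
    return platform == "win32" and major == 21
```

It would typically be called as may_be_affected(pyarrow.__version__), for example to emit a warning or to skip a known-failing test.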