[Bug]: fill_null bug with pyarrow 21.0.0

Open yuuuxt opened this issue 4 months ago • 11 comments

Tracking

  • [x] https://github.com/apache/arrow/issues/47234

Describe the bug

fill_null returns an unexpected result.

Steps or code to reproduce the bug

The bug appeared when using the duckdb backend, but it turns out I can reproduce it without duckdb.

import io
import pandas as pd
import narwhals as nw
import pyarrow
print(nw.__version__)
print(pyarrow.__version__)

check_pd_df = pd.read_csv(io.StringIO("""some_index	some_value
0	10
1	20
2	30
3	40
4	50
5
"""), sep="\t")

print(
    nw.from_native(nw.from_native(check_pd_df).to_arrow())
    .with_columns([nw.col("some_value").fill_null(999)])
    .to_native()
)

Expected results

With pyarrow 20.0.0, the results (correct) are

2.2.0
20.0.0
pyarrow.Table
some_index: int64
some_value: double
----
some_index: [[0,1,2,3,4,5]]
some_value: [[10,20,30,40,50,999]]

Actual results

With pyarrow 21.0.0, the results (wrong) are

2.2.0
21.0.0
pyarrow.Table
some_index: int64
some_value: double
----
some_index: [[0,1,2,3,4,5]]
some_value: [[10,999,30,40,50,nan]]

Please run narwhals.show_versions() and enter the output below.

System:
    python: 3.12.9 (main, Feb 12 2025, 14:52:31) [MSC v.1942 64 bit (AMD64)]
executable: <path-to-project>\.venv\Scripts\python.exe
   machine: Windows-10-10.0.19045-SP0

Python dependencies:
     narwhals: 2.2.0
        numpy: 2.3.2
       pandas: 2.3.2
        modin: 
         cudf: 
      pyarrow: 21.0.0
      pyspark: 
       polars: 1.32.3
         dask: 
       duckdb: 1.3.2
         ibis: 
     sqlframe:

yuuuxt avatar Aug 28 '25 04:08 yuuuxt

Hey @yuuuxt thanks for reporting the issue.

I am not able to replicate on macOS 🤔

import io
import pandas as pd
import narwhals as nw
import pyarrow
print(nw.__version__)
print(pyarrow.__version__)

check_pd_df = pd.read_csv(io.StringIO("""some_index	some_value
0	10
1	20
2	30
3	40
4	50
5
"""), sep="\t")

print(
    nw.from_native(nw.from_native(check_pd_df).to_arrow())
    .with_columns(some_value_filled = nw.col("some_value").fill_null(999))
    .to_native()
)

Outputs:

2.2.0
21.0.0
pyarrow.Table
some_index: int64
some_value: double
some_value_filled: double
----
some_index: [[0,1,2,3,4,5]]
some_value: [[10,20,30,40,50,null]]
some_value_filled: [[10,20,30,40,50,999]]

I will wait for someone with easy access to a Windows machine

FBruzzesi avatar Aug 28 '25 07:08 FBruzzesi

OK, I will try to check it on linux as well when I'm back home.

yuuuxt avatar Aug 28 '25 07:08 yuuuxt

thanks @yuuuxt for the report

I also can't reproduce this issue on Linux

what happens if you just do the operation directly in pyarrow without going through narwhals?

MarcoGorelli avatar Aug 28 '25 07:08 MarcoGorelli

> I will wait for someone with easy access to a Windows machine

That would be me, sadly 😉

Can confirm the repro:

import io
import pandas as pd
import narwhals as nw
import pyarrow

check_pd_df = pd.read_csv(io.StringIO("""some_index	some_value
0	10
1	20
2	30
3	40
4	50
5
"""), sep="\t")

print(
    nw.from_native(nw.from_native(check_pd_df).to_arrow())
    .with_columns([nw.col("some_value").fill_null(999)])
    .to_native()
)

pyarrow.Table
some_index: int64
some_value: double
----
some_index: [[0,1,2,3,4,5]]
some_value: [[10,999,30,40,50,nan]]

Show versions

System:
    python: 3.13.5 (main, Jun 12 2025, 12:42:35) [MSC v.1943 64 bit (AMD64)]
executable: c:\Users\danie\Documents\GitHub\narwhals\.venv\Scripts\python.exe
   machine: Windows-10-10.0.19045-SP0

Python dependencies:
     narwhals: 2.2.0
        numpy: 2.3.2
       pandas: 2.3.2
        modin: 0.35.0
         cudf: 
      pyarrow: 21.0.0
      pyspark: 
       polars: 1.32.2
         dask: 2025.7.0
       duckdb: 1.3.0
         ibis: 10.8.0
     sqlframe: 3.39.2

There's another pyarrow>=21 fill_null bug on windows, ~~but I can't reproduce that one~~ (https://github.com/apache/arrow/issues/47234)

Edit: jfc yes I can 🤦‍♂️ - I didn't realise they added markup manually (thanks @yuuuxt (https://github.com/narwhals-dev/narwhals/issues/3048#issuecomment-3232910258))

[
  true,
  **true**,
  false,
  false,
  false,
  **false**
]

dangotbanned avatar Aug 28 '25 10:08 dangotbanned

@dangotbanned Thanks for the confirmation!

@MarcoGorelli I assume I don't need to follow up now as @dangotbanned is investigating on Windows, right?

yuuuxt avatar Aug 28 '25 10:08 yuuuxt

For https://github.com/apache/arrow/issues/47234, it seems I can reproduce it:

# /// script
# dependencies = [
#   "pyarrow==21.0.0",
# ]
# ///
import pyarrow as pa

z = pa.array([True, False, False, False, False, None])
print(pa.__version__)
print(z.fill_null(True))

the result is

21.0.0
[
  true,
  true,
  false,
  false,
  false,
  false
]

while with pyarrow 20.0.0 the result is

20.0.0
[
  true,
  false,
  false,
  false,
  false,
  true
]

yuuuxt avatar Aug 28 '25 10:08 yuuuxt

> Thanks for the confirmation!

No worries!

> as @dangotbanned is investigating on Windows, right?

Just a confirmation from me 😅

> I assume I don't need to follow up now

Actually it would be helpful to try and minimise this repro first

Can you reduce this to:

  • not include pd.read_csv?
  • not include pandas?
  • not include narwhals?

I realised I was a bit quick to label this as upstream (https://github.com/narwhals-dev/narwhals/issues/3048#event-19388414681), since there are a few unknowns still

dangotbanned avatar Aug 28 '25 10:08 dangotbanned

To me what's also concerning is that our test suite didn't pick that up. We run pyarrow on windows: https://github.com/narwhals-dev/narwhals/blob/bfb61bf93eeaf18784d91baf43354a3ca018ffc2/.github/workflows/pytest.yml#L37

FBruzzesi avatar Aug 28 '25 11:08 FBruzzesi

tbh i'm more concerned about pyarrow than about our test suite 😄

we don't do anything fancy here, we literally just call pc.fill_null

https://github.com/narwhals-dev/narwhals/blob/bfb61bf93eeaf18784d91baf43354a3ca018ffc2/narwhals/_arrow/series.py#L688-L690

MarcoGorelli avatar Aug 28 '25 12:08 MarcoGorelli

@MarcoGorelli sure, that's fair. They are aware of it, have marked https://github.com/apache/arrow/issues/47234 as a critical fix, and have also proposed a backport.

Still, I would expect one of our tests to fail 🤔

FBruzzesi avatar Aug 28 '25 12:08 FBruzzesi

As an update: the upstream issue has been closed/fixed and they mentioned that the next release (pyarrow 22.0) will be in October
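Until then, a project could flag possibly affected environments with a small check like the sketch below. The helper name `may_be_affected` is hypothetical, and the affected range is an assumption drawn only from this thread (reports on Windows with pyarrow 21.0.0, fix expected in 22.0); a backported 21.x release could also contain the fix, which this check would not detect.

```python
import sys


def may_be_affected(pyarrow_version: str, platform: str = sys.platform) -> bool:
    """Return True for the combination reported in this thread.

    Assumption: only Windows builds of pyarrow 21.x are affected.
    """
    major = int(pyarrow_version.split(".")[0])
    return platform == "win32" and major == 21


# Example usage with explicit arguments instead of the running environment.
print(may_be_affected("21.0.0", "win32"))
print(may_be_affected("20.0.0", "win32"))
```

In real use the version string would come from `pyarrow.__version__`; it is passed in explicitly here so the helper stays free of third-party imports.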

FBruzzesi avatar Sep 15 '25 17:09 FBruzzesi