
BUG: PyArrow string: list indexing and negative indices fail after str.split()

Open Elliottt-Chen opened this issue 1 month ago • 6 comments

Pandas version checks

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
from pandas import Series

print(pd.__version__)  # '3.0.0.dev0+2752.ga885d67965'

s: Series = pd.Series(["A-B", "C-D"]).convert_dtypes(dtype_backend="pyarrow")
s.str.split("-")

# 0    ['A' 'B']
# 1    ['C' 'D']

s.str.split("-").str[0]

# AttributeError: Can only use .str accessor with string values, not unknown-array. 
# Did you mean: 'std'?

s.str.split("-").list[0]

# 0    A
# 1    C

s.str.split("-").list[-1]

# pyarrow.lib.ArrowInvalid: Index -1 is out of bounds: should be greater than or 
# equal to 0

Issue Description

I noticed that the latest nightly wheel (pandas 3.0.0.dev0+2752.ga885d67965) does not allow indexing into list elements produced by str.split() when using the pyarrow-backed string dtype.

s1 = Series(...).convert_dtypes(dtype_backend="pyarrow")
s1.str.split().str[...]   # attempting to access list items via the str accessor
s1.str.split().list[-1]   # negative index access via the list accessor

Specifically, element access via the str accessor (str[...]) and negative indexing via the list accessor (list[-1]) appear to be unsupported for pyarrow-backed string arrays in the current nightly build.

The error messages were as follows:

  1. AttributeError: Can only use .str accessor with string values, not unknown-array. Did you mean: 'std'? (str accessor)
  2. pyarrow.lib.ArrowInvalid: Index -1 is out of bounds: should be greater than or equal to 0 (list accessor)

Expected Behavior

After casting to pd.StringDtype(), str.split() produces an object-dtype Series of lists, and negative indexing works as expected; the pyarrow-backed path should behave the same way:

s = s.astype(pd.StringDtype())

s.str.split("-").str[-1]

# 0    B
# 1    D
# dtype: object

Installed Versions

INSTALLED VERSIONS

commit           : a885d6796501923905b9381ecafde07ada57a914
python           : 3.14.0
python-bits      : 64
OS               : Darwin
OS-release       : 24.6.0
Version          : Darwin Kernel Version 24.6.0: Wed Oct 15 21:09:41 PDT 2025; root:xnu-11417.140.69.703.14~1/RELEASE_ARM64_T8122
machine          : arm64
processor        : arm
byteorder        : little
LC_ALL           : None
LANG             : en_GB.UTF-8
LOCALE           : en_GB.UTF-8

pandas                 : 3.0.0.dev0+2752.ga885d67965
numpy                  : 2.3.5
dateutil               : 2.9.0.post0
pip                    : 25.3
Cython                 : None
sphinx                 : None
IPython                : 9.7.0
adbc-driver-postgresql : None
adbc-driver-sqlite     : None
bs4                    : 4.14.2
bottleneck             : None
fastparquet            : 2024.11.0
fsspec                 : 2025.10.0
html5lib               : None
hypothesis             : None
gcsfs                  : None
jinja2                 : 3.1.6
lxml.etree             : None
matplotlib             : 3.10.7
numba                  : None
numexpr                : 2.14.1
odfpy                  : None
openpyxl               : 3.1.5
psycopg2               : None
pymysql                : None
pyarrow                : 22.0.0
pyiceberg              : 0.10.0
pyreadstat             : 1.3.2
pytest                 : None
python-calamine        : None
pytz                   : 2025.2
pyxlsb                 : 1.0.10
s3fs                   : None
scipy                  : 1.16.3
sqlalchemy             : 2.0.44
tables                 : 3.10.2
tabulate               : 0.9.0
xarray                 : 2025.11.0
xlrd                   : 2.0.2
xlsxwriter             : 3.2.9
zstandard              : None
qtpy                   : None
pyqt5                  : None

Elliottt-Chen avatar Nov 27 '25 19:11 Elliottt-Chen

I'm in favor of supporting negative indices in .list[key]. In fact, it appears there is already commented-out code to do this:

https://github.com/pandas-dev/pandas/blob/b6d67b718807b0c108eea78e4ed3d13fd78d5d26/pandas/core/arrays/arrow/accessors.py#L159-L162

However, I don't think we should allow using pyarrow lists with .str.

rhshadrach avatar Nov 29 '25 14:11 rhshadrach

Thank you for the clarification. I understand the blocker is that pc.list_element currently only accepts scalar indices.

To unblock users in the short term, I can implement a workaround for negative indexing directly in pandas. The approach would be:

For key >= 0, we continue to use the efficient pc.list_element(self._pa_array, key). For key < 0, we use a Python-level fallback. I'm thinking of a vectorized approach using pc.list_value_length to calculate valid indices per list, then extracting elements. This will be slower than a native Arrow implementation, but it will make the feature work.

This implementation would come with a clear PerformanceWarning to set user expectations.
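A rough standalone sketch of the proposed dispatch (list_getitem is a hypothetical helper standing in for ListAccessor.__getitem__; the negative branch is deliberately naive and returns null for too-short lists rather than raising):

import pyarrow as pa
import pyarrow.compute as pc

def list_getitem(pa_array, key: int):
    """Hypothetical helper mirroring the proposed accessor logic."""
    if key >= 0:
        # Fast path: pc.list_element natively supports scalar
        # non-negative indices.
        return pc.list_element(pa_array, key)
    # Python-level fallback: adjust the negative index by each
    # list's length and extract element by element.
    lengths = pc.list_value_length(pa_array).to_pylist()
    return pa.array(
        [
            lst[key + n] if lst is not None and n >= -key else None
            for lst, n in zip(pa_array.to_pylist(), lengths)
        ]
    )

arr = pa.chunked_array([pa.array([["A", "B"], ["C", "D", "E"]])])
list_getitem(arr, -1).to_pylist()  # ['B', 'E']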

Would you be open to a PR that adds negative index support in this way? If so, I can proceed with the implementation. @rhshadrach

Aokizy2 avatar Dec 06 '25 07:12 Aokizy2

@Aokizy2 - unless I'm missing something, I think we just need to uncomment the two lines I highlighted in my last post, which I think is equivalent to what you proposed (and add tests). But this should not raise a performance warning.

rhshadrach avatar Dec 06 '25 11:12 rhshadrach

@rhshadrach - Thanks for the direction. I've looked into the code, and I believe there's a small misunderstanding about those two commented lines.

The issue is that pc.add(key, pc.list_value_length(self._pa_array)) would produce an array of indices (one adjusted index for each list). However, the very next line pc.list_element(self._pa_array, key) requires its second argument (key) to be a single integer (scalar), not an array. This is exactly what the TODO comment itself states: “pyarrow does not allow element index to be an array.”
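For illustration, here is a minimal repro of the constraint; the adjusted-index array is exactly what the commented code would produce:

import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([["A", "B"], ["C", "D", "E"]])
# One adjusted index per list, e.g. key=-1 -> [1, 2].
adjusted = pc.add(-1, pc.list_value_length(arr))

pc.list_element(arr, 1)         # OK: scalar index
pc.list_element(arr, adjusted)  # raises ArrowNotImplementedError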

Therefore, simply uncommenting those two lines would result in a runtime ArrowNotImplementedError because we'd be passing an array where a scalar is expected.

To move forward, I think we have two options:

Option 1 (Workaround in Pandas): Implement the logic described in the comment, but handle the iteration ourselves (e.g., loop through each list, calculate the adjusted index, and extract the element). This would make .list[-1] work for users immediately.

Option 2 (Wait for upstream fix): Hold off on any workaround and focus on requesting or contributing the array-index feature directly in the PyArrow project. This would be the cleaner long-term solution.

Given that users are already reporting this issue, what do you think about proceeding with Option 1? I'm happy to implement it if you agree.

Aokizy2 avatar Dec 06 '25 11:12 Aokizy2

Ah, thanks, that explains why they are commented! I would benchmark your proposed Option 1 against the algorithm in https://github.com/apache/arrow/issues/48349#issuecomment-3617144584 and take whichever is more performant for reasonably sized data (100k rows perhaps).

rhshadrach avatar Dec 06 '25 13:12 rhshadrach

OK, thanks! I will work on it.

Aokizy2 avatar Dec 06 '25 13:12 Aokizy2

Code for Option 1:

# Intended for ListAccessor.__getitem__; assumes pyarrow as pa,
# pyarrow.compute as pc, Series, and ArrowDtype are in scope.
if key < 0:
    lengths = pc.list_value_length(self._pa_array)
    # lengths[i].as_py() is None for null lists, so filter on the
    # converted value rather than on the pyarrow scalar itself.
    min_length = min(
        length.as_py() for length in lengths if length.as_py() is not None
    )
    if key < -min_length:
        raise IndexError(f"list index {key} out of bounds")
    results = []
    for i in range(len(self._pa_array)):
        current_length = lengths[i].as_py() or 0
        actual_index = key + current_length
        if 0 <= actual_index < current_length:
            # Slice to a single row so pc.list_element still receives
            # a scalar index.
            element = pc.list_element(self._pa_array.slice(i, 1), actual_index)
            results.append(element[0] if len(element) > 0 else None)
        else:
            # Null lists are treated as empty and end up here.
            raise IndexError(
                f"list index {key} out of bounds for list at position {i}"
            )
    try:
        result_array = pa.array(results)
    except (pa.ArrowInvalid, TypeError):
        result_array = pa.array(results, type=pa.null())
    return Series(
        result_array,
        dtype=ArrowDtype(result_array.type),
        index=self._data.index,
        name=self._data.name,
    )

Code for Option 2:

# Same context as Option 1; assumes pa, pc, Series, and ArrowDtype
# are in scope.
if key < 0:
    arr = self._pa_array
    lengths = pc.list_value_length(arr)
    for i in range(len(arr)):
        # Null lists yield None from as_py(); treat them as empty.
        current_length = lengths[i].as_py() or 0
        if current_length == 0:
            raise IndexError(
                f"list index {key} out of range for empty list at position {i}"
            )
        if current_length < abs(key):
            raise IndexError(
                f"list index {key} out of range for list of length "
                f"{current_length} at position {i}"
            )
    chunks = arr.chunks if isinstance(arr, pa.ChunkedArray) else [arr]
    all_results = []
    for chunk in chunks:
        if len(chunk) > 0:
            chunk_lengths = pc.list_value_length(chunk)
            chunk_offsets = chunk.offsets
            # Map each negative index to a flat position in the child
            # array: the list's offset plus its adjusted element index.
            indices = pa.array(
                [
                    chunk_offsets[i].as_py() + (chunk_lengths[i].as_py() + key)
                    for i in range(len(chunk))
                ]
            )
            all_results.append(chunk.values.take(indices))
    result_values = (
        pa.concat_arrays(all_results)
        if all_results
        else pa.array([], type=arr.type.value_type)
    )
    return Series(
        result_values,
        dtype=ArrowDtype(result_values.type),
        index=self._data.index,
        name=self._data.name,
    )
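With either branch wired into ListAccessor.__getitem__, the example from the original report would then match the expected behaviour:

s = pd.Series(["A-B", "C-D"]).convert_dtypes(dtype_backend="pyarrow")
s.str.split("-").list[-1]
# 0    B
# 1    D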

Performance Results:

Option 1 (Loop-based): ~1850 ms for 100,000 lists

Option 2 (Vectorized): ~220 ms for 100,000 lists
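A harness along these lines can reproduce the comparison (option1 and option2 are hypothetical wrappers around the two snippets above, taking the pyarrow list array and the key):

import time

import pandas as pd

s = pd.Series([f"{i}-{i + 1}-{i + 2}" for i in range(100_000)])
lists = s.convert_dtypes(dtype_backend="pyarrow").str.split("-")

# option1/option2: hypothetical function wrappers around the snippets above.
for name, func in [("Option 1", option1), ("Option 2", option2)]:
    start = time.perf_counter()
    # ._pa_array is the underlying ChunkedArray (private attribute,
    # used here only for benchmarking).
    func(lists.array._pa_array, -1)
    print(f"{name}: {(time.perf_counter() - start) * 1000:.0f} ms")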

Hi @rhshadrach, could you please check whether there are any issues with this? If it looks good, I'll submit a PR adopting Option 2.

Aokizy2 avatar Dec 11 '25 06:12 Aokizy2

Thanks @Aokizy2 - I would suggest putting up a PR as a next step.

rhshadrach avatar Dec 11 '25 21:12 rhshadrach