narwhals icon indicating copy to clipboard operation
narwhals copied to clipboard

feat: `ArrowDataFrame.explode`

Open FBruzzesi opened this issue 1 year ago • 4 comments

What type of PR is this? (check all applicable)

  • [ ] 💾 Refactor
  • [x] ✨ Feature
  • [ ] 🐛 Bug Fix
  • [ ] 🔧 Optimization
  • [ ] 📝 Documentation
  • [ ] ✅ Test
  • [ ] 🐳 Other

Checklist

  • [x] Code follows style guide (ruff)
  • [x] Tests added
  • [ ] Documented the changes

If you have comments or can explain your changes, please do so below

I will leave this as draft until we decide how to move forward.

To summarize the discussion(s) in #1542 :

  • pyarrow native methods ignore nulls and empty list in explode
  • the workaround here is to have a "fast_path" for when nulls or empty lists are not present, and a dedicated path for when they are
  • the issue is that we enter python world to create the index via one .to_pylist() call
  • pandas seems to enter cython anyway to explode a list

FBruzzesi avatar Dec 22 '24 11:12 FBruzzesi

@FBruzzesi I feel like this shouldn't have got lost!

ArrowDataFrame.explode is 1 of 3 remaining implementations we need

https://github.com/narwhals-dev/narwhals/blob/2bcc6bbfe188889c15ae7eab544f5ba5e91ba695/narwhals/_arrow/dataframe.py#L350

https://github.com/narwhals-dev/narwhals/blob/2bcc6bbfe188889c15ae7eab544f5ba5e91ba695/narwhals/_arrow/dataframe.py#L466

I might add a PR for ArrowDataFrame.clone - since it can just utilize arrow data being immutable

https://github.com/narwhals-dev/narwhals/blob/2bcc6bbfe188889c15ae7eab544f5ba5e91ba695/narwhals/_arrow/dataframe.py#L669

dangotbanned avatar Mar 25 '25 11:03 dangotbanned

I feel like this shouldn't have got lost!

Thanks @dangotbanned ♥️ The main concern was a conversion to python object: filled_counts.to_pylist() in:

    if fast_path:
        indices = pc.list_parent_indices(native_frame[to_explode[0]])
        flatten_func = pc.list_flatten

    else:
        filled_counts = pc.max_element_wise(counts, 1, skip_nulls=True)
        indices = pa.array(
            [
                i
                for i, count in enumerate(filled_counts.to_pylist())
                for _ in range(count)
            ]
        )

FBruzzesi avatar Mar 25 '25 11:03 FBruzzesi

https://github.com/narwhals-dev/narwhals/pull/1644#issuecomment-2750901354

Maybe we can figure out another path hidden somewhere in the stubs? 🤔

dangotbanned avatar Mar 25 '25 11:03 dangotbanned

  • [ ] https://github.com/pola-rs/polars/issues/17664

@FBruzzesi @MarcoGorelli

It seems like polars wants to make a breaking change in the next major version - resulting in the same behavior as pyarrow.

If we had that behavior as the goal - I think pc.list_flatten(..., recursive=True) would get us most of the way there. Just something to keep in mind for the future 🙂

dangotbanned avatar Mar 25 '25 13:03 dangotbanned