feat: `ArrowDataFrame.explode`
What type of PR is this? (check all applicable)
- [ ] 💾 Refactor
- [x] ✨ Feature
- [ ] 🐛 Bug Fix
- [ ] 🔧 Optimization
- [ ] 📝 Documentation
- [ ] ✅ Test
- [ ] 🐳 Other
Checklist
- [x] Code follows style guide (ruff)
- [x] Tests added
- [ ] Documented the changes
If you have comments or can explain your changes, please do so below
I will leave this as draft until we decide how to move forward.
To summarize the discussion(s) in #1542 :
- pyarrow native methods ignore nulls and empty list in explode
- the workaround here is to have a "fast_path" for when nulls or empty lists are not present, and a dedicated path for when they are
- the issue is that we enter python world to create the index via one
.to_pylist()call - pandas seems to enter cython anyway to explode a list
@FBruzzesi I feel like this shouldn't have got lost!
ArrowDataFrame.explode is 1 of 3 remaining implementations we need
https://github.com/narwhals-dev/narwhals/blob/2bcc6bbfe188889c15ae7eab544f5ba5e91ba695/narwhals/_arrow/dataframe.py#L350
https://github.com/narwhals-dev/narwhals/blob/2bcc6bbfe188889c15ae7eab544f5ba5e91ba695/narwhals/_arrow/dataframe.py#L466
I might add a PR for ArrowDataFrame.clone - since it can just utilize arrow data being immutable
https://github.com/narwhals-dev/narwhals/blob/2bcc6bbfe188889c15ae7eab544f5ba5e91ba695/narwhals/_arrow/dataframe.py#L669
I feel like this shouldn't have got lost!
Thanks @dangotbanned ♥️ The main concern was a conversion to python object: filled_counts.to_pylist() in:
if fast_path:
indices = pc.list_parent_indices(native_frame[to_explode[0]])
flatten_func = pc.list_flatten
else:
filled_counts = pc.max_element_wise(counts, 1, skip_nulls=True)
indices = pa.array(
[
i
for i, count in enumerate(filled_counts.to_pylist())
for _ in range(count)
]
)
https://github.com/narwhals-dev/narwhals/pull/1644#issuecomment-2750901354
Maybe we can figure out another path hidden somewhere in the stubs? 🤔
- [ ] https://github.com/pola-rs/polars/issues/17664
@FBruzzesi @MarcoGorelli
It seems like polars wants to make a breaking change in the next major version - resulting in the same behavior as pyarrow.
If we had that behavior as the goal - I think pc.list_flatten(..., recursive=True) would get us most of the way there.
Just something to keep in mind for the future 🙂