feat: Add `list.pad_start()`
Fixes https://github.com/pola-rs/polars/issues/10283
Questions:
- ~In the linked issue above, the workaround lets the user specify the final length of each sublist but I use the length of the longest sublist and pad all other sublist to match this length instead. Is this correct?~ -> added a
lengthargument instead of automatically taking longest length - I feel like there's too much code duplication in the
matchstatement, but it's not obvious to me how to reduce it. Any idea? - ~Are the failures in new-streaming expected?~ -> solved, related to point 1
Codecov Report
Attention: Patch coverage is 99.23664% with 1 line in your changes missing coverage. Please review.
Project coverage is 79.29%. Comparing base (
084ddde) to head (5d7fe4d). Report is 8 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| crates/polars-plan/src/dsl/function_expr/list.rs | 88.88% | 1 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #20674 +/- ##
==========================================
+ Coverage 79.10% 79.29% +0.18%
==========================================
Files 1583 1583
Lines 225265 225676 +411
Branches 2586 2586
==========================================
+ Hits 178188 178941 +753
+ Misses 46487 46145 -342
Partials 590 590
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
It should be possible to support more types (date, datetime, duration, maybe categorical and enum also?), but I'd like to have some feedback on the current implementation before, cf the questions in the first post.
In particular, for now it casts categoricals to string in this case:
pl.DataFrame({"a": [["a"], ["a", "b"]]}, schema={"a": pl.List(pl.Categorical)}).select(
pl.col("a").list.pad_start("foo", length=2)
)
# shape: (2, 1)
# ┌──────────────┐
# │ a │
# │ --- │
# │ list[str] │
# ╞══════════════╡
# │ ["foo", "a"] │
# │ ["a", "b"] │
# └──────────────┘
which I'm not sure is correct.
@etiennebacher One bit of feedback is that the proposal in the original issue, and Expr.str.pad_start/Expr.str.pad_end take a width to pad to. This is important so that the operation can execute in a streaming fashion instead of first having to find the maximum width.
wrt to making it a specified width, could that input be an Expr input so if I'm not streaming I can use .list.len().max()? @orlp would a lit satisfy the streaming engine in that case?
Also, does it make sense for this to return an Array instead of a List? I just assume that all the use cases for making everything a unified width would want an array as the next step.
I was playing around with how to avoid match and came up with this:
let fill_s = fill_value.as_materialized_series();
let out = ca.apply_values(|inner_series| {
let inner_series = inner_series.explode().unwrap();
if inner_series.len() >= width {
inner_series.slice(0i64, width) // is this right, should it truncate or just return as-is?
} else {
// this assumes fill can't be too small, need to repeat if scalar
let need_fill = width - inner_series.len();
let mut fill = fill_s.slice(0i64, need_fill).clone();
fill.append(&inner_series).unwrap();
if fill.len() != width {
panic!("fill+original!=width")
}
let fill = fill; // to undo mut
fill
}
});
I don't know the difference between all the apply methods so not sure if that should be apply_amortized or something else.
@deanm0000
wrt to making it a specified width, could that input be an Expr input so if I'm not streaming I can use .list.len().max()?
Yes, and we should also allow that for Expr.str.pad_start and Expr.str.pad_end. PR's welcome :)
Also, does it make sense for this to return an Array instead of a List?
No, because lists longer than the specified length should stay that way, similar to Expr.str.pad_start.
No, because lists longer than the specified length should stay that way, similar to Expr.str.pad_start.
I hadn't understood that in the original issue. I thought all the sub-lists in the output were guaranteed to have the same length. I guess it makes sense if one can pass .list.len().max() as width.
Closing in favor of https://github.com/pola-rs/polars/pull/23323