polars
Ability to use expression as `pattern` for `Expr.str` methods `extract`, `count_match` & `extract_groups`
Problem description
I was trying to do substring matching based on two columns, but got stuck because `count_match` only accepts a `str` pattern. This can be worked around with `extract_all`, which does support an `Expr` pattern, by counting the length of the returned list. However, it feels like that leaves quite a bit of performance on the table.
An example of what the expected functionality would do:
import polars as pl

df = pl.DataFrame({"foo": ["123 bla 45 asd", "xyz 678 910t"], "bar": [r"\d", r"[a-z]"]})
print(
    df.select(
        pl.col("foo").str.count_match(pl.col("bar")).alias("count_bar")
    )
)
Should return:
shape: (2, 1)
┌───────────┐
│ count_bar │
│ ---       │
│ u32       │
╞═══════════╡
│ 5         │
│ 4         │
└───────────┘
I would love to help out, but I haven't contributed before.
Progress
- [X] count_match
- [ ] extract
- [ ] extract_groups
Most of the steps needed are shown in some previous PRs if you're interested: https://github.com/pola-rs/polars/pull/6355/files
.extract_groups() is a bit different though, as it returns a struct - so multiple patterns would be ambiguous.
Great! Will take a look after the weekend.
@wdoppenberg Are you also already working on extract? If not I'll take over and do that one.
Also, I believe we should not do extract_groups, it does not really make much sense with a varying pattern from a schema perspective.
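To illustrate the schema concern, here is a small sketch using Python's standard `re` module as a stand-in for the regex engine Polars uses. The rows and patterns are hypothetical and not taken from this issue. Each pattern defines its own capture-group names, so a per-row pattern would imply a different set of struct fields for each row.

```python
import re

# Hypothetical (value, pattern) pairs; each pattern names different groups.
rows = [
    ("2024-01-15", r"(?P<year>\d{4})-(?P<month>\d{2})"),
    ("alice@example.com", r"(?P<user>[^@]+)@(?P<domain>.+)"),
]


def group_fields(pattern):
    """Return the capture-group names a struct result would need as fields."""
    return list(re.compile(pattern).groupindex)


# One list of field names per row: the "schema" each pattern would require.
schemas = [group_fields(pattern) for _, pattern in rows]
print(schemas)
```

Because a struct column needs a single fixed set of fields, there is no obvious output schema when the field names differ from row to row.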
No, not today. Feel free to take it.
I think #13607 closes this.
Yes, this can be closed.