Ability to use expression as `pattern` for `Expr.str` methods `extract`, `count_match` & `extract_groups`

Open wdoppenberg opened this issue 2 years ago • 5 comments

Problem description

I was trying to do substring matching based on two columns, but got stuck because count_match only accepts a str argument for pattern. A workaround is to use extract_all, which does support an Expr for pattern, and count the length of the returned list. However, that builds a list of every match just to count them, so it feels like I'm leaving quite some performance on the table.

An example of what the expected functionality would do:

import polars as pl

df = pl.DataFrame({"foo": ["123 bla 45 asd", "xyz 678 910t"], "bar": [r"\d", r"[a-z]"]})
print(
    df.select(
        pl.col("foo").str.count_match(pl.col("bar")).alias("count_bar")
    )
)

Should return:

shape: (2, 1)
┌──────────────┐
│ count_bar    │
│ ---          │
│ u32          │
╞══════════════╡
│ 5            │
│ 4            │
└──────────────┘

I would love to help out, but I haven't contributed before.


Progress

  • [X] count_match
  • [ ] extract
  • [ ] extract_groups

wdoppenberg avatar Sep 01 '23 07:09 wdoppenberg

Most of the steps needed are shown in some previous PRs if you're interested: https://github.com/pola-rs/polars/pull/6355/files

.extract_groups() is a bit different though, as it returns a struct - so multiple patterns would be ambiguous.

cmdlineluser avatar Sep 01 '23 10:09 cmdlineluser

Great! Will take a look after the weekend.

wdoppenberg avatar Sep 01 '23 13:09 wdoppenberg

@wdoppenberg Are you also already working on extract? If not I'll take over and do that one.

Also, I believe we should not do extract_groups, it does not really make much sense with a varying pattern from a schema perspective.
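To illustrate the schema concern, here is a small sketch using Python's standard `re` module rather than Polars itself. The two patterns and the `struct_fields` helper are hypothetical, and the field-naming rule (group name if present, otherwise the group index) is an assumption made for illustration only:

```python
import re

# Hypothetical per-row patterns with different capture groups.
patterns = [r"(?P<num>\d+)", r"(?P<word>[a-z]+) (?P<num>\d+)"]


def struct_fields(pattern: str) -> list[str]:
    """Return the field names a struct built from this pattern would need."""
    compiled = re.compile(pattern)
    # Map group index -> group name for the named groups.
    index_to_name = {idx: name for name, idx in compiled.groupindex.items()}
    # Unnamed groups fall back to their positional index as a string.
    return [index_to_name.get(i, str(i)) for i in range(1, compiled.groups + 1)]


fields = [struct_fields(p) for p in patterns]
for pattern, names in zip(patterns, fields):
    print(pattern, "->", names)
```

Each pattern implies a different set of struct fields, so a column holding varying patterns would not map to a single output dtype that can be known before the data is read.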

orlp avatar Sep 05 '23 11:09 orlp

No, not today. Feel free to take it.

wdoppenberg avatar Sep 05 '23 11:09 wdoppenberg

I think #13607 closes this.

cmdlineluser avatar Jan 11 '24 10:01 cmdlineluser

Yes, this can be closed.

orlp avatar Apr 16 '24 12:04 orlp