polars
polars copied to clipboard
feat(rust,python): add new `str.find` expression, returning the index of a regex pattern or literal substring
Implements expressified str.find
string functionality for Expr
and Series
, returning the index/position of a given pattern (regex or literal) in the underlying string. Returns ~~-1
~~ (updated) None
if the pattern is not found.
(Also closes #13552 by adding a note about requiring a capture group in the given regex in the extract
and extract_groups
docstrings).
Example
import polars as pl
df = pl.DataFrame(
data=[
("Dubai", 3564931, "b[ai]"),
("Abu Dhabi", 1807000, "b[ai]"),
("Sharjah", 1405000, "[ai]n"),
("Al Ain", 846747, "[ai]+n"),
("Ajman", 490035, "[ai]n"),
("Ras Al Khaimah", 191753, "a.+a"),
("Fujairah", 118933, "a.+a"),
("Umm Al Quwain", 59098, "a.+a"),
],
schema={"city": pl.String, "population": pl.Int32, "pattern": pl.String},
)
df.with_columns(
rx_a = pl.col("city").str.find("(?i)a"),
lit_a = pl.col("city").str.find("a", literal=True),
lit_00 = pl.col("population").cast(pl.String).str.find("00", literal=True),
find_col = pl.col("city").str.find(pl.col("pattern")),
)
# shape: (8, 7)
# ┌────────────────┬────────────┬─────────┬──────┬───────┬────────┬──────────┐
# │ city ┆ population ┆ pattern ┆ rx_a ┆ lit_a ┆ lit_00 ┆ find_col │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ i32 ┆ str ┆ u32 ┆ u32 ┆ u32 ┆ u32 │
# ╞════════════════╪════════════╪═════════╪══════╪═══════╪════════╪══════════╡
# │ Dubai ┆ 3564931 ┆ b[ai] ┆ 3 ┆ 3 ┆ null ┆ 2 │
# │ Abu Dhabi ┆ 1807000 ┆ b[ai] ┆ 0 ┆ 6 ┆ 4 ┆ 7 │
# │ Sharjah ┆ 1405000 ┆ [ai]n ┆ 2 ┆ 2 ┆ 4 ┆ null │
# │ Al Ain ┆ 846747 ┆ [ai]n ┆ 0 ┆ null ┆ null ┆ 4 │
# │ Ajman ┆ 490035 ┆ [ai]n ┆ 0 ┆ 3 ┆ 2 ┆ 3 │
# │ Ras Al Khaimah ┆ 191753 ┆ a.+a ┆ 1 ┆ 1 ┆ null ┆ 1 │
# │ Fujairah ┆ 118933 ┆ a.+a ┆ 3 ┆ 3 ┆ null ┆ 3 │
# │ Umm Al Quwain ┆ 59098 ┆ a.+a ┆ 4 ┆ 10 ┆ null ┆ null │
# └────────────────┴────────────┴─────────┴──────┴───────┴────────┴──────────┘
nice!!
But one thing I would definetely change is returning -1
when the pattern is not found.
Returning those sentinal values is very dangenous because it hides errors (if you are not 100% familiar with the function you might miss is because there is no Error/None)!
Rust (using Option/Result) and polars handle this generally very good!
I recommend returning null
if the pattern is not found. This is also in line with other functions like Expr.list.get
which return None/null
if out of bounds.
Or is there anything I miss?
One thing I would definitely change is returning -1 Or is there anything I miss?
If you do that then you can't distinguish between not finding the pattern in a valid string and applying find
to null values. Both cases would return null, but I think these should probably be different results as we are operating in frame context rather than pure Rust-only context (where returning Option/None would indeed be the expected behaviour) 🤔
For example, calling "find" against a string in Python also returns -1
if a match is not found:
"abc".find("x")
-1
SQL's instr
and/or position
methods also distinguish between the two cases; applied to NULL they return NULL, but if they can't find a valid match in a string they return zero (as SQL is 1-indexed - which I really dislike, but such is life ;)
ah I see, thanks for explaining 😉
I'm not completely convinced that returning -1 when the match is not found is superior to returning null. The distinguishing argument is kind of moot since you can just use .is_null()
on the input if necessary. I also find it sketchy when it is combined with offsets, e.g. x.find("foo") + 10
.
I'm not completely convinced that returning -1 when the match is not found is superior to returning null. The distinguishing argument is kind of moot since you can just use
.is_null()
on the input if necessary. I also find it sketchy when it is combined with offsets, e.g.x.find("foo") + 10
.
True enough; it's not a hill I'd die on - there's also an .index
method in Python that raises an error if the substring doesn't exist, which is (hand-wavingly) analogous to returning null (though I'm not at all convinced that we should follow suit and have two almost identical methods ;). Would CSE eliminate the cost of running a potenitially expensive expression twice? (Once to check if it's null, then once more to apply the find
if it isn't?)
@ritchie46, want to tie-break this one? There are pros/cons to both (I like distinguishing between "not found" and "applied to null" at a glance), but @orlp's point about combining with offsets is quite compelling 🤔
@alexander-beedie Another argument in favour of null is if you try to compute any statistics on the found index, e.g. 'average text length before the first link' with pl.col(c).str.find("https?://").mean()
. Including -1
there would give bad results.
@alexander-beedie Another argument in favour of null is if you try to compute any statistics on the found index, e.g. 'average text length before the first link' with
pl.col(c).str.find("https?://").mean()
. Including-1
there would give bad results.
I feel myself being persuaded 🤣
Even more disastrous is the minimum, e.g. trying to find the webpage with the shortest header: pl.col("html").str.find("</head>").arg_min()
.
The tipping point has been reached; I'm changing it 😆 (For bonus points the result type can be downsized from an i64 to a u32 now it doesn't have to handle a negative value).
Done; further streamlined a few things while I was at it... Thanks for the suggestions guys 👌