polars icon indicating copy to clipboard operation
polars copied to clipboard

feat(rust,python): add new `str.find` expression, returning the index of a regex pattern or literal substring

Open alexander-beedie opened this issue 1 year ago • 10 comments

Implements expressified str.find string functionality for Expr and Series, returning the index/position of a given pattern (regex or literal) in the underlying string. Returns ~~-1~~ (updated) None if the pattern is not found.

(Also closes #13552 by adding a note about requiring a capture group in the given regex in the extract and extract_groups docstrings).

Example

import polars as pl

df = pl.DataFrame(
    data=[
        ("Dubai", 3564931, "b[ai]"),
        ("Abu Dhabi", 1807000, "b[ai]"),
        ("Sharjah", 1405000, "[ai]n"),
        ("Al Ain", 846747, "[ai]+n"),
        ("Ajman", 490035, "[ai]n"),
        ("Ras Al Khaimah", 191753, "a.+a"),
        ("Fujairah", 118933, "a.+a"),
        ("Umm Al Quwain", 59098, "a.+a"),
    ],
    schema={"city": pl.String, "population": pl.Int32, "pattern": pl.String},
)

df.with_columns(
    rx_a = pl.col("city").str.find("(?i)a"),
    lit_a = pl.col("city").str.find("a", literal=True),
    lit_00 = pl.col("population").cast(pl.String).str.find("00", literal=True),
    find_col = pl.col("city").str.find(pl.col("pattern")),
)
# shape: (8, 7)
# ┌────────────────┬────────────┬─────────┬──────┬───────┬────────┬──────────┐
# │ city           ┆ population ┆ pattern ┆ rx_a ┆ lit_a ┆ lit_00 ┆ find_col │
# │ ---            ┆ ---        ┆ ---     ┆ ---  ┆ ---   ┆ ---    ┆ ---      │
# │ str            ┆ i32        ┆ str     ┆ u32  ┆ u32   ┆ u32    ┆ u32      │
# ╞════════════════╪════════════╪═════════╪══════╪═══════╪════════╪══════════╡
# │ Dubai          ┆ 3564931    ┆ b[ai]   ┆ 3    ┆ 3     ┆ null   ┆ 2        │
# │ Abu Dhabi      ┆ 1807000    ┆ b[ai]   ┆ 0    ┆ 6     ┆ 4      ┆ 7        │
# │ Sharjah        ┆ 1405000    ┆ [ai]n   ┆ 2    ┆ 2     ┆ 4      ┆ null     │
# │ Al Ain         ┆ 846747     ┆ [ai]n   ┆ 0    ┆ null  ┆ null   ┆ 4        │
# │ Ajman          ┆ 490035     ┆ [ai]n   ┆ 0    ┆ 3     ┆ 2      ┆ 3        │
# │ Ras Al Khaimah ┆ 191753     ┆ a.+a    ┆ 1    ┆ 1     ┆ null   ┆ 1        │
# │ Fujairah       ┆ 118933     ┆ a.+a    ┆ 3    ┆ 3     ┆ null   ┆ 3        │
# │ Umm Al Quwain  ┆ 59098      ┆ a.+a    ┆ 4    ┆ 10    ┆ null   ┆ null     │
# └────────────────┴────────────┴─────────┴──────┴───────┴────────┴──────────┘

alexander-beedie avatar Jan 09 '24 12:01 alexander-beedie

nice!! But one thing I would definetely change is returning -1 when the pattern is not found.

Returning those sentinal values is very dangenous because it hides errors (if you are not 100% familiar with the function you might miss is because there is no Error/None)!

Rust (using Option/Result) and polars handle this generally very good! I recommend returning null if the pattern is not found. This is also in line with other functions like Expr.list.get which return None/null if out of bounds.

Or is there anything I miss?

Julian-J-S avatar Jan 09 '24 13:01 Julian-J-S

One thing I would definitely change is returning -1 Or is there anything I miss?

If you do that then you can't distinguish between not finding the pattern in a valid string and applying find to null values. Both cases would return null, but I think these should probably be different results as we are operating in frame context rather than pure Rust-only context (where returning Option/None would indeed be the expected behaviour) 🤔

For example, calling "find" against a string in Python also returns -1 if a match is not found:

"abc".find("x")
-1

SQL's instr and/or position methods also distinguish between the two cases; applied to NULL they return NULL, but if they can't find a valid match in a string they return zero (as SQL is 1-indexed - which I really dislike, but such is life ;)

alexander-beedie avatar Jan 09 '24 13:01 alexander-beedie

ah I see, thanks for explaining 😉

Julian-J-S avatar Jan 09 '24 15:01 Julian-J-S

I'm not completely convinced that returning -1 when the match is not found is superior to returning null. The distinguishing argument is kind of moot since you can just use .is_null() on the input if necessary. I also find it sketchy when it is combined with offsets, e.g. x.find("foo") + 10.

orlp avatar Jan 09 '24 16:01 orlp

I'm not completely convinced that returning -1 when the match is not found is superior to returning null. The distinguishing argument is kind of moot since you can just use .is_null() on the input if necessary. I also find it sketchy when it is combined with offsets, e.g. x.find("foo") + 10.

True enough; it's not a hill I'd die on - there's also an .index method in Python that raises an error if the substring doesn't exist, which is (hand-wavingly) analogous to returning null (though I'm not at all convinced that we should follow suit and have two almost identical methods ;). Would CSE eliminate the cost of running a potenitially expensive expression twice? (Once to check if it's null, then once more to apply the find if it isn't?)

@ritchie46, want to tie-break this one? There are pros/cons to both (I like distinguishing between "not found" and "applied to null" at a glance), but @orlp's point about combining with offsets is quite compelling 🤔

alexander-beedie avatar Jan 09 '24 16:01 alexander-beedie

@alexander-beedie Another argument in favour of null is if you try to compute any statistics on the found index, e.g. 'average text length before the first link' with pl.col(c).str.find("https?://").mean(). Including -1 there would give bad results.

orlp avatar Jan 09 '24 16:01 orlp

@alexander-beedie Another argument in favour of null is if you try to compute any statistics on the found index, e.g. 'average text length before the first link' with pl.col(c).str.find("https?://").mean(). Including -1 there would give bad results.

I feel myself being persuaded 🤣

alexander-beedie avatar Jan 09 '24 16:01 alexander-beedie

Even more disastrous is the minimum, e.g. trying to find the webpage with the shortest header: pl.col("html").str.find("</head>").arg_min().

orlp avatar Jan 09 '24 16:01 orlp

The tipping point has been reached; I'm changing it 😆 (For bonus points the result type can be downsized from an i64 to a u32 now it doesn't have to handle a negative value).

alexander-beedie avatar Jan 09 '24 16:01 alexander-beedie

Done; further streamlined a few things while I was at it... Thanks for the suggestions guys 👌

alexander-beedie avatar Jan 09 '24 19:01 alexander-beedie