str.split() should support regex

Open indigoviolet opened this issue 3 years ago • 2 comments

Problem Description

I want to tokenize a string column, and there are multiple split characters; I believe my current options are to

  • .apply()
  • go through multiple explode()/str.split passes
  • chain a bunch of flatten() and str.split()

It would be nicer to have rsplit or regex support in split itself (contains and replace both already support regex).

It would also be nice to have list-flattening support (i.e. not explode, but taking a nested list and un-nesting it).
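
For illustration, the requested behaviour can be sketched with Python's standard `re` module rather than polars: a single regex split handles several separator characters in one pass, and `itertools.chain` flattens a nested list without exploding it. The sample rows and the separator set `[,;\s]+` are assumptions chosen for the example.

```python
import re
from itertools import chain

# Hypothetical sample data with three kinds of separators.
rows = ["a,b;c d", "e;;f"]

# One pattern covers commas, semicolons and whitespace (runs are collapsed).
separators = re.compile(r"[,;\s]+")

# One list of tokens per row, analogous to a list[str] column.
tokens = [separators.split(row) for row in rows]

# Flatten the nested list into a single list of tokens.
flat = list(chain.from_iterable(tokens))

print(tokens)
print(flat)
```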

indigoviolet avatar Sep 11 '22 02:09 indigoviolet

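As a workaround, can you replace the regex matches with something static and then split on that?

Like with_column(pl.col(yourcol).str.replace_all(r'\d{1,2}', '|D|D|D|D').str.split('|D|D|D|D')) (replace_all rather than replace, so that every match is substituted and not just the first one).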
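
The idea can be sketched in plain Python (standard library only, not polars): replace every regex match with a sentinel that cannot occur in the data, then split on the sentinel literally. The sentinel `"\x00"`, the helper name `split_via_sentinel` and the sample string are assumptions for illustration.

```python
import re

# Assumed never to occur in the data; pick something else if it can.
SENTINEL = "\x00"

def split_via_sentinel(text, pattern):
    """Emulate a regex split using replace-all followed by a literal split."""
    # Every match must be replaced, otherwise later separators survive.
    marked = re.sub(pattern, SENTINEL, text)
    return marked.split(SENTINEL)

parts = split_via_sentinel("a1b22c", r"\d{1,2}")
print(parts)
```

The approach only works if the sentinel is guaranteed not to appear in the column, which is why a real regex split would be more robust.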

deanm0000 avatar Oct 20 '22 16:10 deanm0000

Just bumped into this.

The workaround was to use .extract_all() then .replace(), which is mostly equivalent.

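import polars as pl

df = pl.DataFrame({
    "data": ["AB one ABB two ABBBBBB three ABBBBBBBB"]
})

pattern = r"AB+"

# Match everything up to and including the next separator (or end of string),
# then strip the separator from each match.
# Note: .arr was the list namespace at the time; newer polars uses .list.
df.select(
    pl.col("data")
      .str.extract_all(rf".*?({pattern}|$)")
      .arr.eval(
          pl.all().str.replace(pattern, ""),
          parallel=True)
)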

shape: (1, 1)
┌──────────────────────────────┐
│ data                         │
│ ---                          │
│ list[str]                    │
╞══════════════════════════════╡
│ ["", " one ", ... " three "] │
└──────────────────────────────┘

Seems like it could be useful if it worked like the other .extract() / .replace() methods with a literal: bool option to disable regex matching.
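
The same extract-then-strip idea, sketched with Python's standard `re` module instead of polars, shows why it approximates a split: each match captures the text up to and including the next separator (or the end of the string), and removing the separator leaves the piece in between. Empty-match handling differs between regex engines, so the trailing empty string produced by Python's `re` here is not present in the polars output above.

```python
import re

data = "AB one ABB two ABBBBBB three ABBBBBBBB"
pattern = r"AB+"

# Each chunk is "text before the separator" + "separator" (or the empty tail).
chunks = [m.group(0) for m in re.finditer(rf".*?(?:{pattern}|$)", data)]

# Strip the (single) separator from each chunk.
pieces = [re.sub(pattern, "", chunk, count=1) for chunk in chunks]

print(pieces)
print(pieces == re.split(pattern, data))
```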

cmdlineluser avatar Dec 22 '22 10:12 cmdlineluser

Python's split works a bit differently from polars' split: in Python, consecutive split characters are collapsed.

In Python, 'hello    world' becomes ['hello', 'world'] if you split on space, whereas in polars there would be an extra list entry for each additional space. At times it is helpful to handle multiple split characters in a row, though.

evbo avatar Sep 23 '23 15:09 evbo

@evbo That's only if you do not supply a sep, is it not?

'hello    world'.split() # sep=None
# ['hello', 'world']

'hello    world'.split(' ')
# ['hello', '', '', '', 'world']

pl.select(pl.lit('hello    world').str.split(' ')).item()
# shape: (5,)
# Series: '' [str]
# [
# 	"hello"
# 	""
# 	""
# 	""
# 	"world"
# ]
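
For comparison, a regex separator gives the collapsing behaviour while still naming the separator explicitly, which is what regex support in split would allow. This is a sketch with Python's standard `re` module, not a polars API.

```python
import re

# "hello" and "world" separated by four spaces.
text = "hello" + " " * 4 + "world"

# Literal separator: one empty string per extra space.
literal_split = text.split(" ")

# Regex separator: a run of spaces counts as a single separator.
regex_split = re.split(r" +", text)

print(literal_split)
print(regex_split)
```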

cmdlineluser avatar Sep 23 '23 15:09 cmdlineluser

@cmdlineluser thanks. I should have clarified that I meant the Rust API, where this is not currently (documented as) supported. If you try to pass lit(Null {}) to split, it complains that it must have a UTF8 Expr:

SchemaMismatch( ErrString( "invalid series dtype: expected Utf8, got null", ), )

evbo avatar Sep 23 '23 20:09 evbo

I found this, which worked well for my case: https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.str.extract_groups.html

I did: extract_groups(pattern).struct.rename_fields(["a", "b", "c"]).alias("fields"), and then unnest("fields").
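
When the number of fields is fixed, capture groups can stand in for a split. A rough standard-library analogue of that approach (not polars itself) is sketched below; the pattern, the separators `[,;]` and the field names a, b, c are hypothetical.

```python
import re

# Three named fields separated by runs of commas or semicolons (assumed format).
pattern = re.compile(r"(?P<a>\w+)[,;]+(?P<b>\w+)[,;]+(?P<c>\w+)")

match = pattern.fullmatch("x1,y2;;z3")

# One entry per capture group, similar in spirit to a struct with named fields.
fields = match.groupdict() if match else {}

print(fields)
```

Unlike a true split, this only works when every row has the same number of fields.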

TheWizier avatar Nov 30 '23 13:11 TheWizier

I would accept a PR on this, provided we can keep the non-regex fast path.
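
One way to read that requirement, sketched in plain Python rather than in polars' Rust internals: a hypothetical `literal` flag selects between plain substring splitting and regex splitting, so literal separators never touch the regex engine. The function name `split_values` and its signature are assumptions for illustration, not an existing polars API.

```python
import re

def split_values(values, by, literal=True):
    """Split each string in `values` on `by` (hypothetical helper)."""
    if literal:
        # Fast path: plain substring split, no regex engine involved.
        return [value.split(by) for value in values]
    # Regex path: compile once, reuse for every value.
    compiled = re.compile(by)
    return [compiled.split(value) for value in values]

literal_result = split_values(["a.b.c"], ".")
regex_result = split_values(["a1b22c"], r"\d+", literal=False)

print(literal_result)
print(regex_result)
```

Defaulting to `literal=True` in this sketch keeps existing callers on the fast path; whether the real default should be literal or regex is a design decision the thread leaves open.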

ritchie46 avatar Jan 02 '24 13:01 ritchie46

Also, the regex parser used by polars doesn't appear to support look-ahead/look-behind, which I feel is important for splitting: I often want to split on a zero-length token, for example between text and numbers.

ComputeError: regex error: regex parse error:
    .*?((?<=[a-zA-Z])(?=\d)|$)
        ^^^^
error: look-around, including look-ahead and look-behind, is not supported

Note this is part of a regex I use frequently in a Hugging Face (i.e. Rust-backed) tokenizer, so the regex engine they use supports look-around.

Edit: Hugging Face uses Oniguruma rather than the Rust regex engine - https://github.com/huggingface/tokenizers/issues/1057
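
For reference, this is the kind of zero-width split in question, sketched with Python's standard `re` module (a backtracking engine that supports look-around; splitting on empty matches needs Python 3.7+). The second pattern is an assumption added for illustration and does not come from the thread: in the simple letters/digits case, matching the runs themselves avoids look-around altogether, so it is the kind of pattern that could be passed to an extract-all style method instead.

```python
import re

text = "abc123def456"

# Zero-width boundary: preceded by a letter, followed by a digit.
boundary = re.compile(r"(?<=[a-zA-Z])(?=\d)")
split_parts = boundary.split(text)

# Look-around-free alternative: extract runs of letters or runs of digits.
# Note this also separates digit-to-letter transitions, unlike the split above.
runs = re.findall(r"[a-zA-Z]+|\d+", text)

print(split_parts)
print(runs)
```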

david-waterworth avatar Mar 06 '24 04:03 david-waterworth

@david-waterworth I think they picked the one they did because look-arounds are relatively slow, as they're recursive. One could build a plugin that uses the other regex engine.

deanm0000 avatar Mar 06 '24 13:03 deanm0000