datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Add additional regexp function regexp_count()

Open xinlifoobar opened this issue 1 year ago • 1 comments

Which issue does this PR close?

Closes #12079 and part of #11946

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

xinlifoobar avatar Aug 20 '24 13:08 xinlifoobar

Finalizing the details takes a lot more effort than expected. Would you like to take a look? Thanks! @alamb @jayzhan211

Benchmark

regexp_count_1000 string
                        time:   [6.6158 ms 6.6634 ms 6.7108 ms]

regexp_count_1000 utf8view
                        time:   [6.7117 ms 6.7647 ms 6.8183 ms]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

regexp_like_1000        time:   [3.7056 ms 3.7170 ms 3.7289 ms]
                        change: [-5.9843% -5.0861% -4.1952%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

regexp_match_1000       time:   [4.4132 ms 4.4287 ms 4.4466 ms]
                        change: [-9.8318% -8.4331% -7.0280%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

regexp_replace_1000     time:   [3.4351 ms 3.4697 ms 3.5142 ms]
                        change: [-13.734% -11.848% -10.019%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe

xinlifoobar avatar Aug 29 '24 05:08 xinlifoobar

Thanks for the very nice PR @xinlifoobar ! I'll try and take the time to do a full review of this PR next week if no one beats me to it.

Omega359 avatar Sep 28 '24 22:09 Omega359

I think this is a good PR and worthy of merging into main. My only thoughts are some small things noted in my comments and the fact that counts seems to be twice as slow as like and replace. I was able to reproduce the benchmark but unfortunately I cannot run flamegraph on my machine (perf and WSL is a black art) so I wasn't really able to narrow down the cause.

Omega359 avatar Oct 02 '24 22:10 Omega359

@alamb - since @xinlifoobar seems to be dormant my thoughts on this PR is to merge it in and file a couple of tickets to improve it, primarily the performance discrepancy.

Omega359 avatar Oct 10 '24 15:10 Omega359

Makes sense to me -- thank you @Omega359 -- would you be willing to make a new PR (merged/ rebased to fix the conflicts on main)? I can help with the review / merge / file follow on tickets.

alamb avatar Oct 11 '24 14:10 alamb

Sure thing. I'll hopefully work on it this weekend after #12149

Omega359 avatar Oct 11 '24 14:10 Omega359

Merged in https://github.com/apache/datafusion/pull/12970

alamb avatar Oct 18 '24 20:10 alamb