Add additional regexp function regexp_count()
Which issue does this PR close?
Closes #12079 and part of #11946
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
Finalizing the details takes a lot more effort than expected. Would you like to take a look? Thanks! @alamb @jayzhan211
Benchmark
regexp_count_1000 string
time: [6.6158 ms 6.6634 ms 6.7108 ms]
regexp_count_1000 utf8view
time: [6.7117 ms 6.7647 ms 6.8183 ms]
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
regexp_like_1000 time: [3.7056 ms 3.7170 ms 3.7289 ms]
change: [-5.9843% -5.0861% -4.1952%] (p = 0.00 < 0.05)
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high mild
regexp_match_1000 time: [4.4132 ms 4.4287 ms 4.4466 ms]
change: [-9.8318% -8.4331% -7.0280%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
3 (3.00%) high mild
2 (2.00%) high severe
regexp_replace_1000 time: [3.4351 ms 3.4697 ms 3.5142 ms]
change: [-13.734% -11.848% -10.019%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
1 (1.00%) high mild
6 (6.00%) high severe
Thanks for the very nice PR @xinlifoobar ! I'll try and take the time to do a full review of this PR next week if no one beats me to it.
I think this is a good PR and worthy of merging into main. My only thoughts are some small things noted in my comments and the fact that counts seems to be twice as slow as like and replace. I was able to reproduce the benchmark but unfortunately I cannot run flamegraph on my machine (perf and WSL is a black art) so I wasn't really able to narrow down the cause.
@alamb - since @xinlifoobar seems to be dormant my thoughts on this PR is to merge it in and file a couple of tickets to improve it, primarily the performance discrepancy.
Makes sense to me -- thank you @Omega359 -- would you be willing to make a new PR (merged/ rebased to fix the conflicts on main)? I can help with the review / merge / file follow on tickets.
Sure thing. I'll hopefully work on it this weekend after #12149
Merged in https://github.com/apache/datafusion/pull/12970