velox
velox copied to clipboard
Optimize like function for exact, prefix and suffix patterns
Optimize like function for the following common cases:
- Constant pattern search: Simple string comparison character by character
- Pattern has constant prefix followed by one or more %: Match the prefix only
- Pattern has constant suffix preceded by one or more %: Match the suffix only
These cases are most common in TPC-H queries.
ideas from DuckDb implementation
Detail it out, what is the idea and how is it optimized with some example.
Please add more details about the KMP algorithm in the description of the PR.
Deploy Preview for meta-velox canceled.
Name | Link |
---|---|
Latest commit | 2ac0aa7c0b02e984586e27315e84fe985196a116 |
Latest deploy log | https://app.netlify.com/sites/meta-velox/deploys/6324a7a561e5f4000816f345 |
Thanks for the inputs @mbasmanova. The benchmark changes are in a separate PR (#1838), shall I integrate it with these changes? The following benchmarks were obtained:
| 4000 runs (time in minutes) | | | |
|:---------------------------:|:------------------:|:---------------------:|:---------------:|
| Query | With optimizations | Without optimizations | Improvement (%) |
| q2 | 0.93 | 1.15 | 19.13 |
| q9 | 1.17 | 1.39 | 15.83 |
| q13 | 2.14 | 2.13 | -0.47 |
| q14 | 0.852 | 0.76 | -12.11 |
| q16_1 | 0.846 | 0.843 | -0.36 |
| q16_2 | 2.1 | 2.12 | 0.94 |
| q20 | 0.775 | 0.795 | 2.52 |
@pramodsatya Re: benchmarks - I was hoping we can create a standalone benchmark that runs LIKE operator for a few vectors and compares the performance of RE2-based implementation with optimized implementation.
Query-wide benchmarks are also useful, but would be outside of the scope of this PR. BTW, I see that some queries regressed with the optimization. That's surprising. Do you know why is that?
The benchmark is added and the following numbers were obtained:
| No. of runs | Pattern Kind | With optimization PR (s) | Without optimization PR (s) | % Change |
|-------------|---------------|--------------------------|-----------------------------|----------|
| 500 | Wildcard only | 14.76 | 15.13 | 2.45 |
| | Fixed | 4.2 | 4.17 | -0.72 |
| | Prefix | 9.92 | 12.5 | 20.64 |
| | Suffix | 8.44 | 8.88 | 4.95 |
| 1000 | Wildcard only | 24.44 | 23.73 | -2.99 |
| | Fixed | 6.16 | 6.49 | 5.08 |
| | Prefix | 14.24 | 18.08 | 21.24 |
| | Suffix | 13.11 | 14.09 | 6.96 |
| 5000 | Wildcard only | 109.2 | 114.6 | 4.71 |
| | Fixed | 31.45 | 32.72 | 3.88 |
| | Prefix | 72.6 | 91.2 | 20.39 |
| | Suffix | 64.8 | 70.8 | 8.47 |
I am not sure why the performance for query 14 regressed with the changes, the data for 1000 runs computed earlier for Tpch table columns doesn't show a performance degradation. The Tpch benchmarks shared earlier were computed about two months ago and it needs to be checked again with the current PR.
| 1000 runs (time in seconds) | | | |
|:---------------------------:|:------------------:|:---------------------:|:-----------:|
| Query | With optimizations | Without optimizations | Savings (%) |
| q2 | 15.16 | 24.84 | 38.97 |
| q9 | 18.39 | 22.37 | 17.79 |
| q13 | 37.24 | 35.3 | -5.5 |
| q14 | 11.98 | 12.42 | 3.54 |
| q16_1 | 11.88 | 12.65 | 6.09 |
| q16_2 | 32.57 | 33.49 | 2.75 |
| q20 | 12.17 | 11.84 | -2.79 |
@pramodsatya
The following table is confusing to me. What are there two separate columns "Savings (%)" and "% Change". What values do they show?
I'm not seeing regression for q14. Time w/ opt is 11.98 and w/o 12.42. Where do you see the regression? q13 and q20 numbers do suggest a regression. Would you investigate? CC: @aditi-pandit
I am not sure why the performance for query 14 regressed with the changes, the data for 1000 runs computed earlier for Tpch table columns doesn't show a performance degradation. The Tpch benchmarks shared earlier were computed about two months ago and it needs to be checked again with the current PR.
| 1000 runs (time in seconds) | | | | % Change |
|:---------------------------:|:------------------:|:---------------------:|:-----------:|----------|
| Query | With optimizations | Without optimizations | Savings (%) | 2.45 |
| q2 | 15.16 | 24.84 | 38.97 | -0.72 |
| q9 | 18.39 | 22.37 | 17.79 | 20.64 |
| q13 | 37.24 | 35.3 | -5.5 | 4.95 |
| q14 | 11.98 | 12.42 | 3.54 | -2.99 |
| q16_1 | 11.88 | 12.65 | 6.09 | 5.08 |
| q16_2 | 32.57 | 33.49 | 2.75 | 21.24 |
| q20 | 12.17 | 11.84 | -2.79 | 6.96 |
Apologies for the confusion, the column "Savings (%)" has the right numbers and the column "% Change" was inadvertently copied from the previous table. I have updated the table now. Yes, q13 does show a regression in both runs, I will obtain the benchmarks with the current changes for Tpch tables and look into why the regression is seen.
@pramodsatya q9 shows a gain of 17%, but is uses LIKE pattern that is not being optimized: p.name like '%green%'
. How is this possible?
For q9, the gain is seen because these benchmarks were obtained with this PR and the changes from #2453 combined. The changes were later split into two separate PRs and subsequently revised in accordance with the review comments. I will obtain the benchmarks with the current changes after addressing the pending review comments and share them shortly. Thank you and apologies for the confusion.
@pramodsatya The regressions might be due to a new switch in the hot path. A fix could be to turn LikeConstantPattern into a template parameterized on PatternKind and move pattern analysis logic into makeLike.
@pramodsatya The regressions might be due to a new switch in the hot path. A fix could be to turn LikeConstantPattern into a template parameterized on PatternKind and move pattern analysis logic into makeLike.
The benchmarks for Tpch queries having the like operator and patterns optimized for in this PR are as follows:
| No. of rows | Query Number | With optimization PR (s) | Without optimization PR (s) | % Change |
|-------------|--------------|--------------------------|-----------------------------|----------|
| 1000 | 2 | 13.78 | 17.74 | 22.32 |
| | 14 | 11.95 | 12.49 | 4.32 |
| | 16 | 12.12 | 12.48 | 2.88 |
| | 20 | 11.97 | 12.57 | 4.77 |
| 2000 | 2 | 26.59 | 32.2 | 17.42 |
| | 14 | 23.1 | 23.72 | 2.61 |
| | 16 | 23.46 | 23.71 | 1.05 |
| | 20 | 23.28 | 23.36 | 0.34 |
| 4000 | 2 | 51.33 | 63.6 | 19.29 |
| | 14 | 46.09 | 47.32 | 2.6 |
| | 16 | 46.84 | 47.15 | 0.66 |
| | 20 | 46.09 | 46.56 | 1.01 |
| 8000 | 2 | 104.4 | 136.2 | 23.35 |
| | 14 | 91.8 | 95.4 | 3.77 |
| | 16 | 93 | 95.4 | 2.52 |
| | 20 | 92.4 | 94.2 | 1.91 |
Query 2 shows the most improvement in performance at ~20%. This is consistent with the Tpch benchmarks obtained with the earlier version of this PR, where query 2 and 9 had the most gains. Number of rows here is the vector size of the Tpch table generated with helper functions from TpchGen.cpp. It is better to have LikeConstantPattern templated on PatternKind as suggested earlier, instead of the switch case on patternKind in the match function:
| No. of rows | Query Number | With optimization PR (s) | Without optimization PR (s) | Without templated LikeConstantPattern (s) |
|-------------|--------------|--------------------------|-----------------------------|-------------------------------------------|
| 8000 | 2 | 104.4 | 136.2 | 109.8 |
| | 14 | 91.8 | 95.4 | 97.8 |
| | 16 | 93 | 95.4 | 100.2 |
| | 20 | 92.4 | 94.2 | 91.8 |
@pramodsatya Thank you for the update on the benchmarks.
This is consistent with the Tpch benchmarks obtained with the earlier version of this PR, where query 2 and 9 had the most gains.
The new results do not include q9. Any particular reason it is not included?
Q9 is not included here because the pattern in Q9, '%green%', is not targeted in this PR. Patterns seen in Q9, Q13, Q16 ('%Customer%Complaints%') and Q20 are optimized for in #2453.
Thanks benchmarks looks good to me!
@laithsakka has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
are numbers in the summary up to date? @pramodsatya
Hi @laithsakka, the numbers in summary are from the version where parsing of fmt::format("like(c0, '{}')", patternString) was not moved within the suspender block. I will update the numbers with the latest changes now. Thanks!
Updated numbers in the summary, please let me know if any further changes are needed, thanks!
@laithsakka has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Updated numbers in the summary, please let me know if any further changes are needed, thanks!
Thanks, the numbers look amazing!
@laithsakka has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.