velox Optimize like function for exact, prefix and suffix patterns

Optimize like function for the following common cases:

Constant pattern search: Simple string comparison character by character
Pattern has constant prefix followed by one or more %: Match the prefix only
Pattern has constant suffix preceded by one or more %: Match the suffix only

These cases are most common in TPC-H queries.

Jun 04 '22 00:06 pramodsatya

ideas from DuckDb implementation

Detail it out, what is the idea and how is it optimized with some example.

Jun 16 '22 17:06 karteekmurthys

Please add more details about the KMP algorithm in the description of the PR.

Jun 23 '22 17:06 aditi-pandit

Deploy Preview for meta-velox canceled.

Name	Link
Latest commit	2ac0aa7c0b02e984586e27315e84fe985196a116
Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/6324a7a561e5f4000816f345

Aug 25 '22 22:08 netlify[bot]

Thanks for the inputs @mbasmanova. The benchmark changes are in a separate PR (#1838), shall I integrate it with these changes? The following benchmarks were obtained:

| 4000 runs (time in minutes) |                    |                       |                 |   
|:---------------------------:|:------------------:|:---------------------:|:---------------:|
|            Query            | With optimizations | Without optimizations | Improvement (%) |   
|              q2             |        0.93        |          1.15         |      19.13      |   
|              q9             |        1.17        |          1.39         |      15.83      |   
|             q13             |        2.14        |          2.13         |      -0.47      |   
|             q14             |        0.852       |          0.76         |      -12.11     |   
|            q16_1            |        0.846       |         0.843         |      -0.36      |   
|            q16_2            |         2.1        |          2.12         |       0.94      |   
|             q20             |        0.775       |         0.795         |       2.52      |

Aug 30 '22 22:08 pramodsatya

@pramodsatya Re: benchmarks - I was hoping we can create a standalone benchmark that runs LIKE operator for a few vectors and compares the performance of RE2-based implementation with optimized implementation.

Query-wide benchmarks are also useful, but would be outside of the scope of this PR. BTW, I see that some queries regressed with the optimization. That's surprising. Do you know why is that?

Aug 30 '22 22:08 mbasmanova

The benchmark is added and the following numbers were obtained:

| No. of runs | Pattern Kind  | With optimization PR (s) | Without optimization PR (s) | % Change |
|-------------|---------------|--------------------------|-----------------------------|----------|
|         500 | Wildcard only |                    14.76 |                       15.13 |     2.45 |
|             | Fixed         |                      4.2 |                        4.17 |    -0.72 |
|             | Prefix        |                     9.92 |                        12.5 |    20.64 |
|             | Suffix        |                     8.44 |                        8.88 |     4.95 |
|        1000 | Wildcard only |                    24.44 |                       23.73 |    -2.99 |
|             | Fixed         |                     6.16 |                        6.49 |     5.08 |
|             | Prefix        |                    14.24 |                       18.08 |    21.24 |
|             | Suffix        |                    13.11 |                       14.09 |     6.96 |
|        5000 | Wildcard only |                    109.2 |                       114.6 |     4.71 |
|             | Fixed         |                    31.45 |                       32.72 |     3.88 |
|             | Prefix        |                     72.6 |                        91.2 |    20.39 |
|             | Suffix        |                     64.8 |                        70.8 |     8.47 |

I am not sure why the performance for query 14 regressed with the changes, the data for 1000 runs computed earlier for Tpch table columns doesn't show a performance degradation. The Tpch benchmarks shared earlier were computed about two months ago and it needs to be checked again with the current PR.

| 1000 runs (time in seconds) |                    |                       |             |
|:---------------------------:|:------------------:|:---------------------:|:-----------:|
|            Query            | With optimizations | Without optimizations | Savings (%) |
|              q2             |        15.16       |         24.84         |    38.97    |
|              q9             |        18.39       |         22.37         |    17.79    |
|             q13             |        37.24       |          35.3         |     -5.5    | 
|             q14             |        11.98       |         12.42         |     3.54    |
|            q16_1            |        11.88       |         12.65         |     6.09    |
|            q16_2            |        32.57       |         33.49         |     2.75    |
|             q20             |        12.17       |         11.84         |    -2.79    |

Sep 01 '22 21:09 pramodsatya

@pramodsatya

The following table is confusing to me. What are there two separate columns "Savings (%)" and "% Change". What values do they show?

I'm not seeing regression for q14. Time w/ opt is 11.98 and w/o 12.42. Where do you see the regression? q13 and q20 numbers do suggest a regression. Would you investigate? CC: @aditi-pandit

I am not sure why the performance for query 14 regressed with the changes, the data for 1000 runs computed earlier for Tpch table columns doesn't show a performance degradation. The Tpch benchmarks shared earlier were computed about two months ago and it needs to be checked again with the current PR.

| 1000 runs (time in seconds) |                    |                       |             | % Change |
|:---------------------------:|:------------------:|:---------------------:|:-----------:|----------|
|            Query            | With optimizations | Without optimizations | Savings (%) |     2.45 |
|              q2             |        15.16       |         24.84         |    38.97    |    -0.72 |
|              q9             |        18.39       |         22.37         |    17.79    |    20.64 |
|             q13             |        37.24       |          35.3         |     -5.5    |     4.95 |
|             q14             |        11.98       |         12.42         |     3.54    |    -2.99 |
|            q16_1            |        11.88       |         12.65         |     6.09    |     5.08 |
|            q16_2            |        32.57       |         33.49         |     2.75    |    21.24 |
|             q20             |        12.17       |         11.84         |    -2.79    |     6.96 |

Sep 06 '22 15:09 mbasmanova

Apologies for the confusion, the column "Savings (%)" has the right numbers and the column "% Change" was inadvertently copied from the previous table. I have updated the table now. Yes, q13 does show a regression in both runs, I will obtain the benchmarks with the current changes for Tpch tables and look into why the regression is seen.

Sep 06 '22 15:09 pramodsatya

@pramodsatya q9 shows a gain of 17%, but is uses LIKE pattern that is not being optimized: p.name like '%green%'. How is this possible?

Sep 06 '22 16:09 mbasmanova

For q9, the gain is seen because these benchmarks were obtained with this PR and the changes from #2453 combined. The changes were later split into two separate PRs and subsequently revised in accordance with the review comments. I will obtain the benchmarks with the current changes after addressing the pending review comments and share them shortly. Thank you and apologies for the confusion.

Sep 06 '22 16:09 pramodsatya

@pramodsatya The regressions might be due to a new switch in the hot path. A fix could be to turn LikeConstantPattern into a template parameterized on PatternKind and move pattern analysis logic into makeLike.

Sep 06 '22 16:09 mbasmanova

@pramodsatya The regressions might be due to a new switch in the hot path. A fix could be to turn LikeConstantPattern into a template parameterized on PatternKind and move pattern analysis logic into makeLike.

Sep 06 '22 16:09 mbasmanova

The benchmarks for Tpch queries having the like operator and patterns optimized for in this PR are as follows:

| No. of rows | Query Number | With optimization PR (s) | Without optimization PR (s) | % Change | 
|-------------|--------------|--------------------------|-----------------------------|----------|
|        1000 |            2 |                    13.78 |                       17.74 |    22.32 | 
|             |           14 |                    11.95 |                       12.49 |     4.32 |     
|             |           16 |                    12.12 |                       12.48 |     2.88 |     
|             |           20 |                    11.97 |                       12.57 |     4.77 |     
|        2000 |            2 |                    26.59 |                        32.2 |    17.42 |
|             |           14 |                     23.1 |                       23.72 |     2.61 |     
|             |           16 |                    23.46 |                       23.71 |     1.05 |    
|             |           20 |                    23.28 |                       23.36 |     0.34 |    
|        4000 |            2 |                    51.33 |                        63.6 |    19.29 | 
|             |           14 |                    46.09 |                       47.32 |      2.6 |      
|             |           16 |                    46.84 |                       47.15 |     0.66 |     
|             |           20 |                    46.09 |                       46.56 |     1.01 |     
|        8000 |            2 |                    104.4 |                       136.2 |    23.35 |
|             |           14 |                     91.8 |                        95.4 |     3.77 | 
|             |           16 |                       93 |                        95.4 |     2.52 | 
|             |           20 |                     92.4 |                        94.2 |     1.91 |

Query 2 shows the most improvement in performance at ~20%. This is consistent with the Tpch benchmarks obtained with the earlier version of this PR, where query 2 and 9 had the most gains. Number of rows here is the vector size of the Tpch table generated with helper functions from TpchGen.cpp. It is better to have LikeConstantPattern templated on PatternKind as suggested earlier, instead of the switch case on patternKind in the match function:

| No. of rows | Query Number | With optimization PR (s) | Without optimization PR (s) | Without templated LikeConstantPattern (s) | 
|-------------|--------------|--------------------------|-----------------------------|-------------------------------------------|
|        8000 |            2 |                    104.4 |                       136.2 |                                     109.8 |    
|             |           14 |                     91.8 |                        95.4 |                                      97.8 |     
|             |           16 |                       93 |                        95.4 |                                     100.2 |    
|             |           20 |                     92.4 |                        94.2 |                                      91.8 |

Sep 09 '22 04:09 pramodsatya

@pramodsatya Thank you for the update on the benchmarks.

This is consistent with the Tpch benchmarks obtained with the earlier version of this PR, where query 2 and 9 had the most gains.

The new results do not include q9. Any particular reason it is not included?

Sep 09 '22 04:09 mbasmanova

Q9 is not included here because the pattern in Q9, '%green%', is not targeted in this PR. Patterns seen in Q9, Q13, Q16 ('%Customer%Complaints%') and Q20 are optimized for in #2453.

Sep 09 '22 04:09 pramodsatya

Thanks benchmarks looks good to me!

Sep 19 '22 17:09 laithsakka

@laithsakka has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Sep 20 '22 14:09 facebook-github-bot

are numbers in the summary up to date? @pramodsatya

Sep 20 '22 22:09 laithsakka

Hi @laithsakka, the numbers in summary are from the version where parsing of fmt::format("like(c0, '{}')", patternString) was not moved within the suspender block. I will update the numbers with the latest changes now. Thanks!

Sep 20 '22 22:09 pramodsatya

Updated numbers in the summary, please let me know if any further changes are needed, thanks!

Sep 20 '22 23:09 pramodsatya

@laithsakka has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Sep 21 '22 18:09 facebook-github-bot

Updated numbers in the summary, please let me know if any further changes are needed, thanks!

Thanks, the numbers look amazing!

Sep 21 '22 18:09 laithsakka

@laithsakka has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Sep 21 '22 18:09 facebook-github-bot

velox velox copied to clipboard

Optimize like function for exact, prefix and suffix patterns

✅ Deploy Preview for meta-velox canceled.

velox
velox copied to clipboard

Deploy Preview for meta-velox canceled.