feat: support the SQL RLIKE expression
close #862
Development
- [x] SQL syntax https://github.com/4paradigm/zetasql/pull/41
- [x] extend BinaryExpr to support RLIKE type
- [x] add regexp_like builtin function
- [x] Codegen: deal with RLIKE BinaryExpr
Test
- [x] ~~logic plan test in ast_node_converter_test to verify correctly plan generated for rlike expr (discuss needed)~~
- [x] unit test for new functions in udf.cc -> udf_test.cc
- [x] udf_ir_builder_test for the registered function regexp_like udf functions
- [x] expr_ir_builder_test for the supported rlike expression
- [ ] and end2end, a full sql example to test correctly for both rlike expression and regexp_like function 99% (discuss needed)
Codecov Report
Merging #2187 (e7f1851) into main (f368158) will increase coverage by
0.01%. The diff coverage is90.90%.
@@ Coverage Diff @@
## main #2187 +/- ##
============================================
+ Coverage 75.87% 75.88% +0.01%
Complexity 368 368
============================================
Files 629 629
Lines 119285 119495 +210
Branches 1022 1022
============================================
+ Hits 90509 90684 +175
- Misses 28560 28595 +35
Partials 216 216
| Impacted Files | Coverage Δ | |
|---|---|---|
| hybridse/src/codegen/expr_ir_builder.h | 100.00% <ø> (ø) |
|
| hybridse/src/udf/udf.h | 100.00% <ø> (ø) |
|
| hybridse/src/codegen/expr_ir_builder.cc | 78.30% <79.54%> (+0.08%) |
:arrow_up: |
| hybridse/src/udf/udf.cc | 85.64% <84.61%> (-0.06%) |
:arrow_down: |
| hybridse/src/udf/udf_test.cc | 97.63% <92.98%> (-0.43%) |
:arrow_down: |
| hybridse/include/node/sql_node.h | 83.50% <100.00%> (+0.03%) |
:arrow_up: |
| hybridse/src/codegen/expr_ir_builder_test.cc | 98.27% <100.00%> (+0.03%) |
:arrow_up: |
| hybridse/src/codegen/udf_ir_builder_test.cc | 99.52% <100.00%> (+0.01%) |
:arrow_up: |
| hybridse/src/node/expr_node.cc | 82.32% <100.00%> (+0.36%) |
:arrow_up: |
| hybridse/src/planv2/ast_node_converter.cc | 94.43% <100.00%> (+0.01%) |
:arrow_up: |
| ... and 23 more |
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
SDK Test Report
79 files 79 suites 6m 10s :stopwatch: 167 tests 165 :heavy_check_mark: 2 :zzz: 0 :x: 209 runs 207 :heavy_check_mark: 2 :zzz: 0 :x:
Results for commit e7f18516.
:recycle: This comment has been updated with latest results.
HybridSE Linux Test Report
19 669 tests 19 667 :heavy_check_mark: 5m 27s :stopwatch: 237 suites 2 :zzz: 67 files 0 :x:
Results for commit e7f18516.
:recycle: This comment has been updated with latest results.
HybridSE Mac Test Report
19 669 tests 19 667 :heavy_check_mark: 7m 14s :stopwatch: 237 suites 2 :zzz: 67 files 0 :x:
Results for commit e7f18516.
:recycle: This comment has been updated with latest results.
Linux Test Report
55 files 200 suites 52m 20s :stopwatch: 9 334 tests 9 327 :heavy_check_mark: 7 :zzz: 0 :x: 13 725 runs 13 718 :heavy_check_mark: 7 :zzz: 0 :x:
Results for commit df1007d2.
:recycle: This comment has been updated with latest results.
@jiang1997 thank you for your contribution. It seems you are still working on it, we suggest you convert it to draft first? Let us know when you are ready to be reviewed.
RLIKE keyword is not supported by sql parser yet, you can just working on the regexp_like udf function here.
Feel free to discuss
RLIKEkeyword is not supported by sql parser yet, you can just working on theregexp_likeudf function here. Feel free to discuss
Thanks. Actually, I'm also interested in the parser part.
@jiang1997 thank you for your contribution. It seems you are still working on it, we suggest you convert it to draft first? Let us know when you are ready to be reviewed.
Got it.
RLIKEkeyword is not supported by sql parser yet, you can just working on theregexp_likeudf function here. Feel free to discussThanks. Actually, I'm also interested in the parser part.
If u like afterwards, have a reference for ilike: https://github.com/4paradigm/zetasql/pull/30.
Probably need a few time to understand how build :)
@aceforeverd
RLIKEkeyword is not supported by sql parser yet, you can just working on theregexp_likeudf function here. Feel free to discussThanks. Actually, I'm also interested in the parser part.
If u like afterwards, have a reference for
ilike: 4paradigm/zetasql#30. Probably need a few time to understand how build :)
This will help a lot.
I encountered an issue. With the regex of STL, there is a feature that can't be implemented in the normal way. refers to https://docs.snowflake.com/en/sql-reference/functions-regexp.html#label-regexp-parameters-argument
| Parameter | Effect |
|---|---|
| c | Enables case-sensitive matching. |
| i | Enables case-insensitive matching. |
| m | Enables multi-line mode (i.e. meta-characters ^ and $ mark the beginning and end of any line of the subject). By default, multi-line mode is disabled (i.e. ^ and $ mark the beginning and end of the entire subject). |
| e | Extracts sub-matches; applies only to REGEXP_INSTR, REGEXP_SUBSTR, REGEXP_SUBSTR_ALL, and the aliases for these functions. |
| s | Enables the POSIX wildcard character . to match \n. By default, wildcard character matching is disabled. |
This says by default the character '.' of a pattern can't match '\n'. But when using regex of STL, '.' always matches '\n'. Considering it's just the first version/draft, could we ignore the flag s?
I encountered an issue. With the regex of STL, there is a feature that can't be implemented in the normal way. refers to https://docs.snowflake.com/en/sql-reference/functions-regexp.html#label-regexp-parameters-argument
Parameter Effect c Enables case-sensitive matching. i Enables case-insensitive matching. m Enables multi-line mode (i.e. meta-characters ^ and $ mark the beginning and end of any line of the subject). By default, multi-line mode is disabled (i.e. ^ and $ mark the beginning and end of the entire subject). e Extracts sub-matches; applies only to REGEXP_INSTR, REGEXP_SUBSTR, REGEXP_SUBSTR_ALL, and the aliases for these functions. s
Enables the POSIX wildcard character . to match \n. By default, wildcard character matching is disabled.This says by default the character '.' of a pattern can't match '\n'. But when using regex of STL, '.' always matches '\n'. Considering it's just the first version/draft, could we ignore the flag s?
posix regex do not match, stl regex construct regex default with ECMAScript, maybe that's the difference ? see https://en.cppreference.com/w/cpp/regex/basic_regex
@aceforeverd Yea, I thought so too at first. Below description is quoted from this link https://en.cppreference.com/w/cpp/regex/basic_regex.
At most one grammar option must be chosen out of ECMAScript, basic, extended, awk, grep, egrep.
If no grammar is chosen, ECMAScript is assumed to be selected.
I actually tried all of these options and none of them complies with the behaviour we want. Here is the code used to test.
#include <iostream>
#include <string>
#include <regex>
int main()
{
std::string s = "a\nb";
std::regex self_regex("a.b", std::regex_constants::basic);
if (std::regex_match(s, self_regex)) {
std::cout << s << endl;
}
return 0;
}
Before this problem is pinned down, I will do other work first.
@aceforeverd Yea, I thought so too at first. Below description is quoted from this link https://en.cppreference.com/w/cpp/regex/basic_regex.
At most one grammar option must be chosen out of ECMAScript, basic, extended, awk, grep, egrep. If no grammar is chosen, ECMAScript is assumed to be selected.I actually tried all of these options and none of them complies with the behaviour we want. Here is the code used to test.
#include <iostream> #include <string> #include <regex> int main() { std::string s = "a\nb"; std::regex self_regex("a.b", std::regex_constants::basic); if (std::regex_match(s, self_regex)) { std::cout << s << endl; } return 0; }Before this problem is pinned down, I will do other work first.
How about ECMAScript ? In my mac, it will not match newline
int main() {
std::string s = "a\nb";
std::regex self_regex("a.b", std::regex_constants::ECMAScript);
if (std::regex_match(s, self_regex)) {
std::cout << "matched" << std::endl;
std::cout << s << std::endl;
} else {
std::cout << "not match" << std::endl;
}
return 0;
}
posix regex 's <period> match <newline>, both BRE and ERE, see
- https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03:~:text=as%20an%20anchor.-,9.3.4,-Periods%20in%20BREs
- https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04:~:text=as%20an%20anchor.-,9.4.4%20Periods%20in%20EREs,-A%20%3Cperiod%3E%20(
e flag can ignored here ?
And do you think if any other thirdparty lib works here, e.g google's re2 lib has the i, m, s` flags that seems same here.
I was told std::regex is too complex 🗿 (from internet
@zhanghaohit @vagetablechicken any experienced thoughts ?
@aceforeverd Yea, I thought so too at first. Below description is quoted from this link https://en.cppreference.com/w/cpp/regex/basic_regex.
At most one grammar option must be chosen out of ECMAScript, basic, extended, awk, grep, egrep. If no grammar is chosen, ECMAScript is assumed to be selected.I actually tried all of these options and none of them complies with the behaviour we want. Here is the code used to test.
#include <iostream> #include <string> #include <regex> int main() { std::string s = "a\nb"; std::regex self_regex("a.b", std::regex_constants::basic); if (std::regex_match(s, self_regex)) { std::cout << s << endl; } return 0; }Before this problem is pinned down, I will do other work first.
How about
ECMAScript? In my mac, it will not match newlineint main() { std::string s = "a\nb"; std::regex self_regex("a.b", std::regex_constants::ECMAScript); if (std::regex_match(s, self_regex)) { std::cout << "matched" << std::endl; std::cout << s << std::endl; } else { std::cout << "not match" << std::endl; } return 0; }posix regex 's
<period>match<newline>, both BRE and ERE, see
- https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03:~:text=as%20an%20anchor.-,9.3.4,-Periods%20in%20BREs
- https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04:~:text=as%20an%20anchor.-,9.4.4%20Periods%20in%20EREs,-A%20%3Cperiod%3E%20(
It's my fault for ignoring ECMAScript. Seems 'e' does not involve in rlike. It's ok to leave it out. I will try google's re2 lib later.
Great. Note re2 already included in thirdparty.
Now it can be built with https://github.com/4paradigm/zetasql/pull/41.
@jiang1997 great job so far! Yet in the meantime, the feature need enough tests to ensure correctness. With the diffs so far, you can add at least:
- logic plan test in
ast_node_converter_testto verify correctly plan generated for rlike expr - unit test for new functions in
udf.cc->udf_test.cc udf_ir_builder_testfor the registered functionregexp_likeudf functionsexpr_ir_builder_testfor the supported rlike expression- and end2end, a full sql example to test correctly for both rlike expression and
regexp_likefunction\
As usual, you can take the like and ilike pr as a reference. Also it's good practice to cover every piece of changed code, with the help of coverage report :)
In the other hand, it may not necessary to test detail rules of regex expression as it just too much, simple a few most important flags and some basic regex match should be fine.
@jiang1997 great job so far! Yet in the meantime, the feature need enough tests to ensure correctness. With the diffs so far, you can add at least:
- logic plan test in
ast_node_converter_testto verify correctly plan generated for rlike expr- unit test for new functions in
udf.cc->udf_test.ccudf_ir_builder_testfor the registered functionregexp_likeudf functionsexpr_ir_builder_testfor the supported rlike expression- and end2end, a full sql example to test correctly for both rlike expression and
regexp_likefunction\As usual, you can take the
likeandilikepr as a reference. Also it's good practice to cover every piece of changed code, with the help of coverage report :)
Thanks for your pointers.
The test case for like inside ast_node_converter_test is related to keyword show. Additional modification is needed in zetasql to add test case for rlike. So should I propose changes to zetasql again?
I found that for the 3 modes batch/request/batch request , when test rlike/regexp_like or like/like_match used in SQL ON clause, everything works fine. But when test in SQL WHERE clause, rlike/regexp_like and like/like_match both failed in request/batch request mode.
Here is how I test in request mode and use like_match inside WHERE clause.
First, create like.yaml as below.
db: test_zw
debugs: []
sqlDialect: ["HybridSQL"]
cases:
- id: 5
desc: udf regexp_like with flags
inputs:
- columns: ["id int", "ts timestamp", "col2 string"]
indexs: ["index1:id:ts"]
rows:
- [1, 1590115420000, "aa"]
- [2, 1590115420000, "bb"]
- [3, 1590115420000, "cc"]
sql: |
select id,
from {0}
where like_match(col2, "aa");
expect:
columns: ["id int"]
order: id
rows:
- [1]
Then run
./build/hybridse/examples/toydb/src/toydb_run_engine --runner_mode "batch request" --yaml_path like.yaml
Get output as below.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0802 08:15:53.109465 26294 toydb_engine_test_base.cc:71] Init Toy DB Engine & Catalog
I0802 08:15:53.109545 26294 engine_test_base.cc:378] Compile SQL:
select id,
from auto_t0
where like_match(col2, "aa");
I0802 08:15:53.109673 26294 default_udf_library.cc:46] Creating DefaultUdfLibrary
W0802 08:15:53.116494 26294 sql_compiler.cc:329] Fail create sql plan: Non-support kFilterPlan Op in online serving
(At /root/OpenMLDB/hybridse/src/planv2/planner_v2.cc:44)
(At /root/OpenMLDB/hybridse/src/plan/planner.cc:663)
(At /root/OpenMLDB/hybridse/src/plan/planner.cc:503)
(At /root/OpenMLDB/hybridse/src/plan/planner.cc:503)
(At /root/OpenMLDB/hybridse/src/plan/planner.cc:508)
(Caused by) Non-support kFilterPlan Op in online serving
I0802 08:15:53.116533 26294 engine_test_base.cc:398] SQL Compile take 6.98 milliseconds
I0802 08:15:53.116550 26294 engine_test_base.cc:401] Non-support kFilterPlan Op in online serving
(At /root/OpenMLDB/hybridse/src/planv2/planner_v2.cc:44)
(At /root/OpenMLDB/hybridse/src/plan/planner.cc:663)
(At /root/OpenMLDB/hybridse/src/plan/planner.cc:503)
(At /root/OpenMLDB/hybridse/src/plan/planner.cc:503)
(At /root/OpenMLDB/hybridse/src/plan/planner.cc:508)
(Caused by) Non-support kFilterPlan Op in online serving
/root/OpenMLDB/hybridse/src/testing/engine_test_base.cc:427: Failure
Expected equality of these values:
sql_case_.expect().success_
Which is: true
status.isOK()
Which is: false
@aceforeverd
Is there anything I have not done? Or no need to test rlike/regexp_like in WHERE clause.
Yes, no need. It's expected as a compile time error, where clause is not supported in online serving mode ( request mode and batch request mode).
I found that for the 3 modes
batch/request/batch request, when testrlike/regexp_likeorlike/like_matchused in SQLONclause, everything works fine. But when test in SQLWHEREclause,rlike/regexp_likeandlike/like_matchboth failed inrequest/batch requestmode.Here is how I test in
requestmode and uselike_matchinsideWHEREclause. First, createlike.yamlas below.db: test_zw debugs: [] sqlDialect: ["HybridSQL"] cases: - id: 5 desc: udf regexp_like with flags inputs: - columns: ["id int", "ts timestamp", "col2 string"] indexs: ["index1:id:ts"] rows: - [1, 1590115420000, "aa"] - [2, 1590115420000, "bb"] - [3, 1590115420000, "cc"] sql: | select id, from {0} where like_match(col2, "aa"); expect: columns: ["id int"] order: id rows: - [1]Then run
./build/hybridse/examples/toydb/src/toydb_run_engine --runner_mode "batch request" --yaml_path like.yamlGet output as below.
WARNING: Logging before InitGoogleLogging() is written to STDERR I0802 08:15:53.109465 26294 toydb_engine_test_base.cc:71] Init Toy DB Engine & Catalog I0802 08:15:53.109545 26294 engine_test_base.cc:378] Compile SQL: select id, from auto_t0 where like_match(col2, "aa"); I0802 08:15:53.109673 26294 default_udf_library.cc:46] Creating DefaultUdfLibrary W0802 08:15:53.116494 26294 sql_compiler.cc:329] Fail create sql plan: Non-support kFilterPlan Op in online serving (At /root/OpenMLDB/hybridse/src/planv2/planner_v2.cc:44) (At /root/OpenMLDB/hybridse/src/plan/planner.cc:663) (At /root/OpenMLDB/hybridse/src/plan/planner.cc:503) (At /root/OpenMLDB/hybridse/src/plan/planner.cc:503) (At /root/OpenMLDB/hybridse/src/plan/planner.cc:508) (Caused by) Non-support kFilterPlan Op in online serving I0802 08:15:53.116533 26294 engine_test_base.cc:398] SQL Compile take 6.98 milliseconds I0802 08:15:53.116550 26294 engine_test_base.cc:401] Non-support kFilterPlan Op in online serving (At /root/OpenMLDB/hybridse/src/planv2/planner_v2.cc:44) (At /root/OpenMLDB/hybridse/src/plan/planner.cc:663) (At /root/OpenMLDB/hybridse/src/plan/planner.cc:503) (At /root/OpenMLDB/hybridse/src/plan/planner.cc:503) (At /root/OpenMLDB/hybridse/src/plan/planner.cc:508) (Caused by) Non-support kFilterPlan Op in online serving /root/OpenMLDB/hybridse/src/testing/engine_test_base.cc:427: Failure Expected equality of these values: sql_case_.expect().success_ Which is: true status.isOK() Which is: false@aceforeverd Is there anything I have not done? Or no need to test
rlike/regexp_likeinWHEREclause.
The test case for
likeinsideast_node_converter_testis related to keywordshow. Additional modification is needed in zetasql to add test case forrlike. So should I propose changes to zetasql again?
Not fully understand. Detail explanation ?
The test case for
likeinsideast_node_converter_testis related to keywordshow. Additional modification is needed in zetasql to add test case forrlike. So should I propose changes to zetasql again?Not fully understand. Detail explanation ?
It's kind of complicated to make rlike behave exactly as same as like in ast_node_converter_test, and several files need to be modified in zetasql.
Here are the details. https://github.com/4paradigm/zetasql/pull/42.
For the modifications needed in OpenMLDB. Here https://github.com/jiang1997/OpenMLDB/commit/410f8a75020b7020c4f3be0c2f873b257ba5520e.
Can you accept this solution?
I found that with this line, this pr can pass all test cases invoked by toydb_run_engine, but can not by bash steps/ut.sh sql_sdk_test 0 (that is run_sql_sdk_test in Actions).
When change this line to - [3, 1590115420002, "The Lord of the Rings\\nJ. R. R. Tolkien"] (add extra back slash), this pr can pass all test cases invoked by bash steps/ut.sh sql_sdk_test 0, but can not by toydb_run_engine.
There must to two different ways to parse a string with escape characters. How can I cope with this? Leave out this problem-causing test case or modify toydb_run_engine?
The test case for
likeinsideast_node_converter_testis related to keywordshow. Additional modification is needed in zetasql to add test case forrlike. So should I propose changes to zetasql again?Not fully understand. Detail explanation ?
It's kind of complicated to make
rlikebehave exactly as same aslikeinast_node_converter_test, and several files need to be modified in zetasql. Here are the details. 4paradigm/zetasql#42. For the modifications needed inOpenMLDB. Here jiang1997@410f8a7. Can you accept this solution?
No need though, show stmt only have like clause or where clause, see https://dev.mysql.com/doc/refman/8.0/en/show.html, put rlike expr insidewhere clause. Anyway, we currently do not implement the like clause or where clause