OpenMLDB icon indicating copy to clipboard operation
OpenMLDB copied to clipboard

feat: support the SQL RLIKE expression

Open jiang1997 opened this issue 3 years ago • 33 comments

close #862

Development

  • [x] SQL syntax https://github.com/4paradigm/zetasql/pull/41
  • [x] extend BinaryExpr to support RLIKE type
  • [x] add regexp_like builtin function
  • [x] Codegen: deal with RLIKE BinaryExpr

Test

  • [x] ~~logic plan test in ast_node_converter_test to verify correctly plan generated for rlike expr (discuss needed)~~
  • [x] unit test for new functions in udf.cc -> udf_test.cc
  • [x] udf_ir_builder_test for the registered function regexp_like udf functions
  • [x] expr_ir_builder_test for the supported rlike expression
  • [ ] and end2end, a full sql example to test correctly for both rlike expression and regexp_like function 99% (discuss needed)

jiang1997 avatar Jul 17 '22 17:07 jiang1997

Codecov Report

Merging #2187 (e7f1851) into main (f368158) will increase coverage by 0.01%. The diff coverage is 90.90%.

@@             Coverage Diff              @@
##               main    #2187      +/-   ##
============================================
+ Coverage     75.87%   75.88%   +0.01%     
  Complexity      368      368              
============================================
  Files           629      629              
  Lines        119285   119495     +210     
  Branches       1022     1022              
============================================
+ Hits          90509    90684     +175     
- Misses        28560    28595      +35     
  Partials        216      216              
Impacted Files Coverage Δ
hybridse/src/codegen/expr_ir_builder.h 100.00% <ø> (ø)
hybridse/src/udf/udf.h 100.00% <ø> (ø)
hybridse/src/codegen/expr_ir_builder.cc 78.30% <79.54%> (+0.08%) :arrow_up:
hybridse/src/udf/udf.cc 85.64% <84.61%> (-0.06%) :arrow_down:
hybridse/src/udf/udf_test.cc 97.63% <92.98%> (-0.43%) :arrow_down:
hybridse/include/node/sql_node.h 83.50% <100.00%> (+0.03%) :arrow_up:
hybridse/src/codegen/expr_ir_builder_test.cc 98.27% <100.00%> (+0.03%) :arrow_up:
hybridse/src/codegen/udf_ir_builder_test.cc 99.52% <100.00%> (+0.01%) :arrow_up:
hybridse/src/node/expr_node.cc 82.32% <100.00%> (+0.36%) :arrow_up:
hybridse/src/planv2/ast_node_converter.cc 94.43% <100.00%> (+0.01%) :arrow_up:
... and 23 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

codecov[bot] avatar Jul 17 '22 17:07 codecov[bot]

SDK Test Report

  79 files    79 suites   6m 10s :stopwatch: 167 tests 165 :heavy_check_mark: 2 :zzz: 0 :x: 209 runs  207 :heavy_check_mark: 2 :zzz: 0 :x:

Results for commit e7f18516.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Jul 17 '22 18:07 github-actions[bot]

HybridSE Linux Test Report

19 669 tests   19 667 :heavy_check_mark:  5m 27s :stopwatch:      237 suites           2 :zzz:        67 files             0 :x:

Results for commit e7f18516.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Jul 17 '22 18:07 github-actions[bot]

HybridSE Mac Test Report

19 669 tests   19 667 :heavy_check_mark:  7m 14s :stopwatch:      237 suites           2 :zzz:        67 files             0 :x:

Results for commit e7f18516.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Jul 17 '22 18:07 github-actions[bot]

Linux Test Report

       55 files       200 suites   52m 20s :stopwatch:   9 334 tests   9 327 :heavy_check_mark: 7 :zzz: 0 :x: 13 725 runs  13 718 :heavy_check_mark: 7 :zzz: 0 :x:

Results for commit df1007d2.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Jul 17 '22 18:07 github-actions[bot]

@jiang1997 thank you for your contribution. It seems you are still working on it, we suggest you convert it to draft first? Let us know when you are ready to be reviewed.

lumianph avatar Jul 18 '22 01:07 lumianph

RLIKE keyword is not supported by sql parser yet, you can just working on the regexp_like udf function here. Feel free to discuss

aceforeverd avatar Jul 18 '22 01:07 aceforeverd

RLIKE keyword is not supported by sql parser yet, you can just working on the regexp_like udf function here. Feel free to discuss

Thanks. Actually, I'm also interested in the parser part.

jiang1997 avatar Jul 18 '22 02:07 jiang1997

@jiang1997 thank you for your contribution. It seems you are still working on it, we suggest you convert it to draft first? Let us know when you are ready to be reviewed.

Got it.

jiang1997 avatar Jul 18 '22 02:07 jiang1997

RLIKE keyword is not supported by sql parser yet, you can just working on the regexp_like udf function here. Feel free to discuss

Thanks. Actually, I'm also interested in the parser part.

If u like afterwards, have a reference for ilike: https://github.com/4paradigm/zetasql/pull/30. Probably need a few time to understand how build :)

aceforeverd avatar Jul 18 '22 03:07 aceforeverd

@aceforeverd

RLIKE keyword is not supported by sql parser yet, you can just working on the regexp_like udf function here. Feel free to discuss

Thanks. Actually, I'm also interested in the parser part.

If u like afterwards, have a reference for ilike: 4paradigm/zetasql#30. Probably need a few time to understand how build :)

This will help a lot.

jiang1997 avatar Jul 18 '22 07:07 jiang1997

I encountered an issue. With the regex of STL, there is a feature that can't be implemented in the normal way. refers to https://docs.snowflake.com/en/sql-reference/functions-regexp.html#label-regexp-parameters-argument

Parameter Effect
c Enables case-sensitive matching.
i Enables case-insensitive matching.
m Enables multi-line mode (i.e. meta-characters ^ and $ mark the beginning and end of any line of the subject). By default, multi-line mode is disabled (i.e. ^ and $ mark the beginning and end of the entire subject).
e Extracts sub-matches; applies only to REGEXP_INSTR, REGEXP_SUBSTR, REGEXP_SUBSTR_ALL, and the aliases for these functions.
s Enables the POSIX wildcard character . to match \n. By default, wildcard character matching is disabled.

This says by default the character '.' of a pattern can't match '\n'. But when using regex of STL, '.' always matches '\n'. Considering it's just the first version/draft, could we ignore the flag s?

jiang1997 avatar Jul 20 '22 17:07 jiang1997

I encountered an issue. With the regex of STL, there is a feature that can't be implemented in the normal way. refers to https://docs.snowflake.com/en/sql-reference/functions-regexp.html#label-regexp-parameters-argument

Parameter Effect c Enables case-sensitive matching. i Enables case-insensitive matching. m Enables multi-line mode (i.e. meta-characters ^ and $ mark the beginning and end of any line of the subject). By default, multi-line mode is disabled (i.e. ^ and $ mark the beginning and end of the entire subject). e Extracts sub-matches; applies only to REGEXP_INSTR, REGEXP_SUBSTR, REGEXP_SUBSTR_ALL, and the aliases for these functions. s Enables the POSIX wildcard character . to match \n. By default, wildcard character matching is disabled. This says by default the character '.' of a pattern can't match '\n'. But when using regex of STL, '.' always matches '\n'. Considering it's just the first version/draft, could we ignore the flag s?

posix regex do not match, stl regex construct regex default with ECMAScript, maybe that's the difference ? see https://en.cppreference.com/w/cpp/regex/basic_regex

aceforeverd avatar Jul 21 '22 10:07 aceforeverd

@aceforeverd Yea, I thought so too at first. Below description is quoted from this link https://en.cppreference.com/w/cpp/regex/basic_regex.

At most one grammar option must be chosen out of ECMAScript, basic, extended, awk, grep, egrep. 
If no grammar is chosen, ECMAScript is assumed to be selected. 

I actually tried all of these options and none of them complies with the behaviour we want. Here is the code used to test.

#include <iostream>
#include <string>
#include <regex>

int main()
{
    std::string s = "a\nb";
    std::regex self_regex("a.b", std::regex_constants::basic);
    if (std::regex_match(s, self_regex)) {
        std::cout << s << endl;
    }
    return 0;
}

Before this problem is pinned down, I will do other work first.

jiang1997 avatar Jul 21 '22 10:07 jiang1997

@aceforeverd Yea, I thought so too at first. Below description is quoted from this link https://en.cppreference.com/w/cpp/regex/basic_regex.

At most one grammar option must be chosen out of ECMAScript, basic, extended, awk, grep, egrep. 
If no grammar is chosen, ECMAScript is assumed to be selected. 

I actually tried all of these options and none of them complies with the behaviour we want. Here is the code used to test.

#include <iostream>
#include <string>
#include <regex>

int main()
{
    std::string s = "a\nb";
    std::regex self_regex("a.b", std::regex_constants::basic);
    if (std::regex_match(s, self_regex)) {
        std::cout << s << endl;
    }
    return 0;
}

Before this problem is pinned down, I will do other work first.

How about ECMAScript ? In my mac, it will not match newline

int main() {
  std::string s = "a\nb";
  std::regex self_regex("a.b", std::regex_constants::ECMAScript);
  if (std::regex_match(s, self_regex)) {
    std::cout << "matched" << std::endl;
    std::cout << s << std::endl;
  } else {
    std::cout << "not match" << std::endl;
  }
  return 0;
}

posix regex 's <period> match <newline>, both BRE and ERE, see

  • https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03:~:text=as%20an%20anchor.-,9.3.4,-Periods%20in%20BREs
  • https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04:~:text=as%20an%20anchor.-,9.4.4%20Periods%20in%20EREs,-A%20%3Cperiod%3E%20(

aceforeverd avatar Jul 22 '22 05:07 aceforeverd

e flag can ignored here ?

And do you think if any other thirdparty lib works here, e.g google's re2 lib has the i, m, s` flags that seems same here.

I was told std::regex is too complex 🗿 (from internet

aceforeverd avatar Jul 22 '22 05:07 aceforeverd

@zhanghaohit @vagetablechicken any experienced thoughts ?

aceforeverd avatar Jul 22 '22 05:07 aceforeverd

@aceforeverd Yea, I thought so too at first. Below description is quoted from this link https://en.cppreference.com/w/cpp/regex/basic_regex.

At most one grammar option must be chosen out of ECMAScript, basic, extended, awk, grep, egrep. 
If no grammar is chosen, ECMAScript is assumed to be selected. 

I actually tried all of these options and none of them complies with the behaviour we want. Here is the code used to test.

#include <iostream>
#include <string>
#include <regex>

int main()
{
    std::string s = "a\nb";
    std::regex self_regex("a.b", std::regex_constants::basic);
    if (std::regex_match(s, self_regex)) {
        std::cout << s << endl;
    }
    return 0;
}

Before this problem is pinned down, I will do other work first.

How about ECMAScript ? In my mac, it will not match newline

int main() {
  std::string s = "a\nb";
  std::regex self_regex("a.b", std::regex_constants::ECMAScript);
  if (std::regex_match(s, self_regex)) {
    std::cout << "matched" << std::endl;
    std::cout << s << std::endl;
  } else {
    std::cout << "not match" << std::endl;
  }
  return 0;
}

posix regex 's <period> match <newline>, both BRE and ERE, see

  • https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03:~:text=as%20an%20anchor.-,9.3.4,-Periods%20in%20BREs
  • https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04:~:text=as%20an%20anchor.-,9.4.4%20Periods%20in%20EREs,-A%20%3Cperiod%3E%20(

It's my fault for ignoring ECMAScript. Seems 'e' does not involve in rlike. It's ok to leave it out. I will try google's re2 lib later.

jiang1997 avatar Jul 22 '22 11:07 jiang1997

Great. Note re2 already included in thirdparty.

aceforeverd avatar Jul 22 '22 12:07 aceforeverd

Now it can be built with https://github.com/4paradigm/zetasql/pull/41.

jiang1997 avatar Jul 24 '22 04:07 jiang1997

@jiang1997 great job so far! Yet in the meantime, the feature need enough tests to ensure correctness. With the diffs so far, you can add at least:

  • logic plan test in ast_node_converter_test to verify correctly plan generated for rlike expr
  • unit test for new functions in udf.cc -> udf_test.cc
  • udf_ir_builder_test for the registered function regexp_like udf functions
  • expr_ir_builder_test for the supported rlike expression
  • and end2end, a full sql example to test correctly for both rlike expression and regexp_like function\

As usual, you can take the like and ilike pr as a reference. Also it's good practice to cover every piece of changed code, with the help of coverage report :)

aceforeverd avatar Jul 29 '22 16:07 aceforeverd

In the other hand, it may not necessary to test detail rules of regex expression as it just too much, simple a few most important flags and some basic regex match should be fine.

aceforeverd avatar Jul 29 '22 16:07 aceforeverd

@jiang1997 great job so far! Yet in the meantime, the feature need enough tests to ensure correctness. With the diffs so far, you can add at least:

  • logic plan test in ast_node_converter_test to verify correctly plan generated for rlike expr
  • unit test for new functions in udf.cc -> udf_test.cc
  • udf_ir_builder_test for the registered function regexp_like udf functions
  • expr_ir_builder_test for the supported rlike expression
  • and end2end, a full sql example to test correctly for both rlike expression and regexp_like function\

As usual, you can take the like and ilike pr as a reference. Also it's good practice to cover every piece of changed code, with the help of coverage report :)

Thanks for your pointers.

jiang1997 avatar Jul 31 '22 10:07 jiang1997

The test case for like inside ast_node_converter_test is related to keyword show. Additional modification is needed in zetasql to add test case for rlike. So should I propose changes to zetasql again?

jiang1997 avatar Aug 01 '22 18:08 jiang1997

I found that for the 3 modes batch/request/batch request , when test rlike/regexp_like or like/like_match used in SQL ON clause, everything works fine. But when test in SQL WHERE clause, rlike/regexp_like and like/like_match both failed in request/batch request mode.

Here is how I test in request mode and use like_match inside WHERE clause. First, create like.yaml as below.

db: test_zw
debugs: []
sqlDialect: ["HybridSQL"]
cases:
  - id: 5
    desc: udf regexp_like with flags
    inputs:
      - columns: ["id int", "ts timestamp", "col2 string"]
        indexs: ["index1:id:ts"]
        rows:
          - [1, 1590115420000, "aa"]
          - [2, 1590115420000, "bb"]
          - [3, 1590115420000, "cc"]
    sql: |
      select id,
      from {0}
      where like_match(col2, "aa");
    expect:
      columns: ["id int"]
      order: id
      rows:
            - [1]

Then run

./build/hybridse/examples/toydb/src/toydb_run_engine --runner_mode "batch request" --yaml_path like.yaml

Get output as below.

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0802 08:15:53.109465 26294 toydb_engine_test_base.cc:71] Init Toy DB Engine & Catalog
I0802 08:15:53.109545 26294 engine_test_base.cc:378] Compile SQL:
select id,
from auto_t0
where like_match(col2, "aa");
I0802 08:15:53.109673 26294 default_udf_library.cc:46] Creating DefaultUdfLibrary
W0802 08:15:53.116494 26294 sql_compiler.cc:329] Fail create sql plan: Non-support kFilterPlan Op in online serving
    (At /root/OpenMLDB/hybridse/src/planv2/planner_v2.cc:44)
    (At /root/OpenMLDB/hybridse/src/plan/planner.cc:663)
    (At /root/OpenMLDB/hybridse/src/plan/planner.cc:503)
    (At /root/OpenMLDB/hybridse/src/plan/planner.cc:503)
    (At /root/OpenMLDB/hybridse/src/plan/planner.cc:508)
    (Caused by) Non-support kFilterPlan Op in online serving
I0802 08:15:53.116533 26294 engine_test_base.cc:398] SQL Compile take 6.98 milliseconds
I0802 08:15:53.116550 26294 engine_test_base.cc:401] Non-support kFilterPlan Op in online serving
    (At /root/OpenMLDB/hybridse/src/planv2/planner_v2.cc:44)
    (At /root/OpenMLDB/hybridse/src/plan/planner.cc:663)
    (At /root/OpenMLDB/hybridse/src/plan/planner.cc:503)
    (At /root/OpenMLDB/hybridse/src/plan/planner.cc:503)
    (At /root/OpenMLDB/hybridse/src/plan/planner.cc:508)
    (Caused by) Non-support kFilterPlan Op in online serving
/root/OpenMLDB/hybridse/src/testing/engine_test_base.cc:427: Failure
Expected equality of these values:
  sql_case_.expect().success_
    Which is: true
  status.isOK()
    Which is: false

@aceforeverd Is there anything I have not done? Or no need to test rlike/regexp_like in WHERE clause.

jiang1997 avatar Aug 02 '22 08:08 jiang1997

Yes, no need. It's expected as a compile time error, where clause is not supported in online serving mode ( request mode and batch request mode).

I found that for the 3 modes batch/request/batch request , when test rlike/regexp_like or like/like_match used in SQL ON clause, everything works fine. But when test in SQL WHERE clause, rlike/regexp_like and like/like_match both failed in request/batch request mode.

Here is how I test in request mode and use like_match inside WHERE clause. First, create like.yaml as below.

db: test_zw
debugs: []
sqlDialect: ["HybridSQL"]
cases:
  - id: 5
    desc: udf regexp_like with flags
    inputs:
      - columns: ["id int", "ts timestamp", "col2 string"]
        indexs: ["index1:id:ts"]
        rows:
          - [1, 1590115420000, "aa"]
          - [2, 1590115420000, "bb"]
          - [3, 1590115420000, "cc"]
    sql: |
      select id,
      from {0}
      where like_match(col2, "aa");
    expect:
      columns: ["id int"]
      order: id
      rows:
            - [1]

Then run

./build/hybridse/examples/toydb/src/toydb_run_engine --runner_mode "batch request" --yaml_path like.yaml

Get output as below.

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0802 08:15:53.109465 26294 toydb_engine_test_base.cc:71] Init Toy DB Engine & Catalog
I0802 08:15:53.109545 26294 engine_test_base.cc:378] Compile SQL:
select id,
from auto_t0
where like_match(col2, "aa");
I0802 08:15:53.109673 26294 default_udf_library.cc:46] Creating DefaultUdfLibrary
W0802 08:15:53.116494 26294 sql_compiler.cc:329] Fail create sql plan: Non-support kFilterPlan Op in online serving
    (At /root/OpenMLDB/hybridse/src/planv2/planner_v2.cc:44)
    (At /root/OpenMLDB/hybridse/src/plan/planner.cc:663)
    (At /root/OpenMLDB/hybridse/src/plan/planner.cc:503)
    (At /root/OpenMLDB/hybridse/src/plan/planner.cc:503)
    (At /root/OpenMLDB/hybridse/src/plan/planner.cc:508)
    (Caused by) Non-support kFilterPlan Op in online serving
I0802 08:15:53.116533 26294 engine_test_base.cc:398] SQL Compile take 6.98 milliseconds
I0802 08:15:53.116550 26294 engine_test_base.cc:401] Non-support kFilterPlan Op in online serving
    (At /root/OpenMLDB/hybridse/src/planv2/planner_v2.cc:44)
    (At /root/OpenMLDB/hybridse/src/plan/planner.cc:663)
    (At /root/OpenMLDB/hybridse/src/plan/planner.cc:503)
    (At /root/OpenMLDB/hybridse/src/plan/planner.cc:503)
    (At /root/OpenMLDB/hybridse/src/plan/planner.cc:508)
    (Caused by) Non-support kFilterPlan Op in online serving
/root/OpenMLDB/hybridse/src/testing/engine_test_base.cc:427: Failure
Expected equality of these values:
  sql_case_.expect().success_
    Which is: true
  status.isOK()
    Which is: false

@aceforeverd Is there anything I have not done? Or no need to test rlike/regexp_like in WHERE clause.

aceforeverd avatar Aug 02 '22 08:08 aceforeverd

The test case for like inside ast_node_converter_test is related to keyword show. Additional modification is needed in zetasql to add test case for rlike. So should I propose changes to zetasql again?

Not fully understand. Detail explanation ?

aceforeverd avatar Aug 02 '22 08:08 aceforeverd

The test case for like inside ast_node_converter_test is related to keyword show. Additional modification is needed in zetasql to add test case for rlike. So should I propose changes to zetasql again?

Not fully understand. Detail explanation ?

It's kind of complicated to make rlike behave exactly as same as like in ast_node_converter_test, and several files need to be modified in zetasql. Here are the details. https://github.com/4paradigm/zetasql/pull/42. For the modifications needed in OpenMLDB. Here https://github.com/jiang1997/OpenMLDB/commit/410f8a75020b7020c4f3be0c2f873b257ba5520e. Can you accept this solution?

jiang1997 avatar Aug 03 '22 08:08 jiang1997

I found that with this line, this pr can pass all test cases invoked by toydb_run_engine, but can not by bash steps/ut.sh sql_sdk_test 0 (that is run_sql_sdk_test in Actions). When change this line to - [3, 1590115420002, "The Lord of the Rings\\nJ. R. R. Tolkien"] (add extra back slash), this pr can pass all test cases invoked by bash steps/ut.sh sql_sdk_test 0, but can not by toydb_run_engine.

There must to two different ways to parse a string with escape characters. How can I cope with this? Leave out this problem-causing test case or modify toydb_run_engine?

jiang1997 avatar Aug 04 '22 06:08 jiang1997

The test case for like inside ast_node_converter_test is related to keyword show. Additional modification is needed in zetasql to add test case for rlike. So should I propose changes to zetasql again?

Not fully understand. Detail explanation ?

It's kind of complicated to make rlike behave exactly as same as like in ast_node_converter_test, and several files need to be modified in zetasql. Here are the details. 4paradigm/zetasql#42. For the modifications needed in OpenMLDB. Here jiang1997@410f8a7. Can you accept this solution?

No need though, show stmt only have like clause or where clause, see https://dev.mysql.com/doc/refman/8.0/en/show.html, put rlike expr insidewhere clause. Anyway, we currently do not implement the like clause or where clause

aceforeverd avatar Aug 05 '22 08:08 aceforeverd