[Enhancement] Support Illegal Characters in Regex Named Groups
Description
Currently, the regex-based extraction commands, which use the Java regex library, cannot include special characters such as (-, _, @) in the name of a named capture group that creates a new column in the result set. Here are some related issues:
- https://github.com/opensearch-project/sql/issues/3944
- https://github.com/opensearch-project/sql/issues/4467
PR https://github.com/opensearch-project/sql/pull/4434 unified the error handling for this case across the parse and rex commands. Here is the current behavior:
curl -X POST "localhost:9200/_plugins/_ppl" -H 'Content-Type: application/json' -d'{
"query": "source=accounts | rex field=email \"(?<username>[^@]+)@(?<domain_name>[^.]+)\" | fields email, username, domain_name | head 3"
}' | jq
{
"error": {
"reason": "Invalid Query",
"details": "Invalid capture group name 'domain_name'. Java regex group names must start with a letter and contain only letters and digits.",
"type": "IllegalArgumentException"
},
"status": 400
}
However, in https://github.com/opensearch-project/sql/pull/4434#issuecomment-3399182076, @ykmr1224 pointed out that we should be able to support the invalid characters by rewriting the regex and mapping the extracted values back to the original names.
Expected Behavior
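Group names that Java rejects are rewritten to legal, unique names, and a mapping restores the original names in the output.
e.g.: (?<user_name>.+)(?<username>.+)(?<username1>.+) => (?<username2>.+)(?<username>.+)(?<username1>.+), mapping = {username2 => user_name, username => username, username1 => username1}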
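The rewrite described above could be sketched roughly as follows. This is a minimal illustration, not the plugin's implementation: the class `GroupNameRewriter` and its methods `rewrite` and `extract` are hypothetical names, and the naming scheme (strip illegal characters, then append a counter until the name is unique) is an assumption chosen to reproduce the example mapping. The scanner skips escaped characters, character classes, and lookbehinds, but it does not handle `\Q...\E` quoting or group names that contain `>`.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not the plugin's actual implementation. */
final class GroupNameRewriter {
  /** Names java.util.regex accepts: an ASCII letter followed by ASCII letters or digits. */
  private static final Pattern LEGAL = Pattern.compile("[A-Za-z][A-Za-z0-9]*");

  final String regex;                 // rewritten pattern that Pattern.compile accepts
  final Map<String, String> mapping;  // generated name -> original name, in group order

  private GroupNameRewriter(String regex, Map<String, String> mapping) {
    this.regex = regex;
    this.mapping = mapping;
  }

  static GroupNameRewriter rewrite(String source) {
    // Pass 1: record the [start, end) span of every group name.
    List<int[]> spans = new ArrayList<>();
    int classDepth = 0;
    for (int i = 0; i < source.length(); i++) {
      char c = source.charAt(i);
      if (c == '\\') {
        i++; // skip the escaped character, e.g. "\("
      } else if (c == '[') {
        classDepth++;
      } else if (c == ']' && classDepth > 0) {
        classDepth--;
      } else if (classDepth == 0 && source.startsWith("(?<", i)) {
        int nameStart = i + 3;
        int nameEnd = source.indexOf('>', nameStart);
        // "(?<=" and "(?<!" are lookbehinds, not named groups.
        boolean lookbehind =
            source.startsWith("=", nameStart) || source.startsWith("!", nameStart);
        if (!lookbehind && nameEnd > nameStart) {
          spans.add(new int[] {nameStart, nameEnd});
          i = nameEnd;
        }
      }
    }

    // Reserve names that are already legal so they are never reassigned.
    Set<String> used = new HashSet<>();
    for (int[] s : spans) {
      String name = source.substring(s[0], s[1]);
      if (LEGAL.matcher(name).matches()) {
        used.add(name);
      }
    }

    // Pass 2: replace each illegal name with a unique legal one.
    Map<String, String> mapping = new LinkedHashMap<>();
    StringBuilder out = new StringBuilder();
    int last = 0;
    for (int[] s : spans) {
      String original = source.substring(s[0], s[1]);
      String safe = original;
      if (!LEGAL.matcher(original).matches()) {
        String base = original.replaceAll("[^A-Za-z0-9]", "");
        if (base.isEmpty() || !Character.isLetter(base.charAt(0))) {
          base = "g" + base; // a name must start with a letter
        }
        safe = base;
        for (int n = 1; !used.add(safe); n++) {
          safe = base + n; // append a counter until the name is unused
        }
      }
      mapping.put(safe, original);
      out.append(source, last, s[0]).append(safe);
      last = s[1];
    }
    out.append(source.substring(last));
    return new GroupNameRewriter(out.toString(), mapping);
  }

  /** Matches input with the rewritten regex and keys the values by the original names. */
  static Map<String, String> extract(String source, String input) {
    GroupNameRewriter r = rewrite(source);
    Matcher m = Pattern.compile(r.regex).matcher(input);
    Map<String, String> row = new LinkedHashMap<>();
    if (m.find()) {
      for (Map.Entry<String, String> e : r.mapping.entrySet()) {
        row.put(e.getValue(), m.group(e.getKey()));
      }
    }
    return row;
  }
}
```

Under these assumptions, `GroupNameRewriter.extract("(?<username>[^@]+)@(?<domain_name>[^.]+)", "amber@example.com")` would compile the pattern with `domainname` in place of `domain_name` and return the values under the original keys `username` and `domain_name`. Collecting the legal names before generating replacements is what keeps `username` and `username1` unchanged in the example while `user_name` becomes `username2`.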
Exit Criteria
- Proper testing that covers all the edge cases of rewriting; see https://github.com/opensearch-project/sql/pull/4434#issuecomment-3395550765 for reference
- Double-check the debugging flows (e.g. /_explain and the server log) to make sure this does not lead to any confusion
- Performance testing to make sure there is no notable performance degradation
- Update the documentation if the behavior changes (both parse and rex)