Adding parenthesis to the template is required in SUBSTRING SIMILAR for proper evaluation in v4.x+ comparing to v3.x
// 1st of all: sorry for attached images, but without them it is difficult to explain the steps described below
Consider following script (save it in some file.sql):
set heading off;
select '>' || substring('/*notes*/ foo bar' similar '%#"' || '/#*[^#*]*|[^#*]*#*[^/]+#*/' || '#"%' escape '#') || '<' from rdb$database;
This is same as image (colors were added to make it easier to perceive):
On WI-V3.0.13.33806 it issues:
>/*notes*/<
On all recent 4.x ... 6.x (4.0.6.3195; 5.0.3.1647; 6.0.0.735) output is:
>/*notes< (i.e. trailing */ was lost)
Now let's change source example by grouping all between /* and */:
select '>' || substring('/*notes*/ foo bar' similar '%#"' || '/#*([^#*]*|[^#*]*#*[^/]+)#*/' || '#"%' escape '#') || '<' from rdb$database;
Image:
On both FB 3.x and 4.x result is the same (and correct): >/*notes*/<
Following is description of how to evaluate result of substring ... similar to (from SQL std, AFAIU):
<regular expression substring function> is a triadic function,
SUBSTRING, distinguished by the keywords
SIMILAR and UESCAPE. It has three parameters: a source character string,
a pattern string, and an escape
character. It returns a result string extracted from the source
character string by pattern matching using
POSIX-based regular expressions.
—Step 1: The escape character is exactly one character in length. As
indicated in Figure 1, “Operation
of <regular expression substring function>”, the escape character
precedes two instances of <double
quote> that are used to partition the pattern string into three
subpatterns (identified as R1, R2, and
R3).
—Step 2: If the source string S does not satisfy the predicate
'S' SIMILAR TO 'R1' || 'R2' || 'R3'
then the result is the null value.
—
Step 3: Otherwise, S is partitioned into two substrings S1 and S23 such
that S1 is the shortest initial
substring of S such that the following is true:
'S1' SIMILAR TO 'R1' AND
'S23' SIMILAR TO '(' || 'R2' || 'R3' || ')'
—
Step 4: Next, S23 is partitioned into two substrings S2 and S3 such that
S3 is the shortest final sub-
string such that the following is true:
'S2' SIMILAR TO 'R2' AND 'S3' SIMILAR TO 'R3'
The result of the <regular expression substring function> is S2.
I've decided to write SQL query that can show intermediate results that we will get if we had to 'manually' evaluate result of substring / similar. This query has 'root' part for assignment values to:
S(source string); it is'/*notes*/ foo bar'- pattern (it can vary);
- varying integer value to be considered as length of PREFIX (to evaluate S1); hereafter:
prefix_len - varying integer value to be considered as length of SUFFIX (to evaluate S3); hereafter:
suffix_lenFor two cases which are shown above i have two SQLs, see attach, files: - chk-substr-similar-example-2-no_parenth.sql // when pattern has NO group
- chk-substr-similar-example-2-parenthesis.sql // when pattern HAS group substring-similar_-_eval-sql.zip
Both queries has same "root" part which differs only in values of prefix_len and suffix_len:
select '/*notes*/ foo bar' as s -- <<<<<<<<< SOURCE STRING
,'%#"' || '/#*[^#*]*|[^#*]*#*[^/]+#*/' || '#"%' as pattern -- <<<<<<<<< PATTERN
,0 as prefix_len -- for evaluate S1 = left(S, prefix_len) -- <<<<<<<<< PROBE VALUE FOR EVAL. S1
,8 as suffix_len -- for evaluate S3 = right(S23, suffix_len) -- <<<<<<<<< PROBE VALUE FOR EVAL. S3
from rdb$database
vs
select '/*notes*/ foo bar' as s -- <<<<<<<<< SOURCE STRING
,'%#"' || '/#*([^#*]*|[^#*]*#*[^/]+)#*/' || '#"%' as pattern -- <<<<<<<<< PATTERN
,0 as prefix_len -- for evaluate S1 = left(S, prefix_len) -- <<<<<<<<< PROBE VALUE FOR EVAL. S1
,8 as suffix_len -- for evaluate S3 = right(S23, suffix_len) -- <<<<<<<<< PROBE VALUE FOR EVAL. S3
from rdb$database
(only lines with "prefix_len" and "suffix_len" must be changed in order to see different final results)
Note on their difference as image:
If we start to change prefix_len then we will see that result of
'S1' SIMILAR TO 'R1' AND 'S23' SIMILAR TO '(' || 'R2' || 'R3' || ')'
will become True for minimal possible value of prefix_len, i.e. for 0.
This is so because our pattern starts with '%' and empty string DOES match it.
So, we have to find another value: suffix_len -- such that:
if we use as delimiter to split S23 on S2 and S3 then S3 will be the shortest final sub-string
such that the following is true: 'S2' SIMILAR TO 'R2' AND 'S3' SIMILAR TO 'R3'
I've used for suffix_len following values: 0,1,2,3,4,5,6,7 and 8 (i.e. ran query nine times).
For all values from 0,1,2 output of query will be:
(and this is expected)
The same for values 3...7.
For suffix_len = 8 we will get FIRST TIME "true" as result of evaluation "s2 similar to r2":
Following can illustrate how source string (S) was "splitted" at this point:
(green digits from 0 to 7 are eight characters that help to see 'S3' content:
foo bar - note! - with one leading space).
The key point (as i understood algorithm): here parser HAS TO STOP because we need to find S3 or MINIMAL length (and we find it just now: it is 8). And FB 3.x does it. But FB 4.x - no.
Following are results if we change suffix_len to values: 9, 10 and 11:
So, parser in FB 4.x for some reason IGNORES first successful position where S3 was found (i.e. when suffix_len was set to 8), continued with suffix_len = 9 (with encountering fail for rule "Step-4"), then continued for one more iteration (with suffix_len = 10) - and only at this point it 'decided' to stop.
NOTE: this occurred only when pattern has no grouping (i.e. without parenthesis).
I don't know whether such behaviour in FB 4.x is correct or no. Yes, workaround was found (grouping) but i still in doubt should it be so or no.
PS Here are log files in attachment are logs for suffix_len = 8, for FB 3.x and 4.x, when we do use grouping ("parenthesis") and do not: substring-similar_-_eval-logs.zip