firebird icon indicating copy to clipboard operation
firebird copied to clipboard

Adding parenthesis to the template is required in SUBSTRING SIMILAR for proper evaluation in v4.x+ comparing to v3.x

Open pavel-zotov opened this issue 8 months ago • 0 comments

// 1st of all: sorry for attached images, but without them it is difficult to explain the steps described below

Consider following script (save it in some file.sql):

set heading off;
select '>' || substring('/*notes*/ foo bar' similar '%#"' || '/#*[^#*]*|[^#*]*#*[^/]+#*/' || '#"%' escape '#') || '<' from rdb$database;

This is same as image (colors were added to make it easier to perceive): Image

On WI-V3.0.13.33806 it issues: >/*notes*/< On all recent 4.x ... 6.x (4.0.6.3195; 5.0.3.1647; 6.0.0.735) output is: >/*notes< (i.e. trailing */ was lost)

Now let's change source example by grouping all between /* and */: select '>' || substring('/*notes*/ foo bar' similar '%#"' || '/#*([^#*]*|[^#*]*#*[^/]+)#*/' || '#"%' escape '#') || '<' from rdb$database; Image: Image

On both FB 3.x and 4.x result is the same (and correct): >/*notes*/<

Following is description of how to evaluate result of substring ... similar to (from SQL std, AFAIU):

<regular expression substring function> is a triadic function,
SUBSTRING, distinguished by the keywords
SIMILAR and UESCAPE. It has three parameters: a source character string,
a pattern string, and an escape
character. It returns a result string extracted from the source
character string by pattern matching using
POSIX-based regular expressions.
—Step 1: The escape character is exactly one character in length. As
indicated in Figure 1, “Operation
of <regular expression substring function>”, the escape character
precedes two instances of <double
quote> that are used to partition the pattern string into three
subpatterns (identified as R1, R2, and
R3).
—Step 2: If the source string S does not satisfy the predicate
'S' SIMILAR TO 'R1' || 'R2' || 'R3'
then the result is the null value.
—
Step 3: Otherwise, S is partitioned into two substrings S1 and S23 such
that S1 is the shortest initial
substring of S such that the following is true:
'S1' SIMILAR TO 'R1' AND
'S23' SIMILAR TO '(' || 'R2' || 'R3' || ')'
—
Step 4: Next, S23 is partitioned into two substrings S2 and S3 such that
S3 is the shortest final sub-
string such that the following is true:
'S2' SIMILAR TO 'R2' AND 'S3' SIMILAR TO 'R3'
The result of the <regular expression substring function> is S2.

I've decided to write SQL query that can show intermediate results that we will get if we had to 'manually' evaluate result of substring / similar. This query has 'root' part for assignment values to:

  • S (source string); it is '/*notes*/ foo bar'
  • pattern (it can vary);
  • varying integer value to be considered as length of PREFIX (to evaluate S1); hereafter: prefix_len
  • varying integer value to be considered as length of SUFFIX (to evaluate S3); hereafter: suffix_len For two cases which are shown above i have two SQLs, see attach, files:
  • chk-substr-similar-example-2-no_parenth.sql // when pattern has NO group
  • chk-substr-similar-example-2-parenthesis.sql // when pattern HAS group substring-similar_-_eval-sql.zip

Both queries has same "root" part which differs only in values of prefix_len and suffix_len:

   select '/*notes*/ foo bar' as s                                   -- <<<<<<<<< SOURCE STRING
       ,'%#"' || '/#*[^#*]*|[^#*]*#*[^/]+#*/' || '#"%' as pattern    -- <<<<<<<<< PATTERN
       ,0 as prefix_len -- for evaluate S1 = left(S, prefix_len)     -- <<<<<<<<< PROBE VALUE FOR EVAL. S1
       ,8  as suffix_len -- for evaluate S3 = right(S23, suffix_len) -- <<<<<<<<< PROBE VALUE FOR EVAL. S3
   from rdb$database

vs

   select '/*notes*/ foo bar' as s                                   -- <<<<<<<<< SOURCE STRING
       ,'%#"' || '/#*([^#*]*|[^#*]*#*[^/]+)#*/' || '#"%' as pattern  -- <<<<<<<<< PATTERN
       ,0 as prefix_len -- for evaluate S1 = left(S, prefix_len)     -- <<<<<<<<< PROBE VALUE FOR EVAL. S1
       ,8  as suffix_len -- for evaluate S3 = right(S23, suffix_len) -- <<<<<<<<< PROBE VALUE FOR EVAL. S3
   from rdb$database

(only lines with "prefix_len" and "suffix_len" must be changed in order to see different final results) Note on their difference as image: Image

If we start to change prefix_len then we will see that result of 'S1' SIMILAR TO 'R1' AND 'S23' SIMILAR TO '(' || 'R2' || 'R3' || ')' will become True for minimal possible value of prefix_len, i.e. for 0. This is so because our pattern starts with '%' and empty string DOES match it.

So, we have to find another value: suffix_len -- such that: if we use as delimiter to split S23 on S2 and S3 then S3 will be the shortest final sub-string such that the following is true: 'S2' SIMILAR TO 'R2' AND 'S3' SIMILAR TO 'R3'

I've used for suffix_len following values: 0,1,2,3,4,5,6,7 and 8 (i.e. ran query nine times). For all values from 0,1,2 output of query will be: Image (and this is expected) The same for values 3...7.

For suffix_len = 8 we will get FIRST TIME "true" as result of evaluation "s2 similar to r2": Image

Following can illustrate how source string (S) was "splitted" at this point: Image (green digits from 0 to 7 are eight characters that help to see 'S3' content: foo bar - note! - with one leading space).

The key point (as i understood algorithm): here parser HAS TO STOP because we need to find S3 or MINIMAL length (and we find it just now: it is 8). And FB 3.x does it. But FB 4.x - no.

Following are results if we change suffix_len to values: 9, 10 and 11: Image

So, parser in FB 4.x for some reason IGNORES first successful position where S3 was found (i.e. when suffix_len was set to 8), continued with suffix_len = 9 (with encountering fail for rule "Step-4"), then continued for one more iteration (with suffix_len = 10) - and only at this point it 'decided' to stop.

NOTE: this occurred only when pattern has no grouping (i.e. without parenthesis).

I don't know whether such behaviour in FB 4.x is correct or no. Yes, workaround was found (grouping) but i still in doubt should it be so or no.

PS Here are log files in attachment are logs for suffix_len = 8, for FB 3.x and 4.x, when we do use grouping ("parenthesis") and do not: substring-similar_-_eval-logs.zip

pavel-zotov avatar Apr 16 '25 19:04 pavel-zotov