libfsm icon indicating copy to clipboard operation
libfsm copied to clipboard

Fixes for rewrite_repeat

Open hvdijk opened this issue 5 years ago • 9 comments

rewrite_repeat, used for rewriting nested repeats contained a comment that said repeats can only be combined if the range of the result is not more than the sum of the ranges of the inputs, but the assertion actually tested the opposite: it tested that the range of the result is more than the sum of the ranges of the inputs. This assertion does not universally hold either way: a{2}{2} can be rewritten to a{4}, but the range as calculated of the latter (0) is equal to the sum of the ranges of the inputs (also 0).

Separate from the assert, the rewrite was not performed in some cases where it is valid to do so. It is valid whenever the inner repeat has a lower bound of 0 or 1, whenever the inner repeat has no upper bound and the outer repeat has a positive lower bound, and whenever the outer range has a lower bound equal to the upper bound. The last case was missing. An example is a{2,3}{2}, which would previously be preserved in that form, but is now rewritten as a{4,6}.

hvdijk avatar Sep 10 '20 13:09 hvdijk

Thinking about this a bit more, it occurs to me that there are more cases that weren't handled, and aren't handled by my PR. a{2,3}{2,3} can be rewritten to a{4,9}. I am not sure just yet how to improve the check to handle this.

hvdijk avatar Sep 10 '20 15:09 hvdijk

Thinking about this a bit more, it occurs to me that there are more cases that weren't handled, and aren't handled by my PR. a{2,3}{2,3} can be rewritten to a{4,9}. I am not sure just yet how to improve the check to handle this.

I have updated the PR to handle this and included a comment in the code explaining when and why the transformation is valid.

hvdijk avatar Sep 10 '20 18:09 hvdijk

Hm thank you, I'll try to re-think some of this, too. Maybe we can find a few more situations to cover with tests, although I did think I had covered all combinations there :/

katef avatar Sep 10 '20 23:09 katef

To be pedantic, the PCRE (v2) syntax does not allow a{2,3}{2,3}. pcregrep returns an error:

% echo 'aaaaa' | pcregrep 'a{2,3}{2,3}'
pcregrep: Error in command-line regex at offset 10: nothing to repeat

The PCRE spec doesn't clearly state the grammar, so it took a bit of looking, but the section on repetition is quite clear:

Repetition is specified by quantifiers, which can follow any of the following items: a literal data character the dot metacharacter the \C escape sequence the \R escape sequence the \X escape sequence an escape such as \d or \pL that matches a single character a character class a backreference a parenthesized group (including lookaround assertions) a subroutine call (recursive or otherwise)

The question becomes... does libfsm want to support this as an addition?

sfstewman avatar Sep 11 '20 01:09 sfstewman

To be pedantic, the PCRE (v2) syntax does not allow a{2,3}{2,3}. pcregrep returns an error:

% echo 'aaaaa' | pcregrep 'a{2,3}{2,3}'
pcregrep: Error in command-line regex at offset 10: nothing to repeat

The PCRE spec doesn't clearly state the grammar, so it took a bit of looking, but the section on repetition is quite clear:

Repetition is specified by quantifiers, which can follow any of the following items: a literal data character the dot metacharacter the \C escape sequence the \R escape sequence the \X escape sequence an escape such as \d or \pL that matches a single character a character class a backreference a parenthesized group (including lookaround assertions) a subroutine call (recursive or otherwise)

The question becomes... does libfsm want to support this as an addition?

Not sure why I was thinking that this applied to PCRE. Using a{i,j}{k,l} to denote nested repeats in the AST tree makes perfect sense. Apologies.

sfstewman avatar Sep 11 '20 02:09 sfstewman

That's fine! Nothing to apologise about!!

katef avatar Sep 11 '20 05:09 katef

Yeah, that's fine, that's a good observation to make. It still needs to be handled in the AST, but perhaps all the existing a{...}{...} PCRE tests and the new one in this PR should be updated to (?:a{...}){...}, and the parser updated to not accept a{...}{...}?

PCRE does allow + and ? to follow another quantifier, but like in Perl, they behave specially, they do not mean the same thing as + or ? applied to the subexpression. In Perl/PCRE, a*+a can never match, because + makes the * greedy. In libfsm's PCRE mode, a*+a just means (?:a*)+a. Accepting the same syntax as PCRE but with different semantics seems risky.

hvdijk avatar Sep 11 '20 07:09 hvdijk

A brute force search found one edge case that would be mishandled if it hadn't already been handled by earlier code: it is invalid to rewrite a{0,0}{x,} or a{x,}{0,0} as a{0,}. This is not a problem as even if somehow this gets constructed in the AST as AST_EXPR_REPEAT nodes, the code already special cases upper bounds of zero. The problem is not actually in the logic I changed, it is in the multiplication that happens earlier:

if (i == AST_COUNT_UNBOUNDED || k == AST_COUNT_UNBOUNDED) {
	w = AST_COUNT_UNBOUNDED;
} else if (__builtin_umul_overflow(i, k, &w)) {
	return 1;
}

This effectively treats AST_COUNT_UNBOUNDED * 0 and 0 * AST_COUNT_UNBOUNDED as AST_COUNT_UNBOUNDED, but it should be treated as zero. I used the same logic in my brute force search.

This could be changed to

if (__builtin_umul_overflow(i, k, &w)) {
	if (i == AST_COUNT_UNBOUNDED || k == AST_COUNT_UNBOUNDED) {
		w = AST_COUNT_UNBOUNDED;
	} else {
		return 1;
	}
}

to fix this, but as it can never come up, and the rewritten logic is harder to understand, I do not know whether it is worth the effort.

The brute force search found no cases that my check fails to handle. I searched all possibilities for upper bounds up to 10, plus AST_COUNT_UNBOUNDED, like so:

for h in $(seq 0 10); do
for i in $(seq $h 10) ""; do
for j in $(seq 0 10); do
for k in $(seq $j 10) ""; do
    re1="a{$h,$i}{$j,$k}"
    re2="a{$((h*j)),${i:+${k:+$((i*k))}}}"
    ex=$((!(j == ${k:-99} || ${i:-99} == 99 && j > 0 || h * (j + 1) <= i * j + 1)))
    build/bin/re -pb -rpcre "$re1" >fsm1.fsm
    build/bin/re -pb -rpcre "$re2" >fsm2.fsm
    build/bin/fsm -t equal fsm1.fsm fsm2.fsm
    test $? -eq $ex || echo "$re1 $re2"
done
done
done
done

hvdijk avatar Sep 11 '20 07:09 hvdijk

Note to myself; I'm just waiting on thinking a bit about some of the AST behaviour here before I go ahead and merge this.

katef avatar Sep 17 '20 22:09 katef