TRegExpr
FindRepeated and Unicode / may break OP_STAR/PLUS/...
I have not further analysed this...
FindRepeated (for Unicode) calls IncUnicode2, which may (for surrogates) increment by 2. For the OPs that can match a surrogate this will be a problem.
OP_STAR/.... in MatchPrim will iterate the returned range in steps of one ReChar (codeunit): regInput := save + no;
Also, the result of FindRepeated may be a count of:
- codeunits for OP_ANY (counting a surrogate as 2)
- "Chars"/full-codepoints for any of the OP_NOT... (counting a surrogate as 1)
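To make the two counting conventions concrete, here is a minimal sketch in Python that models the text as a list of UTF-16 code units. It is not TRegExpr code; the helper names (`count_code_units`, `count_code_points`) are made up for illustration.

```python
# Model of UTF-16 text as a list of code units (what TRegExpr calls ReChar).
# All names here are hypothetical; nothing below is taken from TRegExpr.

def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF

def count_code_units(units):
    # Counts a surrogate pair as 2.
    return len(units)

def count_code_points(units):
    # Counts a well-formed surrogate pair as 1, everything else as 1 per unit.
    count = 0
    i = 0
    while i < len(units):
        if (is_high_surrogate(units[i]) and i + 1 < len(units)
                and is_low_surrogate(units[i + 1])):
            i += 2
        else:
            i += 1
        count += 1
    return count

text = [0xD800, 0xDC00]  # U+10000 as a surrogate pair
print(count_code_units(text), count_code_points(text))
```

If the caller then steps through the range one code unit at a time, only the first convention gives it a consistent count.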
One failing case I can think of is the pattern (.+). :
- if the last char in the text is a surrogate, then the capture matches half a surrogate
- if the text is exactly one char, and that is a surrogate, then it incorrectly matches. The pattern needs 2 chars, and takes each half of the surrogate as a full char.
OP_STAR goes back half the surrogate, and then OP_ANY does not check that it matches the 2nd part of a surrogate.
This may be fixable (but I have not tested):
- OP_STAR... in MatchPrim must check whether regInput := save + no; points to the 2nd part of a surrogate.
- FindRepeated must always return the number of codeunits (ReChars), i.e. always count a surrogate as 2.
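The proposed check can be sketched as a small Python model of the backtracking loop. This is an untested idea, not the TRegExpr implementation: `backtrack_positions` and `matches_dot_plus_dot` are hypothetical names, and the model assumes the repeat count is given in code units.

```python
# Hypothetical model of greedy backtracking over UTF-16 code units.
# `save` and `no` mirror the names in "regInput := save + no".

def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF

def splits_pair(units, pos):
    # True if pos lies between the two halves of a surrogate pair.
    return (0 < pos < len(units)
            and is_high_surrogate(units[pos - 1])
            and is_low_surrogate(units[pos]))

def backtrack_positions(units, save, max_no, surrogate_aware):
    # Candidate positions save + no, for no = max_no down to 1 (greedy "+").
    positions = []
    for no in range(max_no, 0, -1):
        pos = save + no
        if surrogate_aware and splits_pair(units, pos):
            continue  # never resume in the middle of a surrogate pair
        positions.append(pos)
    return positions

def matches_dot_plus_dot(units, surrogate_aware):
    # ".+." matches if, after some backtrack position, input remains for ".".
    candidates = backtrack_positions(units, 0, len(units), surrogate_aware)
    return any(pos < len(units) for pos in candidates)

text = [0xD800, 0xDC00]  # U+10000
print(matches_dot_plus_dot(text, surrogate_aware=False))  # True (the bug)
print(matches_dot_plus_dot(text, surrogate_aware=True))   # False
```

Without the check, the loop resumes at position 1 and the trailing `.` consumes the lone low surrogate. With it, that position is skipped and the match fails as expected.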
In what case (RE, text) does the engine currently fail?
I only deduced this from code review. But for U+10000 (https://www.compart.com/de/unicode/U+10000), the test
IsNotMatching('surrogat', '.+.', #$D800#$DC00);
fails (it will match).
This is one char, so the .+ should entirely consume it and leave nothing for the trailing . to match.
Btw, the same issue exists with combining codepoints.
On https://regex101.com/ not all regex flavors handle this either (Python, GoLang and Java seem to).
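For comparison, the surrogate case can be checked against Python's standard `re` module, which works on code points rather than UTF-16 code units. This only shows Python's behavior; it says nothing about how TRegExpr should implement it.

```python
import re

# U+10000 is one code point, but two UTF-16 code units (#$D800#$DC00).
text = "\U00010000"
utf16_units = len(text.encode("utf-16-le")) // 2

# ".+." needs at least two characters, so it must not match a single one.
no_match = re.search(r".+.", text) is None

# A base letter plus a combining mark is two code points, so ".+." does
# match here: Python's re treats each code point as a character.
combining = "e\u0301"
combining_match = re.search(r".+.", combining) is not None

print(utf16_units, no_match, combining_match)  # 2 True True
```

So Python's `re` avoids the surrogate problem, but it still splits a combining sequence between `.+` and `.`.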