TRegExpr FindRepeated and Unicode / may break OP

FindRepeated and Unicode / may break OP_STAR/PLUS/...

Open User4martin opened this issue 1 year ago • 2 comments

I have not analysed this any further...

FindRepeated (for unicode) calls IncUnicode2 which may (for surrogates) increment by 2. For the OPs that can match a surrogate this will be a problem.

OP_STAR/.... in MatchPrim will iterate the returned range in steps of one ReChar (codeunit): regInput := save + no;

Also, the result of FindRepeated may be counted in different units:

- codeunits for OP_ANY (counting a surrogate as 2)
- "Chars"/full codepoints for any of the OP_NOT... (counting a surrogate as 1)

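To make the two counting units concrete, here is a small Python model. It is not TRegExpr code, and the helper names (to_utf16, count_code_units, count_code_points) are made up for illustration. It treats the text as a list of UTF-16 code units, like the ReChar (codeunit) view above, and counts the same text both ways.

```python
def to_utf16(text):
    """Encode a Python string as a list of UTF-16 code units (ints)."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

def is_high(u):
    return 0xD800 <= u <= 0xDBFF  # first half of a surrogate pair

def is_low(u):
    return 0xDC00 <= u <= 0xDFFF  # second half of a surrogate pair

def count_code_units(units):
    """Count every code unit, so a surrogate pair counts as 2."""
    return len(units)

def count_code_points(units):
    """Count a well-formed surrogate pair as 1."""
    n = i = 0
    while i < len(units):
        if is_high(units[i]) and i + 1 < len(units) and is_low(units[i + 1]):
            i += 2  # skip both halves of the pair
        else:
            i += 1
        n += 1
    return n

units = to_utf16("a\U00010000")   # 'a' followed by U+10000
print(count_code_units(units))    # 3
print(count_code_points(units))   # 2
```

If one caller expects the first number and gets the second (or the other way round), any position computed from it can be off by half a surrogate.
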
One way I can think of to trigger this is a pattern like (.+). (the trailing dot is part of the pattern):

- If the last char in the text is a surrogate, then the capture ends with half a surrogate.
- If the text is exactly one char, and that char is a surrogate, then the pattern incorrectly matches. It needs 2 chars, and takes each half of the surrogate as a full char.

OP_STAR backs up by half the surrogate, and then OP_ANY does not check whether it is matching the 2nd part of a surrogate.
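
The failure mode can be modelled outside the engine. The Python sketch below is a simplified model of the behaviour described above, not the actual MatchPrim code. The names match_any and greedy_any_then_any are hypothetical, and the model assumes that the greedy .+ first takes everything (counted in code units) and then backs off one code unit at a time, mirroring regInput := save + no.

```python
def is_high(u):
    return 0xD800 <= u <= 0xDBFF

def is_low(u):
    return 0xDC00 <= u <= 0xDFFF

def match_any(units, pos):
    """Model of a single '.': returns the position after the match, or None.
    It does NOT check whether pos is in the middle of a surrogate pair."""
    if pos >= len(units):
        return None
    if is_high(units[pos]) and pos + 1 < len(units) and is_low(units[pos + 1]):
        return pos + 2
    return pos + 1

def greedy_any_then_any(units):
    """Model of '.+.' tried at position 0: returns (no, end) or None."""
    save = 0
    max_no = len(units)              # '.+' counted in code units
    for no in range(max_no, 0, -1):  # '+' needs at least one repetition
        pos = save + no              # steps back one code unit at a time
        end = match_any(units, pos)
        if end is not None:
            return (no, end)
    return None

# One char (U+10000), two code units: should not match, but the model does.
print(greedy_any_then_any([0xD800, 0xDC00]))        # (1, 2)
# 'a' + U+10000: '.+' ends up holding 'a' plus half a surrogate.
print(greedy_any_then_any([0x61, 0xD800, 0xDC00]))  # (2, 3)
```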

This may be fixable (but I have not tested it):

- OP_STAR... in MatchPrim must check whether regInput := save + no; points to the 2nd part of a surrogate.
- FindRepeated must always return the number of codeunits (ReChars), i.e. always count a surrogate as 2.
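
As a rough illustration of the first point, here is the same kind of Python model with a boundary check added. Again this is only a sketch under assumptions, not a patch for TRegExpr: splits_pair and checked_any_then_any are hypothetical names, and the repeat count is assumed to be in code units, as the second point asks for.

```python
def is_high(u):
    return 0xD800 <= u <= 0xDBFF

def is_low(u):
    return 0xDC00 <= u <= 0xDFFF

def match_any(units, pos):
    """Model of a single '.': returns the position after the match, or None."""
    if pos >= len(units):
        return None
    if is_high(units[pos]) and pos + 1 < len(units) and is_low(units[pos + 1]):
        return pos + 2
    return pos + 1

def splits_pair(units, pos):
    """True if pos points at the 2nd part of a surrogate pair."""
    return 0 < pos < len(units) and is_high(units[pos - 1]) and is_low(units[pos])

def checked_any_then_any(units):
    """Model of '.+.' at position 0 that never resumes mid-pair."""
    save = 0
    max_no = len(units)              # always counted in code units
    for no in range(max_no, 0, -1):
        pos = save + no
        if splits_pair(units, pos):
            continue                 # skip positions inside a surrogate pair
        end = match_any(units, pos)
        if end is not None:
            return (no, end)
    return None

print(checked_any_then_any([0xD800, 0xDC00]))        # None
print(checked_any_then_any([0x61, 0xD800, 0xDC00]))  # (1, 3)
```

With the check in place the single surrogate char no longer matches, and for 'a' followed by the surrogate the repeat keeps only 'a' while the trailing dot takes the whole pair.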

Jan 07 '24 13:01 User4martin

In what case (RE, text) does the engine currently fail?

Jan 07 '24 13:01 Alexey-T

I only deduced this from code review. But take https://www.compart.com/de/unicode/U+10000:

IsNotMatching('surrogat', '.+.', #$D800#$DC00); fails (it will match).

This is one char, so the .+ should consume it entirely and leave nothing for the extra '.'.

Btw, same issue with combining codepoints.

On https://regex101.com/ not all regex flavors handle this either (Python, GoLang and Java seem to).
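
For comparison, the surrogate case can be checked directly in Python, whose re module works on code points rather than UTF-16 code units. This only shows Python's behaviour and says nothing about how TRegExpr should be fixed. Note that Python's '.' still treats a combining sequence as separate code points.

```python
import re

astral = "\U00010000"   # one code point, two UTF-16 code units
print(len(astral))                                  # 1
print(re.search(r".+.", astral))                    # None
print(re.search(r"(.+).", "a" + astral).group(1))   # a

combining = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT, two code points
print(len(combining))                               # 2
print(re.search(r".+.", combining) is not None)     # True
```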

Jan 07 '24 13:01 User4martin
