TRegExpr

FindRepeated and Unicode / may break OP_STAR/PLUS/...

Open User4martin opened this issue 1 year ago • 2 comments

I have not further analysed this...

FindRepeated (for unicode) calls IncUnicode2 which may (for surrogates) increment by 2. For the OPs that can match a surrogate this will be a problem.

OP_STAR/.... in MatchPrim will iterate the returned range in steps of one ReChar (codeunit): regInput := save + no;

Also, the result of FindRepeated may be either:

  • codeunits for OP_ANY (counting a surrogate as 2)
  • "Chars"/full-codepoints for any of the OP_NOT... (counting a surrogate as 1)
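To make the two counting conventions concrete, here is a small Python model (not TRegExpr code; the helper names are made up for illustration). It assumes the input is held as UTF-16 code units, which is what the issue calls ReChars.

```python
def utf16_units(text: str) -> list[int]:
    """Return the UTF-16 code units of text as integers."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def is_high_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def count_code_units(units: list[int]) -> int:
    """Count every code unit: a surrogate pair counts as 2."""
    return len(units)


def count_code_points(units: list[int]) -> int:
    """Count full code points: a well-formed surrogate pair counts as 1."""
    count = 0
    i = 0
    while i < len(units):
        paired = (
            is_high_surrogate(units[i])
            and i + 1 < len(units)
            and is_low_surrogate(units[i + 1])
        )
        i += 2 if paired else 1
        count += 1
    return count


units = utf16_units("a\U00010000")   # 'a' followed by U+10000
print([hex(u) for u in units])       # ['0x61', '0xd800', '0xdc00']
print(count_code_units(units))       # 3
print(count_code_points(units))      # 2
```

If a caller walks back through the matched range one code unit at a time, only the first count is a safe step count; the second one under-counts whenever a pair is present.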

One case I can think of is the pattern (.+). (including the trailing dot):

  • if the last char in the text is a surrogate, then the capture matches half a surrogate
  • if the text is exactly one char, and that is a surrogate, then it incorrectly matches. It needs 2 chars, and takes each half of the surrogate as a full char.

OP_STAR goes back half the surrogate, and then OP_ANY does not check that it matches the 2nd part of a surrogate
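The backtracking described above can be modelled roughly as follows. This is a Python toy, not the real MatchPrim: it only handles the pattern (.+). anchored at position 0, and the function names are invented. The names save, no and reg_input mirror the identifiers quoted from the Pascal source.

```python
def is_high(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def is_low(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def find_repeated_any(units: list[int], pos: int) -> int:
    """Greedy '.': step by 2 over a surrogate pair, return the code-unit count."""
    start = pos
    while pos < len(units):
        if is_high(units[pos]) and pos + 1 < len(units) and is_low(units[pos + 1]):
            pos += 2
        else:
            pos += 1
    return pos - start


def naive_plus_then_any(units: list[int]):
    """Model of '(.+).' at position 0.

    Returns the capture as (start, end) in code units, or None if no match.
    """
    save = 0
    no = find_repeated_any(units, save)
    while no >= 1:
        reg_input = save + no        # backs off ONE code unit per iteration
        if reg_input < len(units):   # trailing '.' accepts any single code unit
            return (save, reg_input)
        no -= 1
    return None


# Text is exactly one char, U+10000: matches although two chars are required.
print(naive_plus_then_any([0xD800, 0xDC00]))        # (0, 1)
# Last char is a surrogate pair: the capture ends in the middle of the pair.
print(naive_plus_then_any([0x61, 0xD800, 0xDC00]))  # (0, 2)
# A single BMP char correctly fails.
print(naive_plus_then_any([0x61]))                  # None
```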


This may be fixable (but I have not tested)

  • OP_STAR... in MatchPrim must check whether regInput := save + no; points to the 2nd part of a surrogate pair (and skip that position if it does)
  • FindRepeated must always return the number of codeunits (ReChars), i.e. always count a surrogate pair as 2.
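A rough sketch of what those two points could look like together, again as an untested Python model rather than a patch. The generator name backtrack_positions and the min_units parameter are hypothetical. It assumes FindRepeated already returns code units, and it only covers minimum repeat counts of 0 or 1.

```python
def is_high(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def is_low(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def backtrack_positions(units: list[int], save: int, no: int, min_units: int = 0):
    """Yield candidate positions save+no, save+no-1, ... down to save+min_units.

    Positions that point at the 2nd half of a surrogate pair are skipped,
    so the rest of the pattern is never tried in the middle of a pair.
    """
    for n in range(no, min_units - 1, -1):
        pos = save + n
        mid_pair = (
            0 < pos < len(units)
            and is_low(units[pos])
            and is_high(units[pos - 1])
        )
        if not mid_pair:
            yield pos


# U+10000 alone: only the end of input remains as a candidate for '.+'.
print(list(backtrack_positions([0xD800, 0xDC00], 0, 2, min_units=1)))        # [2]
# 'a' + U+10000: position 2 (inside the pair) is skipped.
print(list(backtrack_positions([0x61, 0xD800, 0xDC00], 0, 3, min_units=1)))  # [3, 1]
```

With only position 2 left for the first input, the trailing . of (.+). has nothing to consume, so the pattern would no longer match a lone U+10000.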

User4martin avatar Jan 07 '24 13:01 User4martin

On what case (RE, text) does engine fail currently?

Alexey-T avatar Jan 07 '24 13:01 Alexey-T

I only deduced this from code review. But see https://www.compart.com/de/unicode/U+10000

IsNotMatching('surrogat', '.+.', #$D800#$DC00); fails (it will match).

This is one char, so the .+ should consume it entirely and leave nothing for the trailing . in the pattern.

Btw, same issue with combining codepoints.


On https://regex101.com/ not all regex flavors handle this either (Python, GoLang and Java seem to).
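For comparison, this is how the same test behaves in Python 3, whose str type is indexed by code point rather than by UTF-16 code unit. It only shows Python's behaviour and says nothing about how TRegExpr should be implemented.

```python
import re

text = "\U00010000"                   # one code point outside the BMP
print(len(text))                      # 1
print(re.search(r".+.", text))        # None: '.+.' needs at least two chars

# Combining sequences are a separate matter: 'e' + U+0301 is two code points,
# so the same pattern does match here even though it renders as one character.
combining = "e\u0301"
print(len(combining))                               # 2
print(re.fullmatch(r".+.", combining) is not None)  # True
```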

User4martin avatar Jan 07 '24 13:01 User4martin