Invalid UTF-8 and dot operator

Open benibela opened this issue 9 years ago • 4 comments

. does not match invalid UTF-8:

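program InvalidUtf8Dot;
{$ifdef FPC}{$mode objfpc}{$H+}{$endif}
uses FLRE; // assumes the FLRE unit is on the unit search path

var
  f: TFLRE;
  x: UTF8String;
  ext: TFLREMultiStrings;
begin
  x := 'abc';
  x[2] := #$e8; // overwrite 'b' with a lone lead byte, so x is no longer valid UTF-8
  f := TFLRE.Create('.+', [rfUTF8]);
  try
    writeln(f.UTF8ExtractAll(x, ext)); // whether anything matched
    writeln(ext[0, 0]);                // text of the first match
    writeln(length(ext[0, 0]));        // its length in bytes
  finally
    f.Free;
  end;
end.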

That is surprising; after all, . is normally used to match everything.

When handling user-created files, there can always be some invalid characters or even bit errors.

benibela avatar May 11 '16 19:05 benibela

Now I understand the problem. If . could match any single byte and the pattern is >..<, it would match >ä<, because each . would match one of the two bytes that encode the single character ä.
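
To make the byte-level argument concrete, here is a minimal sketch in plain Free Pascal. It does not use FLRE at all; MatchesByteWise is a hypothetical helper that stands in for a byte-wise interpretation of >..<, and the input is given as raw bytes to avoid any dependence on source-file encoding.

```pascal
program ByteWiseDotSketch;
{$ifdef FPC}{$mode objfpc}{$H+}{$endif}

const
  // UTF-8 bytes of ">ä<": '>' = $3E, 'ä' = $C3 $A4, '<' = $3C
  Data: array[0..3] of Byte = ($3E, $C3, $A4, $3C);

// Hypothetical byte-wise check for the pattern >..< :
// '>' followed by exactly two arbitrary bytes, followed by '<'.
function MatchesByteWise(const s: array of Byte): Boolean;
begin
  Result := (Length(s) = 4) and (s[0] = $3E) and (s[3] = $3C);
end;

begin
  // Only one character sits between '>' and '<',
  // yet the byte-wise check succeeds because ä occupies two bytes.
  WriteLn(MatchesByteWise(Data));
end.
```

The program prints TRUE, which is exactly the false positive a code-point-wise . avoids.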

However, if . matched [ASCII or start byte][continuation byte]*, that is, one byte that is ASCII or a start byte followed by any number of bytes that are neither, it would still work and would handle invalid UTF-8 as well.
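
A minimal sketch of that grouping rule in plain Free Pascal, independent of FLRE: IsContinuation and UnitLength are hypothetical names, and the input bytes are only an illustration. As an additional assumption beyond the rule stated above, a stray continuation byte at the very start of the input is treated as beginning its own unit.

```pascal
program DotUnitsSketch;
{$ifdef FPC}{$mode objfpc}{$H+}{$endif}

const
  // 'a', a lone lead byte $E8 (invalid on its own), 'c',
  // then a valid two-byte sequence $C3 $A4 (the character ä).
  Data: array[0..4] of Byte = ($61, $E8, $63, $C3, $A4);

// True for bytes of the form 10xxxxxx, i.e. UTF-8 continuation bytes.
function IsContinuation(b: Byte): Boolean;
begin
  Result := (b and $C0) = $80;
end;

// Hypothetical helper: number of bytes in the unit starting at index i,
// i.e. one byte plus all continuation bytes that directly follow it.
function UnitLength(const s: array of Byte; i: Integer): Integer;
var
  j: Integer;
begin
  j := i + 1;
  while (j <= High(s)) and IsContinuation(s[j]) do
    Inc(j);
  Result := j - i;
end;

var
  i, n: Integer;
begin
  i := 0;
  while i <= High(Data) do
  begin
    n := UnitLength(Data, i);
    WriteLn('unit at byte ', i, ', length ', n);
    Inc(i, n);
  end;
end.
```

For this input the loop reports units of length 1, 1, 1 and 2 at byte offsets 0, 1, 2 and 3: the lone $E8 forms a unit of its own instead of blocking the match, and the two bytes of ä stay together.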

benibela avatar May 17 '16 17:05 benibela

And since the expansion is simpler, it is probably faster, too.

benibela avatar May 17 '16 17:05 benibela

Here FLRE simply follows what Google's RE2 (and many other UTF-8-capable regex engines) do for . in UTF-8 mode. TL;DR: in UTF-8 mode, . always matches code-point-wise, not code-unit-wise/byte-wise.

But I think I'll add an extra "all-byte-match" token that always matches any byte/code unit, in both UTF-8 and non-UTF-8 mode.

BeRo1985 avatar Jun 14 '16 13:06 BeRo1985

> But I think I'll add an extra "all-byte-match" token that always matches any byte/code unit, in both UTF-8 and non-UTF-8 mode.

Or special byte-level character classes, so that start bytes, continuation bytes, or anything else can be matched separately.

benibela avatar Jun 16 '16 09:06 benibela