Invalid UTF-8 and dot operator
The dot (.) does not match invalid UTF-8:
program DotInvalidUTF8;
uses FLRE;
var
  f: TFLRE;
  x: UTF8String;
  ext: TFLREMultiStrings;
begin
  x := 'abc';
  x[2] := #$e8; // lone start byte, so x is no longer valid UTF-8
  f := TFLRE.Create('.+', [rfUTF8]);
  writeln(f.UTF8ExtractAll(x, ext));
  writeln(ext[0, 0]);
  writeln(length(ext[0, 0]));
  f.Free;
end.
That is surprising; after all, . is used to match everything.
When handling user-created files, there can always be some invalid characters or even bit errors.
Now I understand the problem. If . could match any byte and you have >..<, it could match >ä<, since each . would match one of the two bytes that encode the single character ä.
However, if . matched [ASCII or start byte][continuation byte]*, it would still work and would handle invalid UTF-8 as well.
And since that expansion is simpler, it is probably faster, too.
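To make the proposed expansion concrete, here is a minimal sketch in plain Free Pascal. It is not FLRE code, and the names IsContinuation and CountUnits are made up for illustration. It counts the units a lenient . would step over: one byte, followed by any number of continuation bytes (10xxxxxx).

```pascal
program LenientDot;
{$ifdef FPC}{$mode objfpc}{$H+}{$endif}

// Hypothetical helper: true for UTF-8 continuation bytes (10xxxxxx).
function IsContinuation(c: AnsiChar): Boolean;
begin
  Result := (Ord(c) and $C0) = $80;
end;

// Hypothetical helper: number of units a lenient "." would step over.
function CountUnits(const s: UTF8String): Integer;
var
  i: Integer;
begin
  Result := 0;
  i := 1;
  while i <= Length(s) do
  begin
    Inc(i); // the ASCII or start byte (only at position 1 can this be a stray continuation byte)
    while (i <= Length(s)) and IsContinuation(s[i]) do
      Inc(i); // absorb the continuation bytes that follow
    Inc(Result);
  end;
end;

var
  s: UTF8String;
begin
  s := 'a__c';
  s[2] := #$C3; // $C3 $A4 is the UTF-8 encoding of ä
  s[3] := #$A4;
  writeln(CountUnits(s)); // 3 units: a, ä, c

  s := 'abc';
  s[2] := #$E8; // lone start byte, as in the report above
  writeln(CountUnits(s)); // 3 units: a, $E8, c
end.
```

With this rule >..< cannot match >ä<, because both bytes of ä fall into one unit, while the lone $E8 still counts as a unit of its own.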
FLRE simply follows the behavior of Google's RE2 (and many other UTF-8-capable regex engines) for . in UTF-8 mode. TL;DR: in UTF-8 mode, . always matches codepoint-wise, not codeunit-wise/byte-wise.
But I think I'll add an extra "all-byte-match" token that always matches any single byte/codeunit, in both UTF-8 and non-UTF-8 mode.
Or special byte-level character classes, so that start bytes, continuation bytes, or anything else can be matched separately.
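For reference, a rough sketch of what such byte-level classes would have to distinguish, again in plain Free Pascal. This is not an existing FLRE feature, and TUTF8ByteKind and ClassifyByte are hypothetical names. The ranges follow from the UTF-8 encoding itself.

```pascal
program ByteClasses;
{$ifdef FPC}{$mode objfpc}{$H+}{$endif}

type
  // Hypothetical classification, for illustration only.
  TUTF8ByteKind = (bkASCII, bkContinuation, bkStart, bkInvalid);

const
  KindNames: array[TUTF8ByteKind] of string =
    ('ASCII', 'continuation', 'start', 'invalid');

// Classifies a single byte by its role in UTF-8.
function ClassifyByte(b: Byte): TUTF8ByteKind;
begin
  case b of
    $00..$7F: Result := bkASCII;        // 0xxxxxxx
    $80..$BF: Result := bkContinuation; // 10xxxxxx
    $C2..$F4: Result := bkStart;        // lead byte of a 2- to 4-byte sequence
  else
    Result := bkInvalid;                // $C0, $C1, $F5..$FF never occur in valid UTF-8
  end;
end;

var
  x: UTF8String;
  i: Integer;
begin
  x := 'abc';
  x[2] := #$E8; // same invalid input as in the report
  for i := 1 to Length(x) do
    writeln(i, ': ', KindNames[ClassifyByte(Ord(x[i]))]);
end.
```

For the input from the report, byte 2 is classified as a start byte with no continuation byte after it, which is exactly the case a codepoint-wise . rejects.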