
Regular expressions can contain invalid utf8 byte sequences.

Open Stepets opened this issue 3 years ago • 0 comments

For now, the library assumes that regular expressions and text strings are valid utf8 and, as an optimization, looks only at a character's head byte to determine where the next character begins.

It doesn't work with raw bytes. While the purpose of this library is to hide the underlying byte processing, this approach makes it incompatible with the vanilla string library.

I suppose that working with broken utf8 strings and searching them with raw-byte regexes is quite a rare use case. So I won't fix it for now, but I will provide some insights into how it can be fixed.

One of the core functions of this library is utf8next. It takes a text and a byte index into it and returns the head byte index of the following utf8 character. It uses utf8charbytes, which works without validating the utf8 character. https://github.com/Stepets/utf8.lua/blob/17f4e009a22fb2f2e6ad316a05b2cca8e071fc3b/primitives/dummy.lua#L86-L92
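To illustrate why head-byte-only stepping breaks on raw bytes, here is a minimal sketch. It is not the library's code: `char_len_from_head` and `next_unchecked` are hypothetical stand-ins for utf8charbytes and utf8next, and treating a stray continuation byte as a one-byte character is an assumption made only for this example.

```lua
-- Hypothetical stand-in for utf8charbytes: the length is derived from the
-- head byte alone, and the following bytes are never inspected.
local function char_len_from_head(s, i)
  local b = s:byte(i)
  if not b then return nil end        -- index is past the end of the string
  if b < 0x80 then return 1 end       -- 0xxxxxxx: ASCII
  if b >= 0xF0 then return 4 end      -- 11110xxx
  if b >= 0xE0 then return 3 end      -- 1110xxxx
  if b >= 0xC0 then return 2 end      -- 110xxxxx
  return 1                            -- 10xxxxxx: assumed to count as one byte
end

-- Hypothetical stand-in for utf8next: jump straight to the next head byte.
local function next_unchecked(s, i)
  local len = char_len_from_head(s, i)
  if not len then return nil end
  return i + len
end
```

On valid input such as `"a\195\169"` ("a" followed by a two-byte character) the function steps from 1 to 2 to 4. On the broken input `"\195a"` it steps from 1 directly to 3, so the byte `a` is swallowed as if it were a continuation byte.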

There is also a utf8validate function that uses utf8validator as its iterator function. https://github.com/Stepets/utf8.lua/blob/17f4e009a22fb2f2e6ad316a05b2cca8e071fc3b/primitives/dummy.lua#L390-L398

utf8validator takes a text and a byte index into it and determines the expected utf8 character length. Then it checks byte after byte and returns either the head byte position of the following utf8 character or the position of the byte that breaks the utf8 sequence. So utf8validator might be usable in place of utf8next as is (needs testing).
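The validating step described above can be sketched roughly as follows. This is an illustration rather than the library's utf8validator: the names `expected_len` and `next_validated`, the second boolean return value, and the choice to step over an invalid head byte as a single raw byte are all assumptions. The sketch also only checks that continuation bytes fall in the range 0x80..0xBF; it does not reject overlong encodings or surrogates.

```lua
-- Expected sequence length for a head byte, or nil if the byte
-- cannot start a utf8 character.
local function expected_len(b)
  if b < 0x80 then return 1 end
  if b >= 0xC2 and b <= 0xDF then return 2 end
  if b >= 0xE0 and b <= 0xEF then return 3 end
  if b >= 0xF0 and b <= 0xF4 then return 4 end
  return nil
end

-- Returns the next position to look at and whether the character
-- starting at i was a well-formed sequence.
local function next_validated(s, i)
  local head = s:byte(i)
  if not head then return nil end          -- index is past the end
  local len = expected_len(head)
  if not len then return i + 1, false end  -- invalid head: skip one raw byte
  for j = i + 1, i + len - 1 do
    local b = s:byte(j)
    if not b or b < 0x80 or b > 0xBF then
      return j, false                      -- position of the breaking byte
    end
  end
  return i + len, true                     -- head byte of the next character
end
```

For the broken string `"\195a"` this returns position 2, so the byte `a` is visited on the next step instead of being skipped. The returned position is always greater than `i`, which is what an iterator-style replacement for utf8next would need.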

Next is configuration. I think it could be just a flag named something like utf8_valid_strings. utf8.next should be set according to this flag's value: https://github.com/Stepets/utf8.lua/blob/17f4e009a22fb2f2e6ad316a05b2cca8e071fc3b/primitives/dummy.lua#L527
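A rough sketch of how such a flag could pick the stepping function is shown below. Everything here is hypothetical: `select_next`, the `config` table, and the placeholder functions `fast_next` and `safe_next` (standing in for utf8next and utf8validator) are invented for illustration, and the sketch assumes the flag defaults to true so that the current behavior is preserved.

```lua
-- Placeholders standing in for the library's two stepping strategies.
local function fast_next(s, i) return i + 1 end  -- stand-in for utf8next
local function safe_next(s, i) return i + 1 end  -- stand-in for utf8validator

-- Hypothetical selector: only an explicit `false` switches to validation,
-- so a missing flag keeps the fast head-byte path.
local function select_next(config)
  if config and config.utf8_valid_strings == false then
    return safe_next
  end
  return fast_next
end

-- Hypothetical module table; the real library assigns utf8.next itself.
local lib = {}
lib.next = select_next({ utf8_valid_strings = false })
```

Selecting the function once at setup time, rather than checking the flag on every step, keeps the fast path free of extra branches for callers who know their strings are valid.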

Stepets, Aug 11 '21 15:08