utf8.lua
Regular expressions can contain invalid utf8 byte sequences.
For now the library assumes that regular expressions and text strings are valid utf8 and, as an optimization, looks only at a character's head byte to determine where the next character begins.
It doesn't work with raw bytes. While the purpose of this library is to hide the underlying byte processing, this approach makes it incompatible with the vanilla string library.
I suppose that working with broken utf8 strings and searching them with raw-byte regexes is quite a rare use case. So I won't fix it for now, but I will provide some insight into how it can be fixed.
One of the core functions of this library is `utf8next`. It takes a text and a byte index into it and returns the head byte index of the following utf8 character. It uses `utf8charbytes`, which works without validating the utf8 character.
https://github.com/Stepets/utf8.lua/blob/17f4e009a22fb2f2e6ad316a05b2cca8e071fc3b/primitives/dummy.lua#L86-L92
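To illustrate why head-byte-only stepping breaks on invalid input, here is a minimal sketch. It is not the library's code: `char_len_from_head` and `next_unchecked` are hypothetical names, and stepping one byte on an unexpected head byte is an assumption of this sketch.

```lua
-- Hypothetical helper: character length judged only by the head byte.
local function char_len_from_head(byte)
  if byte < 0x80 then return 1                        -- 0xxxxxxx: ASCII
  elseif byte >= 0xC0 and byte < 0xE0 then return 2   -- 110xxxxx
  elseif byte >= 0xE0 and byte < 0xF0 then return 3   -- 1110xxxx
  elseif byte >= 0xF0 and byte < 0xF8 then return 4   -- 11110xxx
  end
  return 1  -- continuation or invalid byte: step one byte (assumption)
end

-- Hypothetical non-validating "next": trusts the head byte and jumps.
-- Assumes 1 <= i <= #text.
local function next_unchecked(text, i)
  return i + char_len_from_head(text:byte(i))
end

local valid = "a" .. string.char(0xC3, 0xA9) .. "b"  -- "a", a 2-byte char, "b"
print(next_unchecked(valid, 2))   --> 4

local broken = string.char(0xC3) .. "b"              -- head byte without continuation
print(next_unchecked(broken, 1))  --> 3, so the "b" is skipped
```

With the broken string, the jump lands past the `b`, which is why raw-byte patterns cannot be matched reliably this way.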
There is also a `utf8validate` function that uses `utf8validator` as its iterator function.
https://github.com/Stepets/utf8.lua/blob/17f4e009a22fb2f2e6ad316a05b2cca8e071fc3b/primitives/dummy.lua#L390-L398
`utf8validator` takes a text and a byte index into it and determines the expected utf8 character length. Then it checks the bytes one by one and returns either the head byte position of the following utf8 character or the position of the byte that breaks the utf8 sequence. So `utf8validator` might be usable in place of `utf8next` as is (needs testing).
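A validating step along those lines could look roughly like this. It is a sketch, not the library's `utf8validator`: `next_checked` is a hypothetical name, the second return value is my addition, and it only checks that continuation bytes have the `10xxxxxx` form (no overlong or surrogate checks).

```lua
-- Hypothetical validating "next". Returns the head byte index of the
-- following character and true, or the index of the byte that breaks
-- the sequence and false. Assumes 1 <= i <= #text.
local function next_checked(text, i)
  local head = text:byte(i)
  local len
  if head < 0x80 then len = 1
  elseif head >= 0xC2 and head <= 0xDF then len = 2
  elseif head >= 0xE0 and head <= 0xEF then len = 3
  elseif head >= 0xF0 and head <= 0xF4 then len = 4
  else
    -- Stray continuation byte or invalid head: treat it as a single
    -- raw byte so the caller always advances (assumption of this sketch).
    return i + 1, false
  end
  for k = i + 1, i + len - 1 do
    local b = text:byte(k)
    if not b or b < 0x80 or b > 0xBF then
      return k, false  -- byte k is missing or is not a continuation byte
    end
  end
  return i + len, true
end

local broken = string.char(0xC3) .. "b"
local pos, ok = next_checked(broken, 1)
-- pos == 2 and ok == false: stepping resumes at the "b" instead of skipping it
```

The price is a loop over each character's continuation bytes on every step, which is the optimization the library currently avoids.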
Next is configuration. I think it could be just a flag named something like `utf8_valid_strings`. `utf8.next` should be set according to this flag's value:
https://github.com/Stepets/utf8.lua/blob/17f4e009a22fb2f2e6ad316a05b2cca8e071fc3b/primitives/dummy.lua#L527
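The flag wiring might be sketched as follows. Everything here is hypothetical: `lib` stands in for the module table, `configure` is an invented entry point, the two stepping functions are trivial stand-ins, and defaulting the flag to `true` (the current behaviour) is my assumption.

```lua
-- Stand-ins for the two stepping strategies; real ones would walk utf8 bytes.
local function next_unchecked(text, i) return i + 1 end
local function next_checked(text, i) return i + 1, true end

local lib = {}  -- hypothetical module table

-- Hypothetical configuration entry point: picks the stepping function
-- according to the proposed utf8_valid_strings flag.
local function configure(opts)
  local assume_valid = opts.utf8_valid_strings
  if assume_valid == nil then
    assume_valid = true  -- assumed default: keep the fast path
  end
  lib.next = assume_valid and next_unchecked or next_checked
end

configure({ utf8_valid_strings = false })
-- lib.next is now the validating variant
```

Choosing the function once at configuration time keeps the per-character hot path free of flag checks.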