simdutf
Clarification on result struct
With the `result` struct, if an error occurs in a multi-word code point, does `position` point to the beginning of the code point that can't be validated, or does it point to the word that causes the whole code point to be invalid? Example: given the following three-byte string, what is `position`? One or two?
0xxxxxxx 11000010 0xxxxxxx
Also, in the `error_code` enum, `OTHER` implies a scenario that doesn't fit the other values. An example of something that would fall into `OTHER` would be useful.
@NicolasJiaxin Want to take up this question?
@david-sledge In the case you described, it would be at position 1 (i.e. at `11000010`). Basically, `position` is the index from which the input is no longer valid. Put another way, the truncated string from index `0` to `position-1` can be transcoded/validated without errors. So, for `HEADER_BITS`, `TOO_LARGE` and `SURROGATE`, `position` is the index of the troublesome word in question. For `TOO_SHORT`, it is at the start of the code point. For `OVERLONG`, it is also at the start of the code point. And for `TOO_LONG`, it is at the first extra continuation byte.
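To make the rules above concrete, here is a minimal scalar sketch of a UTF-8 validator that reports a position the way this reply describes. It is not simdutf's implementation or API: `sketch_error`, `sketch_result` and `validate_utf8_sketch` are hypothetical names made up for illustration. Two choices are assumptions rather than something stated above: for UTF-8, `TOO_LARGE` and `SURROGATE` are reported at the first byte of the code point, and on success the position is simply the input length.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names: they mirror the vocabulary of this thread,
// but they are NOT simdutf's own declarations.
enum class sketch_error {
  SUCCESS, HEADER_BITS, TOO_SHORT, TOO_LONG, OVERLONG, TOO_LARGE, SURROGATE
};

struct sketch_result {
  sketch_error error;
  std::size_t position;  // on error: bytes [0, position) are valid UTF-8
};

inline bool is_continuation(std::uint8_t b) { return (b & 0xC0) == 0x80; }

inline sketch_result validate_utf8_sketch(const std::vector<std::uint8_t>& in) {
  // Smallest code point that legitimately needs 2, 3 or 4 bytes.
  static const std::uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
  std::size_t pos = 0;
  while (pos < in.size()) {
    const std::uint8_t lead = in[pos];
    if (lead < 0x80) { ++pos; continue; }  // ASCII byte
    // A continuation byte where a leading byte is expected:
    // report the extra continuation byte itself.
    if (is_continuation(lead)) return {sketch_error::TOO_LONG, pos};
    std::size_t len = 0;
    std::uint32_t cp = 0;
    if ((lead & 0xE0) == 0xC0)      { len = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
    else return {sketch_error::HEADER_BITS, pos};  // 11111xxx is never valid
    for (std::size_t i = 1; i < len; ++i) {
      // Missing continuation byte: report the start of the code point.
      if (pos + i >= in.size() || !is_continuation(in[pos + i]))
        return {sketch_error::TOO_SHORT, pos};
      cp = (cp << 6) | (in[pos + i] & 0x3Fu);
    }
    if (cp < min_cp[len]) return {sketch_error::OVERLONG, pos};
    // Assumption: reported at the start of the code point.
    if (cp > 0x10FFFF) return {sketch_error::TOO_LARGE, pos};
    if (cp >= 0xD800 && cp <= 0xDFFF) return {sketch_error::SURROGATE, pos};
    pos += len;
  }
  return {sketch_error::SUCCESS, in.size()};  // assumption: length on success
}
```

With the three-byte string from the question (taking `0x41` for both `0xxxxxxx` bytes), `validate_utf8_sketch({0x41, 0xC2, 0x41})` reports `TOO_SHORT` at position 1, and the one-byte prefix before that position validates cleanly on its own.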
As for the `OTHER` error, for now we only use it when the detected architecture is not supported. It is not related to the encodings at all.
I will clarify all of that in the documentation. Hopefully, I explained it clearly. Thanks for the question!
Closing. The reporter should open a new issue if the answer is not satisfying.