simdutf

Clarification on result struct

Open david-sledge opened this issue 1 year ago • 2 comments

With the result struct, if an error occurs in a multi-word code point, does position point to the beginning of the code point that can't be validated or does it point to the word that causes the whole code point to be invalid? Example: given the following three-byte string, what is position? One or two?

0xxxxxxx 11000010 0xxxxxxx

Also, in the error_code enum, OTHER implies a scenario that doesn't fit the other values. An example of something that would fall into OTHER would be useful.

david-sledge avatar Aug 08 '22 23:08 david-sledge

@NicolasJiaxin Want to take up this question?

lemire avatar Aug 09 '22 00:08 lemire

@david-sledge In the case you described, it would be at position 1 (i.e. at 11000010). Basically, position is the index from which the input is no longer valid. Put another way, the truncated string from index 0 to position-1 can be transcoded/validated without errors. So for HEADER_BITS, TOO_LARGE and SURROGATE, position is the index of the troublesome word in question. For TOO_SHORT and OVERLONG, it is at the start of the code point. And for TOO_LONG, it is at the first extra continuation byte.
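To make the position rules above concrete for UTF-8, here is a minimal scalar sketch. It is not simdutf's implementation or API: the names `error_kind`, `check_result` and `validate_utf8_sketch` are hypothetical, and returning the input length as the position on success is an assumption made for this example only.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical names for illustration only; not simdutf's types.
enum class error_kind {
  SUCCESS, HEADER_BITS, TOO_SHORT, TOO_LONG, OVERLONG, TOO_LARGE, SURROGATE
};

struct check_result {
  error_kind error;
  std::size_t position; // index where the input stops being valid
                        // (assumption: equals the length on success)
};

// Scalar UTF-8 check that reports positions following the rules described above.
inline check_result validate_utf8_sketch(const unsigned char *buf, std::size_t len) {
  // Smallest code point that legitimately needs n bytes (index = n).
  static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
  std::size_t pos = 0;
  while (pos < len) {
    const unsigned char lead = buf[pos];
    if (lead < 0x80) { ++pos; continue; } // ASCII byte
    if ((lead & 0xC0) == 0x80) {
      // Continuation byte with no code point to attach to:
      // reported at that extra continuation byte.
      return {error_kind::TOO_LONG, pos};
    }
    std::size_t n;    // total bytes announced by the leading byte
    std::uint32_t cp; // decoded code point
    if ((lead & 0xE0) == 0xC0)      { n = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { n = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { n = 4; cp = lead & 0x07; }
    else return {error_kind::HEADER_BITS, pos}; // 11111xxx is never valid
    for (std::size_t i = 1; i < n; ++i) {
      // Missing continuation byte: reported at the start of the code point.
      if (pos + i >= len || (buf[pos + i] & 0xC0) != 0x80) {
        return {error_kind::TOO_SHORT, pos};
      }
      cp = (cp << 6) | (buf[pos + i] & 0x3Fu);
    }
    if (cp < min_cp[n]) return {error_kind::OVERLONG, pos};
    if (cp > 0x10FFFF) return {error_kind::TOO_LARGE, pos};
    if (cp >= 0xD800 && cp <= 0xDFFF) return {error_kind::SURROGATE, pos};
    pos += n;
  }
  return {error_kind::SUCCESS, len};
}
```

With the bytes 0x41 0xC2 0x41 as one concrete instance of the pattern in the question, this sketch reports TOO_SHORT at position 1, and the prefix from index 0 to position-1 (the single byte 0x41) is valid on its own.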

As for the OTHER error, for now we only use it when the detected architecture is not supported. It is not related to the encodings at all.

I will clarify all of that in the documentation. Hopefully, I explained it clearly. Thanks for the question!

NicolasJiaxin avatar Aug 09 '22 01:08 NicolasJiaxin

Closing. The reporter should open a new issue if the answer is not satisfactory.

lemire avatar Oct 27 '22 15:10 lemire