utf8 validator improvements
Yesterday I more or less finished my adapted version of the utf8 validator and pushed it to GitHub.
It's all in a single file, with some parts of the code still heavily in development, but here is a permalink to the code.
My code has a few interesting parts:
- I made a SWAR emulation fallback for every single function.
- Everything we need to do in SIMD can be done in SWAR, except vectorized table lookups. For that, we have to go one by one.
- I'm not sure if this is a better strategy than the scalar fallback provided in this library, but it could be something worth thinking about. For me it makes a lot of sense because I want to try to have a single design that works across hardware, and I eventually want to get the compiler to be able to create the SWAR implementations automatically.
- The only pieces of data I store in the utf8 validator are the previous-input-chunk and a boolean which keeps track of whether we're in an invalid end state.
- This design is based on the idea that 99.99% of chunks I parse are going to be entirely ascii. Therefore, I do not want to move the data from `prev_incomplete` to `error` for chunks of all ascii characters (because most of the time, the data doesn't change). In this library it looks like this:

      this->error = _mm512_or_si512(this->error, this->prev_incomplete)

- I also want to make that `errors` function cheaper:

      simdutf_really_inline bool errors() const { return _mm512_test_epi8_mask(this->error, this->error) != 0; }

  The `vptest` instruction is not one cycle. Therefore I don't think it makes sense to make the average case run this check every chunk when the value of `error` has not changed. In my code, I check for utf8 errors in the non-ascii branch as soon as I have that data, and then I write the result of the `errors` function here into a boolean that says whether it's valid to end on this block. That way, in the average case, we're checking a boolean which hopefully can have the overhead of just one cycle.
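To make the SWAR point from the first bullets concrete, here is a minimal sketch, assuming a 64-bit word stands in for a vector register. It is not taken from my code or from simdutf, and the names `load8`, `is_ascii_swar`, and `lookup_high_nibbles` are made up for illustration. It shows one operation that maps cleanly onto SWAR (the all-ascii test) and one that does not (a table lookup, which has to go lane by lane).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Load 8 bytes into one 64-bit word (memcpy sidesteps alignment and aliasing issues).
inline std::uint64_t load8(const unsigned char* p) {
    std::uint64_t w;
    std::memcpy(&w, p, sizeof w);
    return w;
}

// SWAR stand-in for a SIMD "is this chunk all ascii?" test:
// a byte is ascii exactly when its top bit is clear, so one AND covers all 8 lanes.
inline bool is_ascii_swar(std::uint64_t w) {
    return (w & 0x8080808080808080ULL) == 0;
}

// There is no SWAR equivalent of a byte shuffle, so a 16-entry table lookup
// keyed on each byte's high nibble is done one lane at a time.
inline std::uint64_t lookup_high_nibbles(std::uint64_t w,
                                         const std::uint8_t (&table)[16]) {
    std::uint64_t out = 0;
    for (int lane = 0; lane < 8; ++lane) {
        const std::uint8_t byte = static_cast<std::uint8_t>(w >> (8 * lane));
        out |= static_cast<std::uint64_t>(table[byte >> 4]) << (8 * lane);
    }
    return out;
}
```

The lookup loop is where the SWAR path would presumably lose most of its ground against a real shuffle instruction; everything else stays a handful of word-wide operations.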
Not sure if this design is entirely transferable to simdutf. It looks like it was designed in such a way that you just keep OR'ing with the error vector and you don't check errors() until the end. My idea was tailored to doing an early-out, so that might be different from the goal here. Still, I thought maybe there could be an idea worth looking at.
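Roughly, the early-out idea looks like the sketch below. This is a simplified scalar illustration, not my actual code and not simdutf's: `kChunk`, `EndStateTracker`, `feed`, and `valid_end` are hypothetical names, the chunk is only 8 bytes, and the non-ascii branch only refreshes the cached end state instead of doing full validation. The point is that the ascii branch reads one boolean and writes nothing.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kChunk = 8;  // hypothetical chunk width for this sketch

// True if any byte in the chunk has its top bit set.
inline bool has_non_ascii(const std::uint8_t* chunk) {
    std::uint8_t acc = 0;
    for (std::size_t i = 0; i < kChunk; ++i) acc |= chunk[i];
    return (acc & 0x80) != 0;
}

// True if the chunk stops in the middle of a multi-byte sequence:
// any lead byte in the last position, a 3- or 4-byte lead in the
// second-to-last, or a 4-byte lead in the third-to-last.
inline bool ends_incomplete(const std::uint8_t* chunk) {
    return chunk[kChunk - 1] >= 0xC0 ||
           chunk[kChunk - 2] >= 0xE0 ||
           chunk[kChunk - 3] >= 0xF0;
}

struct EndStateTracker {
    bool incomplete_at_end = false;  // the only state the ascii path looks at

    // Returns false as soon as an error is known (early-out).
    bool feed(const std::uint8_t* chunk) {
        if (!has_non_ascii(chunk)) {
            // Fast path: no state is written. An ascii chunk is only an
            // error if the previous chunk left a sequence unfinished.
            return !incomplete_at_end;
        }
        // Slow path: a real validator would check this chunk against the
        // previous one here. This sketch only updates the cached end state.
        incomplete_at_end = ends_incomplete(chunk);
        return true;
    }

    // Whether the input may legally end after the last chunk fed.
    bool valid_end() const { return !incomplete_at_end; }
};
```

With the accumulate-and-check-at-the-end approach, the equivalent of `incomplete_at_end` would be OR'ed into an error value on every chunk and only tested once at the end, which is the tradeoff I'm describing above.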
― Validark
Please see `validate_utf8_with_errors`, which implements the functionality you refer to. We will eagerly consider a pull request.
cc @Nick-Nuon