utf8 validator improvements

Open Validark opened this issue 1 year ago • 1 comments

Yesterday I more or less finished and pushed to Github my adapted version of the utf8 validator.

It's in a monofile with some parts of the code still heavily in-dev, but here is a permalink to the code.

My code has a few interesting parts:

I made a SWAR emulation fallback for every single function.
- Everything we need to do in SIMD can be done in SWAR, except vectorized table lookups. For that, we have to go one by one.
- I'm not sure if this is a better strategy than the scalar fallback provided in this library, but it could be something worth thinking about. For me it makes a lot of sense because I want to try to have a single design that works across hardware, and I eventually want to get the compiler to be able to create the SWAR implementations automatically.
The only pieces of data I store in the utf8 validator is the previous-input-chunk and a boolean which keeps track of whether we're in an invalid end state.
- This design is based on the idea that 99.99% of chunks I parse are going to be entirely ascii. Therefore, I do not want to move the data from prev_incomplete to error for chunks of all ascii characters (because most of the time, the data doesn't change). In this library it looks like this:
```
this->error = _mm512_or_si512(this->error, this->prev_incomplete)
```
- I also want to make that errors function cheaper. The vptest instruction is not one cycle. Therefore I don't think it makes sense to make the average case run this check every chunk when the value of error has not changed.
```
simdutf_really_inline bool errors() const {
    return _mm512_test_epi8_mask(this->error, this->error) != 0;
}
```
In my code, I check for utf8 errors in the non-ascii branch as soon as I have that data, and then I write the result of the errors function here into a boolean that says whether it's valid to end on this block. That way, in the average case, we're checking a boolean which hopefully can have the overhead of just one cycle.

Not sure if this design is entirely transferable to simdutf. It looks like it was designed in such a way that you just keep OR'ing with the error vector and you don't check errors() until the end. My idea was tailored to doing an early-out, so that might be different from the goal here. Still, I thought maybe there could be an idea worth looking at.

― Validark

Feb 01 '24 09:02 Validark

Please see validate_utf8_with_errors which implements the functionality you refer to. We will eagerly consider a pull request.

cc @Nick-Nuon

Feb 01 '24 13:02 lemire