Use two table lookups instead of the existing three table lookups in Utf8Validator

Open jatin-bhateja opened this issue 3 months ago • 0 comments

I have been experimenting with Utf8Validator and found that the existing handling uses three lookup tables. For each pair of consecutive bytes, they are indexed by the upper and lower nibbles of the first byte and the upper nibble of the second byte, to catch various error scenarios.
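
To make the indexing concrete, here is a minimal scalar sketch of how the three 4-bit indices could be derived from a byte pair. The class and method names are hypothetical and are not taken from simdjson-java; the table contents are deliberately left out.

```java
// Hypothetical helper for illustration only; not part of simdjson-java.
final class ThreeTableIndices {
    // Upper nibble of the first byte of the pair (4 bits).
    static int firstHigh(int first) {
        return (first & 0xFF) >>> 4;
    }

    // Lower nibble of the first byte of the pair (4 bits).
    static int firstLow(int first) {
        return first & 0x0F;
    }

    // Upper nibble of the second byte of the pair (4 bits).
    static int secondHigh(int second) {
        return (second & 0xFF) >>> 4;
    }

    public static void main(String[] args) {
        // 0xE2 0x82 are the first two bytes of the UTF-8 encoding of U+20AC.
        int first = 0xE2;
        int second = 0x82;
        System.out.printf("firstHigh=%x firstLow=%x secondHigh=%x%n",
                firstHigh(first), firstLow(first), secondHigh(second));
    }
}
```

Each index selects one entry in its own 16-entry table, so three lookups are needed per byte pair.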

Effectively, we use twelve bits, 8 from the first byte and 4 from the second byte, for lookups in three 16-byte tables. The following PoC implementation[1] instead uses two 64-byte lookup tables accessed with 6-bit indices. For the first lookup, the index is composed of the least significant 6 bits of the first byte; for the second lookup, the index concatenates the upper nibble of the second byte with the most significant two bits of the first byte.
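
A minimal scalar sketch of the two 6-bit indices described above follows. The names are hypothetical, and the bit order inside the second index (second byte's upper nibble in the high bits, first byte's top two bits in the low bits) is an assumption; the linked PoC[1] may pack them differently. The `firstByte` helper only demonstrates that the two indices together still carry the same twelve bits.

```java
// Hypothetical helper for illustration only; not part of simdjson-java or the PoC.
final class TwoTableIndices {
    // Index for the first 64-entry table: least significant 6 bits of the first byte.
    static int firstIndex(int first) {
        return first & 0x3F;
    }

    // Index for the second 64-entry table: upper nibble of the second byte
    // combined with the most significant two bits of the first byte.
    // ASSUMPTION: the nibble occupies the high 4 bits of the index.
    static int secondIndex(int first, int second) {
        int topTwo = (first & 0xFF) >>> 6;       // 2 bits
        int secondHigh = (second & 0xFF) >>> 4;  // 4 bits
        return (secondHigh << 2) | topTwo;
    }

    // Recovers the full first byte from the two indices, showing no bits are lost.
    static int firstByte(int firstIndex, int secondIndex) {
        return ((secondIndex & 0x3) << 6) | firstIndex;
    }

    public static void main(String[] args) {
        // 0xE2 0x82 are the first two bytes of the UTF-8 encoding of U+20AC.
        int i1 = firstIndex(0xE2);
        int i2 = secondIndex(0xE2, 0x82);
        System.out.printf("firstIndex=%d secondIndex=%d firstByte=%x%n",
                i1, i2, firstByte(i1, i2));
    }
}
```

Because both indices fit in 6 bits, each lookup addresses a single 64-byte table, which is what lets the scheme drop from three lookups to two.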

I see around a 5-7% performance improvement[2] over the three-table lookup.

The algorithm can be ported directly to Utf8Validator.

Best Regards, Jatin

[1] https://github.com/jatin-bhateja/external_staging/blob/main/Code/java/vector-api/simd_json/ThreeVsTwoTableLookup.java
[2] https://github.com/jatin-bhateja/external_staging/blob/main/Code/java/vector-api/simd_json/performance_3Tvs2T_lookup.txt

jatin-bhateja, Aug 12 '25 19:08