jackson-core
jackson-core copied to clipboard
Novel escaping algorithm
Is your feature request related to a problem? Please describe.
Hi,
I made a novel, simple and super fast branch-free escaping algorithm at https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/pull/116 and I would like to contributi here if it is welcome. Any feedback? Where I should look at?
Describe the solution you'd like
It's based on https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/pull/116
Usage example
No response
Additional context
No response
Can you provide an example in jackson-databind where this would be used?
yes, actually I see this could be "partially" applied here -> https://github.com/FasterXML/jackson-core/blob/2.19/src/main/java/com/fasterxml/jackson/core/json/UTF8JsonGenerator.java#L1758-L1765
What you got here is exactly the same as https://lemire.me/blog/2024/10/14/table-lookups-are-efficient/ which I have improved in https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/6236934cffb0de1d3f17dca43d242b735a4a2125/2024/10/14/src/main/java/me/lemire/MyBenchmark.java#L258-L269
i.e.
- the latin replacement path should lookup into a int[] table which pack: how many chars the replacement has + the two bytes replacements
- the non replaceable chars should still have an entry in the lookup table but which contains 1 as replacement length and the original char with a bogus byte (0? who cares, the length of the replacement won't make it important)
- we should always write 2 chars regardless and move the output index using what we read int the lookup table (yes, it means always writing and than overwriting)
In this way we can reduce the number of branches trying to optimize whatever is latin and NOT belonging to special control chars (0-31 IIRC?). Not fully branch-free since it should requires a much deeper change, really - but near to for the "common" cases
Might be some partial overlap with #1349. Jackson-core would probably be a better place for an escaping implementation.
yep it looks like it could be part of such, but please :pray: suggest to that user to use JMH and not handrolled bench... :cry:
Yes, escaping definitely in jackson-core, UTF8JsonGenerator sounds like the place as well.
Would perhaps suggest working on master branch (for Jackson 3.0), although jackson-core diffs b/w 2.x and 3.0 are not huge.
yep it looks like it could be part of such, but please :pray: suggest to that user to use JMH and not handrolled bench... :cry:
Lol, JMH was used initially (as mentioned in the PR's description) but isn't being used in this repo, hence this manual stuff was copied from an existing test. Yep, that was painful to author :')
There's also vectorization opportunities here as mentioned in the PR description, but Java's vectorization support is still incubating I believe. I don't know if this decoding loop gets auto-vectorized by the JIT even without an explicit SIMD impl though.