jackson-core icon indicating copy to clipboard operation
jackson-core copied to clipboard

Novel escaping algorithm

Open franz1981 opened this issue 10 months ago • 6 comments

Is your feature request related to a problem? Please describe.

Hi,

I made a novel, simple and super fast branch-free escaping algorithm at https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/pull/116 and I would like to contributi here if it is welcome. Any feedback? Where I should look at?

Describe the solution you'd like

It's based on https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/pull/116

Usage example

No response

Additional context

No response

franz1981 avatar Jan 10 '25 09:01 franz1981

Can you provide an example in jackson-databind where this would be used?

pjfanning avatar Jan 10 '25 11:01 pjfanning

yes, actually I see this could be "partially" applied here -> https://github.com/FasterXML/jackson-core/blob/2.19/src/main/java/com/fasterxml/jackson/core/json/UTF8JsonGenerator.java#L1758-L1765

What you got here is exactly the same as https://lemire.me/blog/2024/10/14/table-lookups-are-efficient/ which I have improved in https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/6236934cffb0de1d3f17dca43d242b735a4a2125/2024/10/14/src/main/java/me/lemire/MyBenchmark.java#L258-L269

i.e.

  • the latin replacement path should lookup into a int[] table which pack: how many chars the replacement has + the two bytes replacements
  • the non replaceable chars should still have an entry in the lookup table but which contains 1 as replacement length and the original char with a bogus byte (0? who cares, the length of the replacement won't make it important)
  • we should always write 2 chars regardless and move the output index using what we read int the lookup table (yes, it means always writing and than overwriting)

In this way we can reduce the number of branches trying to optimize whatever is latin and NOT belonging to special control chars (0-31 IIRC?). Not fully branch-free since it should requires a much deeper change, really - but near to for the "common" cases

franz1981 avatar Jan 10 '25 13:01 franz1981

Might be some partial overlap with #1349. Jackson-core would probably be a better place for an escaping implementation.

pjfanning avatar Jan 10 '25 14:01 pjfanning

yep it looks like it could be part of such, but please :pray: suggest to that user to use JMH and not handrolled bench... :cry:

franz1981 avatar Jan 10 '25 14:01 franz1981

Yes, escaping definitely in jackson-core, UTF8JsonGenerator sounds like the place as well.

Would perhaps suggest working on master branch (for Jackson 3.0), although jackson-core diffs b/w 2.x and 3.0 are not huge.

cowtowncoder avatar Jan 10 '25 17:01 cowtowncoder

yep it looks like it could be part of such, but please :pray: suggest to that user to use JMH and not handrolled bench... :cry:

Lol, JMH was used initially (as mentioned in the PR's description) but isn't being used in this repo, hence this manual stuff was copied from an existing test. Yep, that was painful to author :')

There's also vectorization opportunities here as mentioned in the PR description, but Java's vectorization support is still incubating I believe. I don't know if this decoding loop gets auto-vectorized by the JIT even without an explicit SIMD impl though.

JoostK avatar Apr 08 '25 03:04 JoostK