jsoup icon indicating copy to clipboard operation
jsoup copied to clipboard

CharacterReader always allocate a 32 KB buffer that can even exceed document size

Open chibenwa opened this issue 3 years ago • 7 comments

This might be overkill and we might be able to decrease the size of this buffer for small strings to be parsed.

This should thus reduce memory allocation.

Here is a small async profiler memory flame graph showing this:

Screenshot from 2022-05-16 15-36-50

chibenwa avatar May 16 '22 08:05 chibenwa

I think the difficulty will be in knowing upfront the expected content length of the response to parse. My understanding in practice is that if it is set, the Content Length header is often incorrect. And so browsers typically ignore it / handle larger sizes.

Have you actually profiled this to be a performance issue? I picked that number based on a bit of empirical benchmarking and testing and found it worked reasonably.

Another approach would be to reuse the buffer on subsequent reads, so the allocation doesn't need to happen. We use that pattern in e.g. StringBuilders.

jhy avatar May 17 '22 06:05 jhy

Have you actually profiled this to be a performance issue? I picked that number based on a bit of empirical benchmarking and testing and found it worked reasonably.

No not that critical, though you get many small HTLM emails in an email server.

chibenwa avatar May 17 '22 07:05 chibenwa

OK yes in that use-case I think it would make sense. A pattern of a threadlocal reusable buffer would work, similar to the Builders in StringUtil.

jhy avatar May 17 '22 10:05 jhy

Thinking more about this, is there something that prevent doing (potentially opt-in) buffer recyclingjust like Jackson JSON parsers does ?

I am working on a similar approach in MIME4J and it did so far yield a 40& allocation reduction that translated in micro-benchmarks into a 10-15% performance improvment.

I would be happy to cary other such a contribution.

chibenwa avatar Jun 19 '22 05:06 chibenwa

I'm not familiar with the implementation of Jackson parser. Recyling the buffer makes sense. I don't know that it should be opt-in as I would expect most people to miss it if it were.

Would be happy to review a PR!

jhy avatar Jun 24 '22 07:06 jhy

Question before getting started: are there some jmh benchmarks somewhere? Would you be interested in getting a set of jmh benchmarks as well?

Will be in vacations all of july so nothing would likely happen before august.

chibenwa avatar Jun 28 '22 04:06 chibenwa

I did succeed to put together https://github.com/jhy/jsoup/pull/1800 before leaving.

Looks like massive gains are ahead ;-)

chibenwa avatar Jul 01 '22 02:07 chibenwa