Implement buffer recycling for CharacterReader
Before
Benchmark Mode Cnt Score Error Units
JMHBenchmark.benchmarkSmall avgt 5 7.382 ± 0.940 us/op
JMHBenchmark.benchmarkSmall:·gc.alloc.rate avgt 5 8572.806 ± 1082.658 MB/sec
JMHBenchmark.benchmarkSmall:·gc.alloc.rate.norm avgt 5 72952.001 ± 0.001 B/op
JMHBenchmark.benchmarkSmall:·gc.churn.G1_Eden_Space avgt 5 8776.474 ± 1059.021 MB/sec
JMHBenchmark.benchmarkSmall:·gc.churn.G1_Eden_Space.norm avgt 5 74688.242 ± 767.418 B/op
JMHBenchmark.benchmarkSmall:·gc.churn.G1_Old_Gen avgt 5 0.195 ± 0.016 MB/sec
JMHBenchmark.benchmarkSmall:·gc.churn.G1_Old_Gen.norm avgt 5 1.664 ± 0.174 B/op
JMHBenchmark.benchmarkSmall:·gc.count avgt 5 523.000 counts
JMHBenchmark.benchmarkSmall:·gc.time avgt 5 761.000 ms
JMHBenchmark.benchmarkMedium avgt 5 29.383 ± 1.479 us/op
JMHBenchmark.benchmarkMedium:·gc.alloc.rate avgt 5 2531.961 ± 127.589 MB/sec
JMHBenchmark.benchmarkMedium:·gc.alloc.rate.norm avgt 5 85824.002 ± 0.001 B/op
JMHBenchmark.benchmarkMedium:·gc.churn.G1_Eden_Space avgt 5 2522.477 ± 106.173 MB/sec
JMHBenchmark.benchmarkMedium:·gc.churn.G1_Eden_Space.norm avgt 5 85506.478 ± 2628.227 B/op
JMHBenchmark.benchmarkMedium:·gc.churn.G1_Old_Gen avgt 5 0.207 ± 0.036 MB/sec
JMHBenchmark.benchmarkMedium:·gc.churn.G1_Old_Gen.norm avgt 5 7.031 ± 1.401 B/op
JMHBenchmark.benchmarkMedium:·gc.count avgt 5 372.000 counts
JMHBenchmark.benchmarkMedium:·gc.time avgt 5 371.000 ms
After
Benchmark Mode Cnt Score Error Units
JMHBenchmark.benchmarkSmall avgt 5 2.038 ± 0.102 us/op
JMHBenchmark.benchmarkSmall:·gc.alloc.rate avgt 5 3205.214 ± 160.609 MB/sec
JMHBenchmark.benchmarkSmall:·gc.alloc.rate.norm avgt 5 7536.000 ± 0.001 B/op
JMHBenchmark.benchmarkSmall:·gc.churn.G1_Eden_Space avgt 5 3175.262 ± 132.074 MB/sec
JMHBenchmark.benchmarkSmall:·gc.churn.G1_Eden_Space.norm avgt 5 7465.807 ± 139.856 B/op
JMHBenchmark.benchmarkSmall:·gc.churn.G1_Old_Gen avgt 5 0.194 ± 0.016 MB/sec
JMHBenchmark.benchmarkSmall:·gc.churn.G1_Old_Gen.norm avgt 5 0.455 ± 0.038 B/op
JMHBenchmark.benchmarkSmall:·gc.count avgt 5 439.000 counts
JMHBenchmark.benchmarkSmall:·gc.time avgt 5 397.000 ms
JMHBenchmark.benchmarkMedium avgt 5 21.085 ± 1.112 us/op
JMHBenchmark.benchmarkMedium:·gc.alloc.rate avgt 5 833.438 ± 44.777 MB/sec
JMHBenchmark.benchmarkMedium:·gc.alloc.rate.norm avgt 5 20272.002 ± 0.001 B/op
JMHBenchmark.benchmarkMedium:·gc.churn.G1_Eden_Space avgt 5 827.537 ± 70.651 MB/sec
JMHBenchmark.benchmarkMedium:·gc.churn.G1_Eden_Space.norm avgt 5 20126.922 ± 783.745 B/op
JMHBenchmark.benchmarkMedium:·gc.churn.G1_Old_Gen avgt 5 0.206 ± 0.016 MB/sec
JMHBenchmark.benchmarkMedium:·gc.churn.G1_Old_Gen.norm avgt 5 5.021 ± 0.162 B/op
JMHBenchmark.benchmarkMedium:·gc.count avgt 5 268.000 counts
JMHBenchmark.benchmarkMedium:·gc.time avgt 5 193.000 ms
Conclusion
This changeset significantly improves the memory efficiency of jsoup, which translates into substantial performance gains.
Good stuff, thanks! I may move some things around (not sure if the recycler should be a public contract API, etc).
Enjoy your vacation :)
The next big thing is improving the efficiency of the InputStream-based APIs:
```java
@Benchmark
public void benchmarkMediumInputStream(Blackhole bh) throws Exception {
    bh.consume(Jsoup.parse(new ByteArrayInputStream(CONTENT_MEDIUM_AS_BYTES),
        StandardCharsets.UTF_8.name(), ""));
}
```
Yields
Benchmark Mode Cnt Score Error Units
JMHBenchmark.benchmarkMediumInputStream avgt 5 35.285 ± 1.339 us/op
JMHBenchmark.benchmarkMediumInputStream:·gc.alloc.rate avgt 5 3504.566 ± 133.319 MB/sec
JMHBenchmark.benchmarkMediumInputStream:·gc.alloc.rate.norm avgt 5 142656.003 ± 0.001 B/op
JMHBenchmark.benchmarkMediumInputStream:·gc.churn.G1_Eden_Space avgt 5 3497.228 ± 134.669 MB/sec
JMHBenchmark.benchmarkMediumInputStream:·gc.churn.G1_Eden_Space.norm avgt 5 142359.431 ± 3459.922 B/op
JMHBenchmark.benchmarkMediumInputStream:·gc.churn.G1_Old_Gen avgt 5 0.379 ± 0.356 MB/sec
JMHBenchmark.benchmarkMediumInputStream:·gc.churn.G1_Old_Gen.norm avgt 5 15.420 ± 13.903 B/op
JMHBenchmark.benchmarkMediumInputStream:·gc.count avgt 5 416.000 counts
JMHBenchmark.benchmarkMediumInputStream:·gc.time avgt 5 423.000 ms
Which, memory-wise, is terrible compared to the String version... I'd be better off creating the String prior to parsing rather than supplying the InputStream....
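For comparison, draining the stream into a String up front might look roughly like this. `readFully` is a hypothetical helper sketched here for illustration, not a jsoup API:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReadAllDemo {
    // Drain an InputStream into a String so the parser can take the
    // cheaper String-based path (hypothetical helper, not jsoup API).
    static String readFully(InputStream in, Charset cs) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), cs);
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("<p>hi</p>".getBytes(StandardCharsets.UTF_8));
        String html = readFully(in, StandardCharsets.UTF_8);
        // then: Jsoup.parse(html, "") instead of the InputStream overload
        System.out.println(html);
    }
}
```

The tradeoff is holding the whole document in memory at once, but per the numbers above that is what the InputStream path effectively does anyway, with extra intermediate allocations on top.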
Benchmark Mode Cnt Score Error Units
JMHBenchmark.benchmarkMedium avgt 5 20.314 ± 1.621 us/op
JMHBenchmark.benchmarkMedium:·gc.alloc.rate avgt 5 654.901 ± 51.181 MB/sec
JMHBenchmark.benchmarkMedium:·gc.alloc.rate.norm avgt 5 15344.002 ± 0.001 B/op
JMHBenchmark.benchmarkMedium:·gc.churn.G1_Eden_Space avgt 5 650.845 ± 57.968 MB/sec
JMHBenchmark.benchmarkMedium:·gc.churn.G1_Eden_Space.norm avgt 5 15248.663 ± 495.411 B/op
JMHBenchmark.benchmarkMedium:·gc.churn.G1_Old_Gen avgt 5 0.193 ± 0.031 MB/sec
JMHBenchmark.benchmarkMedium:·gc.churn.G1_Old_Gen.norm avgt 5 4.533 ± 0.927 B/op
JMHBenchmark.benchmarkMedium:·gc.count avgt 5 244.000 counts
JMHBenchmark.benchmarkMedium:·gc.time avgt 5 174.000 ms
JMHBenchmark.benchmarkMediumInputStream avgt 5 32.855 ± 1.372 us/op
JMHBenchmark.benchmarkMediumInputStream:·gc.alloc.rate avgt 5 3633.766 ± 150.956 MB/sec
JMHBenchmark.benchmarkMediumInputStream:·gc.alloc.rate.norm avgt 5 137728.003 ± 0.001 B/op
JMHBenchmark.benchmarkMediumInputStream:·gc.churn.G1_Eden_Space avgt 5 3632.474 ± 189.650 MB/sec
JMHBenchmark.benchmarkMediumInputStream:·gc.churn.G1_Eden_Space.norm avgt 5 137676.632 ± 2596.732 B/op
JMHBenchmark.benchmarkMediumInputStream:·gc.churn.G1_Old_Gen avgt 5 0.339 ± 0.368 MB/sec
JMHBenchmark.benchmarkMediumInputStream:·gc.churn.G1_Old_Gen.norm avgt 5 12.879 ± 14.070 B/op
JMHBenchmark.benchmarkMediumInputStream:·gc.count avgt 5 420.000 counts
JMHBenchmark.benchmarkMediumInputStream:·gc.time avgt 5 427.000 ms
JMHBenchmark.benchmarkSmall avgt 5 1.656 ± 0.035 us/op
JMHBenchmark.benchmarkSmall:·gc.alloc.rate avgt 5 1699.708 ± 36.050 MB/sec
JMHBenchmark.benchmarkSmall:·gc.alloc.rate.norm avgt 5 3248.000 ± 0.001 B/op
JMHBenchmark.benchmarkSmall:·gc.churn.G1_Eden_Space avgt 5 1681.110 ± 63.848 MB/sec
JMHBenchmark.benchmarkSmall:·gc.churn.G1_Eden_Space.norm avgt 5 3212.422 ± 74.478 B/op
JMHBenchmark.benchmarkSmall:·gc.churn.G1_Old_Gen avgt 5 0.186 ± 0.015 MB/sec
JMHBenchmark.benchmarkSmall:·gc.churn.G1_Old_Gen.norm avgt 5 0.355 ± 0.034 B/op
JMHBenchmark.benchmarkSmall:·gc.count avgt 5 337.000 counts
JMHBenchmark.benchmarkSmall:·gc.time avgt 5 270.000 ms
(Force-pushed to resolve the conflict)
Hello @jhy , any status on this work?
Also, would it make sense to add a JVM system property to turn off the pooling behavior if it's not desired by the end user?
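For instance, a flag read once at class-load time would keep the per-call check free. The property name `jsoup.pool.enabled` below is purely hypothetical; jsoup defines no such property today:

```java
public class PoolToggleDemo {
    // Hypothetical opt-out flag; defaults to pooling on when unset.
    static final boolean POOLING_ENABLED =
        Boolean.parseBoolean(System.getProperty("jsoup.pool.enabled", "true"));

    public static void main(String[] args) {
        // Run with -Djsoup.pool.enabled=false to disable (hypothetically).
        System.out.println("pooling enabled: " + POOLING_ENABLED);
    }
}
```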
Hello @jhy
What can I do to help move this topic forward?
My apologies for the radio silence @chibenwa, I was pulled away from this.
I have a refactoring of this to make the recyclers more generic, and will push an update for your review shortly.
Apologies for the force-push, but I wanted to bring the branch current to HEAD.
Here's a WIP of the refactoring to use a generic pool approach. I haven't profiled it much yet. I'd be glad for your review and profiling.
A couple points:
- I temporarily disabled the benchmark just on committing these, but please re-enable for your testing. When we land this, I think I will remove the benchmark from the commit, and move out to a distinct repo (so that versions of jsoup can be compared more readily)
- I made the char buffer size constant, instead of allocating smaller ephemeral buffers when the input was smaller than the buffer size. I figure that because we are using a constant pool, it is better to recycle somewhat larger objects than to thrash on smaller ones
- I want to make the DataUtil buffers use this too - I think I can just remove the BufferedReader (which internally allocates a new buffer), because the CharacterReader implements its own buffer. Having two is redundant
- It would be good if the small HTML test had more varied inputs. Otherwise the String Cache in CharacterReader is always going to have an artificially high hit rate, which might bias the performance tests.
- I don't know that we need to make these pools optional via a tuning configuration. I would prefer to just get it to work well for all circumstances. Just one less thing for the developer to worry about.
- In contrast to the last point however, I wonder if we should make the StringCache optional, or automatically switch it on/off depending on the platform (server or Android). I have found during perf testing that the overhead of maintaining it when not on Android is worse than the allocation cost.
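The "generic pool approach" described above could be sketched roughly as a thread-local, soft-referenced slot, so a pooled buffer is reused across parses on the same thread but can still be reclaimed by the GC under memory pressure. The names here (`SimplePool`, `borrow`, `release`) are illustrative, not the actual jsoup implementation:

```java
import java.lang.ref.SoftReference;
import java.util.function.Supplier;

// Rough sketch of a generic recycler: one cached object per thread,
// held softly so the GC can reclaim it if memory gets tight.
final class SimplePool<T> {
    private final ThreadLocal<SoftReference<T>> slot = new ThreadLocal<>();
    private final Supplier<T> factory;

    SimplePool(Supplier<T> factory) { this.factory = factory; }

    // Hand out the cached object if the GC hasn't cleared it, else allocate.
    T borrow() {
        SoftReference<T> ref = slot.get();
        T obj = (ref != null) ? ref.get() : null;
        if (obj == null) obj = factory.get();
        slot.set(null); // the caller owns it until release()
        return obj;
    }

    // Return the object for reuse by the next borrow() on this thread.
    void release(T obj) {
        slot.set(new SoftReference<>(obj));
    }
}

public class PoolDemo {
    public static void main(String[] args) {
        SimplePool<char[]> bufPool = new SimplePool<>(() -> new char[32 * 1024]);
        char[] a = bufPool.borrow();
        bufPool.release(a);
        char[] b = bufPool.borrow();
        // Usually the same instance comes back, unless the GC cleared it.
        System.out.println("recycled: " + (a == b));
    }
}
```

A ThreadLocal slot sidesteps contention entirely, at the cost of one cached buffer per thread; a shared lock-free stack would trade that footprint for some synchronization overhead.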
(I'm getting some build failures from the FuzzFixes tests -- some are timing out. Those have been sensitive to the CPU allocation on the workers, so I'm not sure if it's just GitHub Actions running a bit slower, or if these changes caused any slowness. I don't believe it's the latter, as my local perf tests show improvements, but we need to dig in.)
(The Windows builds are failing - it looks like for HTTP fetches the connection length is not getting fully read in some cases. Need to investigate.)