Implement buffer recycling for CharacterReader
Before
Benchmark Mode Cnt Score Error Units
JMHBenchmark.benchmarkSmall avgt 5 7.382 ± 0.940 us/op
JMHBenchmark.benchmarkSmall:·gc.alloc.rate avgt 5 8572.806 ± 1082.658 MB/sec
JMHBenchmark.benchmarkSmall:·gc.alloc.rate.norm avgt 5 72952.001 ± 0.001 B/op
JMHBenchmark.benchmarkSmall:·gc.churn.G1_Eden_Space avgt 5 8776.474 ± 1059.021 MB/sec
JMHBenchmark.benchmarkSmall:·gc.churn.G1_Eden_Space.norm avgt 5 74688.242 ± 767.418 B/op
JMHBenchmark.benchmarkSmall:·gc.churn.G1_Old_Gen avgt 5 0.195 ± 0.016 MB/sec
JMHBenchmark.benchmarkSmall:·gc.churn.G1_Old_Gen.norm avgt 5 1.664 ± 0.174 B/op
JMHBenchmark.benchmarkSmall:·gc.count avgt 5 523.000 counts
JMHBenchmark.benchmarkSmall:·gc.time avgt 5 761.000 ms
JMHBenchmark.benchmarkMedium avgt 5 29.383 ± 1.479 us/op
JMHBenchmark.benchmarkMedium:·gc.alloc.rate avgt 5 2531.961 ± 127.589 MB/sec
JMHBenchmark.benchmarkMedium:·gc.alloc.rate.norm avgt 5 85824.002 ± 0.001 B/op
JMHBenchmark.benchmarkMedium:·gc.churn.G1_Eden_Space avgt 5 2522.477 ± 106.173 MB/sec
JMHBenchmark.benchmarkMedium:·gc.churn.G1_Eden_Space.norm avgt 5 85506.478 ± 2628.227 B/op
JMHBenchmark.benchmarkMedium:·gc.churn.G1_Old_Gen avgt 5 0.207 ± 0.036 MB/sec
JMHBenchmark.benchmarkMedium:·gc.churn.G1_Old_Gen.norm avgt 5 7.031 ± 1.401 B/op
JMHBenchmark.benchmarkMedium:·gc.count avgt 5 372.000 counts
JMHBenchmark.benchmarkMedium:·gc.time avgt 5 371.000 ms
After
Benchmark Mode Cnt Score Error Units
JMHBenchmark.benchmarkSmall avgt 5 2.038 ± 0.102 us/op
JMHBenchmark.benchmarkSmall:·gc.alloc.rate avgt 5 3205.214 ± 160.609 MB/sec
JMHBenchmark.benchmarkSmall:·gc.alloc.rate.norm avgt 5 7536.000 ± 0.001 B/op
JMHBenchmark.benchmarkSmall:·gc.churn.G1_Eden_Space avgt 5 3175.262 ± 132.074 MB/sec
JMHBenchmark.benchmarkSmall:·gc.churn.G1_Eden_Space.norm avgt 5 7465.807 ± 139.856 B/op
JMHBenchmark.benchmarkSmall:·gc.churn.G1_Old_Gen avgt 5 0.194 ± 0.016 MB/sec
JMHBenchmark.benchmarkSmall:·gc.churn.G1_Old_Gen.norm avgt 5 0.455 ± 0.038 B/op
JMHBenchmark.benchmarkSmall:·gc.count avgt 5 439.000 counts
JMHBenchmark.benchmarkSmall:·gc.time avgt 5 397.000 ms
JMHBenchmark.benchmarkMedium avgt 5 21.085 ± 1.112 us/op
JMHBenchmark.benchmarkMedium:·gc.alloc.rate avgt 5 833.438 ± 44.777 MB/sec
JMHBenchmark.benchmarkMedium:·gc.alloc.rate.norm avgt 5 20272.002 ± 0.001 B/op
JMHBenchmark.benchmarkMedium:·gc.churn.G1_Eden_Space avgt 5 827.537 ± 70.651 MB/sec
JMHBenchmark.benchmarkMedium:·gc.churn.G1_Eden_Space.norm avgt 5 20126.922 ± 783.745 B/op
JMHBenchmark.benchmarkMedium:·gc.churn.G1_Old_Gen avgt 5 0.206 ± 0.016 MB/sec
JMHBenchmark.benchmarkMedium:·gc.churn.G1_Old_Gen.norm avgt 5 5.021 ± 0.162 B/op
JMHBenchmark.benchmarkMedium:·gc.count avgt 5 268.000 counts
JMHBenchmark.benchmarkMedium:·gc.time avgt 5 193.000 ms
Conclusion
This changeset significantly improves the memory efficiency of jsoup, which translates into substantial performance gains.
Good stuff, thanks! I may move some things around (not sure if the recycler should be a public contract API, etc).
Enjoy your vacation :)
The next big thing is improving the efficiency of the InputStream-based APIs:
```java
@Benchmark
public void benchmarkMediumInputStream(Blackhole bh) throws Exception {
    bh.consume(Jsoup.parse(new ByteArrayInputStream(CONTENT_MEDIUM_AS_BYTES),
        StandardCharsets.UTF_8.name(), ""));
}
```
Yields
Benchmark Mode Cnt Score Error Units
JMHBenchmark.benchmarkMediumInputStream avgt 5 35.285 ± 1.339 us/op
JMHBenchmark.benchmarkMediumInputStream:·gc.alloc.rate avgt 5 3504.566 ± 133.319 MB/sec
JMHBenchmark.benchmarkMediumInputStream:·gc.alloc.rate.norm avgt 5 142656.003 ± 0.001 B/op
JMHBenchmark.benchmarkMediumInputStream:·gc.churn.G1_Eden_Space avgt 5 3497.228 ± 134.669 MB/sec
JMHBenchmark.benchmarkMediumInputStream:·gc.churn.G1_Eden_Space.norm avgt 5 142359.431 ± 3459.922 B/op
JMHBenchmark.benchmarkMediumInputStream:·gc.churn.G1_Old_Gen avgt 5 0.379 ± 0.356 MB/sec
JMHBenchmark.benchmarkMediumInputStream:·gc.churn.G1_Old_Gen.norm avgt 5 15.420 ± 13.903 B/op
JMHBenchmark.benchmarkMediumInputStream:·gc.count avgt 5 416.000 counts
JMHBenchmark.benchmarkMediumInputStream:·gc.time avgt 5 423.000 ms
Which, memory-wise, is terrible compared to the String version... I'd be better off creating the String prior to parsing rather than supplying the InputStream....
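For comparison, draining the stream into a String up front might look roughly like this. `readFully` is a hypothetical helper sketched here for illustration, not a jsoup API:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReadAllDemo {
    // Drain an InputStream into a String so the parser can take the
    // cheaper String-based path (hypothetical helper, not jsoup API).
    static String readFully(InputStream in, Charset cs) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), cs);
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("<p>hi</p>".getBytes(StandardCharsets.UTF_8));
        String html = readFully(in, StandardCharsets.UTF_8);
        // then: Jsoup.parse(html, "") instead of the InputStream overload
        System.out.println(html);
    }
}
```

The tradeoff is holding the whole document in memory at once, but per the numbers above that is what the InputStream path effectively does anyway, with extra intermediate allocations on top.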
Benchmark Mode Cnt Score Error Units
JMHBenchmark.benchmarkMedium avgt 5 20.314 ± 1.621 us/op
JMHBenchmark.benchmarkMedium:·gc.alloc.rate avgt 5 654.901 ± 51.181 MB/sec
JMHBenchmark.benchmarkMedium:·gc.alloc.rate.norm avgt 5 15344.002 ± 0.001 B/op
JMHBenchmark.benchmarkMedium:·gc.churn.G1_Eden_Space avgt 5 650.845 ± 57.968 MB/sec
JMHBenchmark.benchmarkMedium:·gc.churn.G1_Eden_Space.norm avgt 5 15248.663 ± 495.411 B/op
JMHBenchmark.benchmarkMedium:·gc.churn.G1_Old_Gen avgt 5 0.193 ± 0.031 MB/sec
JMHBenchmark.benchmarkMedium:·gc.churn.G1_Old_Gen.norm avgt 5 4.533 ± 0.927 B/op
JMHBenchmark.benchmarkMedium:·gc.count avgt 5 244.000 counts
JMHBenchmark.benchmarkMedium:·gc.time avgt 5 174.000 ms
JMHBenchmark.benchmarkMediumInputStream avgt 5 32.855 ± 1.372 us/op
JMHBenchmark.benchmarkMediumInputStream:·gc.alloc.rate avgt 5 3633.766 ± 150.956 MB/sec
JMHBenchmark.benchmarkMediumInputStream:·gc.alloc.rate.norm avgt 5 137728.003 ± 0.001 B/op
JMHBenchmark.benchmarkMediumInputStream:·gc.churn.G1_Eden_Space avgt 5 3632.474 ± 189.650 MB/sec
JMHBenchmark.benchmarkMediumInputStream:·gc.churn.G1_Eden_Space.norm avgt 5 137676.632 ± 2596.732 B/op
JMHBenchmark.benchmarkMediumInputStream:·gc.churn.G1_Old_Gen avgt 5 0.339 ± 0.368 MB/sec
JMHBenchmark.benchmarkMediumInputStream:·gc.churn.G1_Old_Gen.norm avgt 5 12.879 ± 14.070 B/op
JMHBenchmark.benchmarkMediumInputStream:·gc.count avgt 5 420.000 counts
JMHBenchmark.benchmarkMediumInputStream:·gc.time avgt 5 427.000 ms
JMHBenchmark.benchmarkSmall avgt 5 1.656 ± 0.035 us/op
JMHBenchmark.benchmarkSmall:·gc.alloc.rate avgt 5 1699.708 ± 36.050 MB/sec
JMHBenchmark.benchmarkSmall:·gc.alloc.rate.norm avgt 5 3248.000 ± 0.001 B/op
JMHBenchmark.benchmarkSmall:·gc.churn.G1_Eden_Space avgt 5 1681.110 ± 63.848 MB/sec
JMHBenchmark.benchmarkSmall:·gc.churn.G1_Eden_Space.norm avgt 5 3212.422 ± 74.478 B/op
JMHBenchmark.benchmarkSmall:·gc.churn.G1_Old_Gen avgt 5 0.186 ± 0.015 MB/sec
JMHBenchmark.benchmarkSmall:·gc.churn.G1_Old_Gen.norm avgt 5 0.355 ± 0.034 B/op
JMHBenchmark.benchmarkSmall:·gc.count avgt 5 337.000 counts
JMHBenchmark.benchmarkSmall:·gc.time avgt 5 270.000 ms
(Force-pushed to resolve the conflict)
Hello @jhy , any status on this work?
Also, would it make sense to add a JVM system property to turn off the pooling behavior if it's not desired by the end user?
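For instance, a flag read once at class-load time would keep the per-call check free. The property name `jsoup.pool.enabled` below is purely hypothetical; jsoup defines no such property today:

```java
public class PoolToggleDemo {
    // Hypothetical opt-out flag; defaults to pooling on when unset.
    static final boolean POOLING_ENABLED =
        Boolean.parseBoolean(System.getProperty("jsoup.pool.enabled", "true"));

    public static void main(String[] args) {
        // Run with -Djsoup.pool.enabled=false to disable (hypothetically).
        System.out.println("pooling enabled: " + POOLING_ENABLED);
    }
}
```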
Hello @jhy
What can I do to help move this topic forward?
My apologies for the radio silence @chibenwa, I was pulled away from this.
I have a refactoring of this to make the recyclers more generic, and will push an update for your review shortly.
Apologies for the force-push, but I wanted to bring the branch current to HEAD.
Here's a WIP of the refactoring to use a generic pool approach. I haven't profiled it much yet. I'd be glad for your review and profiling.
A couple points:
- I temporarily disabled the benchmark just on committing these, but please re-enable for your testing. When we land this, I think I will remove the benchmark from the commit, and move out to a distinct repo (so that versions of jsoup can be compared more readily)
- I made the char buffer size constant, instead of allocating smaller ephemeral buffers when the input was smaller than the buffer size. I figure that because we are using a constant pool, it is better to recycle somewhat larger objects than to thrash on smaller ones
- I want to make the DataUtil buffers use this too - I think I can just remove the BufferedReader (which internally allocates a new buffer), because the CharacterReader implements its own buffer. Having two is redundant
- It would be good if the small HTML test had more varied inputs. Otherwise the String Cache in CharacterReader is always going to have an artificially high hit rate, which might bias the performance tests.
- I don't know that we need to make these pools optional via a tuning configuration. I would prefer to just get it to work well for all circumstances. Just one less thing for the developer to worry about.
- In contrast to the last point however, I wonder if we should make the StringCache optional, or automatically switch it on/off depending on the platform (server or Android). I have found during perf testing that the overhead of maintaining it when not on Android is worse than the allocation cost.
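The "generic pool approach" described above could be sketched roughly as a thread-local, soft-referenced slot, so a pooled buffer is reused across parses on the same thread but can still be reclaimed by the GC under memory pressure. The names here (`SimplePool`, `borrow`, `release`) are illustrative, not the actual jsoup implementation:

```java
import java.lang.ref.SoftReference;
import java.util.function.Supplier;

// Rough sketch of a generic recycler: one cached object per thread,
// held softly so the GC can reclaim it if memory gets tight.
final class SimplePool<T> {
    private final ThreadLocal<SoftReference<T>> slot = new ThreadLocal<>();
    private final Supplier<T> factory;

    SimplePool(Supplier<T> factory) { this.factory = factory; }

    // Hand out the cached object if the GC hasn't cleared it, else allocate.
    T borrow() {
        SoftReference<T> ref = slot.get();
        T obj = (ref != null) ? ref.get() : null;
        if (obj == null) obj = factory.get();
        slot.set(null); // the caller owns it until release()
        return obj;
    }

    // Return the object for reuse by the next borrow() on this thread.
    void release(T obj) {
        slot.set(new SoftReference<>(obj));
    }
}

public class PoolDemo {
    public static void main(String[] args) {
        SimplePool<char[]> bufPool = new SimplePool<>(() -> new char[32 * 1024]);
        char[] a = bufPool.borrow();
        bufPool.release(a);
        char[] b = bufPool.borrow();
        // Usually the same instance comes back, unless the GC cleared it.
        System.out.println("recycled: " + (a == b));
    }
}
```

A ThreadLocal slot sidesteps contention entirely, at the cost of one cached buffer per thread; a shared lock-free stack would trade that footprint for some synchronization overhead.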
(I'm getting some build failures from the FuzzFixes tests -- some are timing out. Those have been sensitive to the CPU allocation on the workers, so I'm not sure if it's just GitHub Actions running a bit slower, or if these changes caused any slowness. I don't believe it's the latter, as my local perf tests show improvements, but we need to dig in.)
(The Windows builds are failing - it looks like for HTTP fetches the connection length is not getting fully read in some cases. Need to investigate.)