More MergeStore benchmark
- Added the putBytes4 benchmark, which corresponds to StringBuilder appendNull
- Optimized the putChars4/setInt/setLong series of benchmarks to reduce extra overhead and more accurately reflect performance differences.
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [ ] Commit message must refer to an issue
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/21659/head:pull/21659
$ git checkout pull/21659
Update a local copy of the PR:
$ git checkout pull/21659
$ git pull https://git.openjdk.org/jdk.git pull/21659/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 21659
View PR using the GUI difftool:
$ git pr show -t 21659
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/21659.diff
:wave: Welcome back swen! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
@wenshao This change now passes all automated pre-integration checks.
ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.
After integration, the commit message for the final commit will be:
8343629: More MergeStore benchmark
Reviewed-by: epeter
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.
At the time when this comment was updated there had been 899 new commits pushed to the master branch:
- b0c935c03ebb34f20f15dd8c7616c6c4526073cd: 8347047: Cleanup action passed to MemorySegment::reinterpret keeps old segment alive
- bcefab5e55d4527a38dcab550581a734c1564608: 8342468: Improve API documentation for java.lang.classfile.constantpool
- 40f0a398fa9b1b39a43640973eaffb041bb7b63d: 8343342: java/io/File/GetXSpace.java fails on Windows with CD-ROM drive
- 021c476409c52c65cc7b40516d81dedef040fe83: 8347148: [BACKOUT] AccessFlags can be u2 in metadata
- ddb58819640dc8f1930d243d6eb07ce88ef79b22: 8329549: Remove FORMAT64_MODIFIER
- 098afc8b7d0e7caa82999fb9d4e319ea8aed09a1: 8339113: AccessFlags can be u2 in metadata
- e413fc643c4a58e3c46d81025c3ac9fbf89db4b9: 8347127: CTW fails to build after JDK-8334733
- 9702accdd9a25e05628d470bf248edd5d80c0c4d: 8175709: DateTimeFormatterBuilder.appendZoneId() has misleading JavaDoc
- 030149fec4f175e5571e053fa56d2921d95c6b13: 8334644: Automate javax/print/attribute/PageRangesException.java
- c8a9dd3a027781d006850c028714a62903c487d5: 8346609: Improve MemorySegment.toString
- ... and 889 more: https://git.openjdk.org/jdk/compare/8cb122119409fb13b4b9b2e74851207734d5c198...master
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.
➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.
@wenshao The following label will be automatically applied to this pull request:
hotspot-compiler
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
Webrevs
- 04: Full - Incremental (dce66dae)
- 03: Full - Incremental (2e88b024)
- 02: Full - Incremental (4293ced9)
- 01: Full (c6f05f20)
- 00: Full (2a572c58)
@eme64
Below are the performance numbers running under AMD EPYC™ Genoa (x64), where the scenario of putBytes4GetBytes is
"null".getBytes(0, 4, bytes4, off);
Is it possible to do MergeStore in this scenario?
Benchmark Mode Cnt Score Error Units
MergeStoreBench.getCharB avgt 5 6038.532 ± 533.982 ns/op
MergeStoreBench.getCharBU avgt 5 4923.182 ± 163.872 ns/op
MergeStoreBench.getCharBV avgt 5 3111.268 ± 84.077 ns/op
MergeStoreBench.getCharC avgt 5 2245.270 ± 33.559 ns/op
MergeStoreBench.getCharL avgt 5 6109.519 ± 249.512 ns/op
MergeStoreBench.getCharLU avgt 5 4552.425 ± 161.933 ns/op
MergeStoreBench.getCharLV avgt 5 2239.866 ± 91.853 ns/op
MergeStoreBench.getIntB avgt 5 8163.035 ± 137.565 ns/op
MergeStoreBench.getIntBU avgt 5 9136.199 ± 259.491 ns/op
MergeStoreBench.getIntBV avgt 5 314.123 ± 4.510 ns/op
MergeStoreBench.getIntL avgt 5 7879.011 ± 10.759 ns/op
MergeStoreBench.getIntLU avgt 5 8968.715 ± 268.414 ns/op
MergeStoreBench.getIntLV avgt 5 2228.228 ± 1.510 ns/op
MergeStoreBench.getIntRB avgt 5 8618.141 ± 22.545 ns/op
MergeStoreBench.getIntRBU avgt 5 11239.977 ± 447.754 ns/op
MergeStoreBench.getIntRL avgt 5 9060.754 ± 236.147 ns/op
MergeStoreBench.getIntRLU avgt 5 9365.050 ± 154.357 ns/op
MergeStoreBench.getIntRU avgt 5 2540.704 ± 75.198 ns/op
MergeStoreBench.getIntU avgt 5 2508.954 ± 74.999 ns/op
MergeStoreBench.getLongB avgt 5 24940.668 ± 16857.311 ns/op
MergeStoreBench.getLongBU avgt 5 14126.468 ± 329.241 ns/op
MergeStoreBench.getLongBV avgt 5 607.128 ± 23.775 ns/op
MergeStoreBench.getLongL avgt 5 25519.679 ± 15393.727 ns/op
MergeStoreBench.getLongLU avgt 5 14598.271 ± 481.158 ns/op
MergeStoreBench.getLongLV avgt 5 2227.659 ± 16.334 ns/op
MergeStoreBench.getLongRB avgt 5 25158.839 ± 18209.451 ns/op
MergeStoreBench.getLongRBU avgt 5 14005.082 ± 208.154 ns/op
MergeStoreBench.getLongRL avgt 5 25303.319 ± 14775.524 ns/op
MergeStoreBench.getLongRLU avgt 5 14481.847 ± 309.623 ns/op
MergeStoreBench.getLongRU avgt 5 3065.744 ± 15.405 ns/op
MergeStoreBench.getLongU avgt 5 3048.522 ± 0.704 ns/op
MergeStoreBench.putBytes4 avgt 5 933.283 ± 6.197 ns/op
MergeStoreBench.putBytes4GetBytes avgt 5 5917.932 ± 199.901 ns/op
MergeStoreBench.putBytes4U avgt 5 944.097 ± 25.902 ns/op
MergeStoreBench.putBytes4X avgt 5 944.714 ± 18.924 ns/op
MergeStoreBench.putChars4B avgt 5 5679.262 ± 154.030 ns/op
MergeStoreBench.putChars4BU avgt 5 1143.133 ± 4.250 ns/op
MergeStoreBench.putChars4BV avgt 5 4530.941 ± 124.318 ns/op
MergeStoreBench.putChars4C avgt 5 1138.541 ± 27.843 ns/op
MergeStoreBench.putChars4L avgt 5 5647.885 ± 112.363 ns/op
MergeStoreBench.putChars4LU avgt 5 1142.501 ± 4.400 ns/op
MergeStoreBench.putChars4LV avgt 5 1143.770 ± 3.435 ns/op
MergeStoreBench.putChars4S avgt 5 1141.919 ± 36.528 ns/op
MergeStoreBench.setCharBS avgt 5 6114.143 ± 144.826 ns/op
MergeStoreBench.setCharBV avgt 5 3607.599 ± 87.720 ns/op
MergeStoreBench.setCharC avgt 5 4510.196 ± 5.445 ns/op
MergeStoreBench.setCharLS avgt 5 5641.424 ± 195.167 ns/op
MergeStoreBench.setCharLV avgt 5 2267.712 ± 40.752 ns/op
MergeStoreBench.setIntB avgt 5 8049.368 ± 233.618 ns/op
MergeStoreBench.setIntBU avgt 5 18052.279 ± 2428.567 ns/op
MergeStoreBench.setIntBV avgt 5 3287.905 ± 63.375 ns/op
MergeStoreBench.setIntL avgt 5 2135.887 ± 62.601 ns/op
MergeStoreBench.setIntLU avgt 5 4795.636 ± 74.974 ns/op
MergeStoreBench.setIntLV avgt 5 2154.363 ± 81.324 ns/op
MergeStoreBench.setIntRB avgt 5 13895.941 ± 7981.782 ns/op
MergeStoreBench.setIntRBU avgt 5 14756.267 ± 1585.571 ns/op
MergeStoreBench.setIntRL avgt 5 3284.792 ± 37.939 ns/op
MergeStoreBench.setIntRLU avgt 5 5958.555 ± 27.404 ns/op
MergeStoreBench.setIntRU avgt 5 5983.119 ± 79.627 ns/op
MergeStoreBench.setIntU avgt 5 4848.655 ± 168.466 ns/op
MergeStoreBench.setLongB avgt 5 31871.401 ± 1233.822 ns/op
MergeStoreBench.setLongBU avgt 5 25704.975 ± 5105.792 ns/op
MergeStoreBench.setLongBV avgt 5 2199.367 ± 69.511 ns/op
MergeStoreBench.setLongL avgt 5 5486.926 ± 30.874 ns/op
MergeStoreBench.setLongLU avgt 5 4503.212 ± 81.635 ns/op
MergeStoreBench.setLongLV avgt 5 2144.943 ± 38.944 ns/op
MergeStoreBench.setLongRB avgt 5 30338.353 ± 1631.512 ns/op
MergeStoreBench.setLongRBU avgt 5 25025.442 ± 2690.138 ns/op
MergeStoreBench.setLongRL avgt 5 4553.245 ± 128.721 ns/op
MergeStoreBench.setLongRLU avgt 5 4793.427 ± 1.474 ns/op
MergeStoreBench.setLongRU avgt 5 4803.963 ± 74.017 ns/op
MergeStoreBench.setLongU avgt 5 4564.326 ± 146.283 ns/op
"null".getBytes(0, 4, bytes4, off);Is it possible to do MergeStore in this scenario?
I don't know. What do the logs say? And what does it currently compile down to, i.e. what assembly instructions?
Otherwise I think this update seems reasonable.
It would be nice if you could do some summary / explanation: which cases do still not optimize, and why?
For that, it would be helpful if you had a run with, and one without MergeStores enabled - then we can easily compare the performance!
You can find an example of how to do that easily here: https://github.com/openjdk/jdk/pull/19970/files#diff-9072c369f5b541ef9fca3ad8320aa59e88cc72f203c03da58100b1d111ffc324R746-R749
"null".getBytes(0, 4, bytes4, off);Is it possible to do MergeStore in this scenario?
I don't know. What do the logs say? And what does it currently compile down to, i.e. what assembly instructions?
Otherwise I think this update seems reasonable.
My thinking is this:
StringBuilder buf = new StringBuilder();
// ...
buf.append("null");
The calling path is as follows:
AbstractStringBuilder::append -> AbstractStringBuilder::putStringAt -> String::getBytes(byte[], int, byte) -> System::arraycopy
In this scenario, if System::arraycopy can be optimized to use putInt or putLong, performance can be improved.
It is similar in the String concatenation scenario
String f(int i) {
return "abcd" + i;
}
Here StringConcatHelper::prepend(int, byte, byte[], String, String) is called, and then String::getBytes(byte[], int, byte) -> System::arraycopy
package java.lang;
class StringConcatHelper {
static int prepend(int index, byte coder, byte[] buf, String value, String prefix) {
index -= value.length();
if (coder == String.LATIN1) {
value.getBytes(buf, index, String.LATIN1);
index -= prefix.length();
prefix.getBytes(buf, index, String.LATIN1);
} else {
value.getBytes(buf, index, String.UTF16);
index -= prefix.length();
prefix.getBytes(buf, index, String.UTF16);
}
return index;
}
}
Here is similar to the above, can we optimize "abcd".getBytes to putInt or putLong?
In summary, can we optimize the System::arraycopy of a stable byte[] with a length of 4 to putInt?
<<Hide useless information>>
@eme64 Why is there no noticeable difference in the performance of +/-MergeStores
What did you do to find out yourself? Did you use the trace flags to see if there is a difference in what is optimized / the output assembly code?
After jvm_args is set correctly, +/-MergeStores has a significant performance difference.
List of tests with significant performance improvements
The putBytes series and LittleEndian-based tests show that +MergeStores has a significant performance improvement under x64.
| -MergeStores | +MergeStores | delta | |
|---|---|---|---|
| MergeStoreBench.putBytes4 | 4475.891 | 929.327 | 381.63% |
| MergeStoreBench.putBytes4U | 4479.133 | 928.502 | 382.40% |
| MergeStoreBench.putBytes4X | 4477.133 | 929.183 | 381.84% |
| MergeStoreBench.putChars4B | 9008.350 | 5638.550 | 59.76% |
| MergeStoreBench.putChars4BU | 8961.671 | 1144.479 | 683.03% |
| MergeStoreBench.putChars4C | 4485.308 | 1133.473 | 295.71% |
| MergeStoreBench.putChars4L | 9013.570 | 5640.893 | 59.79% |
| MergeStoreBench.putChars4LU | 8957.625 | 1142.796 | 683.83% |
| MergeStoreBench.putChars4LV | 4488.698 | 1134.303 | 295.72% |
| MergeStoreBench.putChars4S | 4485.836 | 1133.430 | 295.78% |
| MergeStoreBench.setIntL | 15430.183 | 2113.544 | 630.06% |
| MergeStoreBench.setIntLU | 17361.730 | 4783.040 | 262.99% |
| MergeStoreBench.setIntRL | 16525.068 | 3244.126 | 409.38% |
| MergeStoreBench.setIntRLU | 14401.071 | 5930.149 | 142.85% |
| MergeStoreBench.setLongL | 31353.713 | 5405.189 | 480.07% |
| MergeStoreBench.setLongLU | 26113.756 | 4287.166 | 509.11% |
| MergeStoreBench.setLongRL | 27232.898 | 4523.658 | 502.01% |
| MergeStoreBench.setLongRLU | 26196.973 | 4798.177 | 445.98% |
| MergeStoreBench.setLongU | 4500.659 | 4271.225 | 5.37% |
VarHandle is the fastest
I found that the test using VarHandler is faster than the others, whether it is BigEndian or LittleEndian VarHandler is the fastest.
| -MergeStores | +MergeStores | delta | |
|---|---|---|---|
| MergeStoreBench.setCharBS | 6088.826 | 6085.857 | 0.05% |
| MergeStoreBench.setCharBV | 3596.210 | 3595.236 | 0.03% |
| MergeStoreBench.setCharLV | 2248.493 | 2245.939 | 0.11% |
| MergeStoreBench.setIntBV | 3239.985 | 3227.997 | 0.37% |
| MergeStoreBench.setIntLV | 2128.975 | 2126.187 | 0.13% |
| MergeStoreBench.setLongBV | 2168.387 | 2165.313 | 0.14% |
| MergeStoreBench.setLongLV | 2048.737 | 2116.054 | -3.18% |
List of tests that were not improved
The speed of using BigEndian and System.arrayCopy is not improved, as shown below
| -MergeStores | +MergeStores | delta | comment | |
|---|---|---|---|---|
| MergeStoreBench.putBytes4GetBytes | 5880.164 | 5883.995 | -0.07% | Merge Copy |
| MergeStoreBench.putChars4BV | 4488.270 | 4486.457 | 0.04% | BigEndian |
| MergeStoreBench.setCharC | 4519.981 | 4471.174 | 1.09% | BigEndian |
| MergeStoreBench.setCharLS | 5619.414 | 5618.239 | 0.02% | BigEndian |
| MergeStoreBench.setIntB | 8039.705 | 8045.113 | -0.07% | BigEndian |
| MergeStoreBench.setIntBU | 17884.223 | 17764.347 | 0.67% | BigEndian |
| MergeStoreBench.setIntRB | 13786.186 | 13815.759 | -0.21% | Reverse BigEndian |
| MergeStoreBench.setIntRBU | 14747.463 | 14771.017 | -0.16% | Reverse BigEndian |
| MergeStoreBench.setIntRU | 5898.169 | 5875.589 | 0.38% | Reverse BigEndian |
| MergeStoreBench.setIntU | 4805.170 | 4784.162 | 0.44% | BigEndian |
| MergeStoreBench.setLongB | 31674.058 | 31662.483 | 0.04% | BigEndian |
| MergeStoreBench.setLongBU | 25696.702 | 25674.394 | 0.09% | BigEndian |
| MergeStoreBench.setLongRB | 29901.778 | 29909.501 | -0.03% | Reverse BigEndian |
| MergeStoreBench.setLongRBU | 24945.914 | 25005.171 | -0.24% | Reverse BigEndian |
| MergeStoreBench.setLongRU | 4797.817 | 4795.018 | 0.06% | Reverse BigEndian |
Full tests
| -MergeStores | +MergeStores | delta | |
|---|---|---|---|
| MergeStoreBench.putBytes4 | 4475.891 | 929.327 | 381.63% |
| MergeStoreBench.putBytes4GetBytes | 5880.164 | 5883.995 | -0.07% |
| MergeStoreBench.putBytes4U | 4479.133 | 928.502 | 382.40% |
| MergeStoreBench.putBytes4X | 4477.133 | 929.183 | 381.84% |
| MergeStoreBench.putChars4B | 9008.350 | 5638.550 | 59.76% |
| MergeStoreBench.putChars4BU | 8961.671 | 1144.479 | 683.03% |
| MergeStoreBench.putChars4BV | 4488.270 | 4486.457 | 0.04% |
| MergeStoreBench.putChars4C | 4485.308 | 1133.473 | 295.71% |
| MergeStoreBench.putChars4L | 9013.570 | 5640.893 | 59.79% |
| MergeStoreBench.putChars4LU | 8957.625 | 1142.796 | 683.83% |
| MergeStoreBench.putChars4LV | 4488.698 | 1134.303 | 295.72% |
| MergeStoreBench.putChars4S | 4485.836 | 1133.430 | 295.78% |
| MergeStoreBench.setCharBS | 6088.826 | 6085.857 | 0.05% |
| MergeStoreBench.setCharBV | 3596.210 | 3595.236 | 0.03% |
| MergeStoreBench.setCharC | 4519.981 | 4471.174 | 1.09% |
| MergeStoreBench.setCharLS | 5619.414 | 5618.239 | 0.02% |
| MergeStoreBench.setCharLV | 2248.493 | 2245.939 | 0.11% |
| MergeStoreBench.setIntB | 8039.705 | 8045.113 | -0.07% |
| MergeStoreBench.setIntBU | 17884.223 | 17764.347 | 0.67% |
| MergeStoreBench.setIntBV | 3239.985 | 3227.997 | 0.37% |
| MergeStoreBench.setIntL | 15430.183 | 2113.544 | 630.06% |
| MergeStoreBench.setIntLU | 17361.730 | 4783.040 | 262.99% |
| MergeStoreBench.setIntLV | 2128.975 | 2126.187 | 0.13% |
| MergeStoreBench.setIntRB | 13786.186 | 13815.759 | -0.21% |
| MergeStoreBench.setIntRBU | 14747.463 | 14771.017 | -0.16% |
| MergeStoreBench.setIntRL | 16525.068 | 3244.126 | 409.38% |
| MergeStoreBench.setIntRLU | 14401.071 | 5930.149 | 142.85% |
| MergeStoreBench.setIntRU | 5898.169 | 5875.589 | 0.38% |
| MergeStoreBench.setIntU | 4805.170 | 4784.162 | 0.44% |
| MergeStoreBench.setLongB | 31674.058 | 31662.483 | 0.04% |
| MergeStoreBench.setLongBU | 25696.702 | 25674.394 | 0.09% |
| MergeStoreBench.setLongBV | 2168.387 | 2165.313 | 0.14% |
| MergeStoreBench.setLongL | 31353.713 | 5405.189 | 480.07% |
| MergeStoreBench.setLongLU | 26113.756 | 4287.166 | 509.11% |
| MergeStoreBench.setLongLV | 2048.737 | 2116.054 | -3.18% |
| MergeStoreBench.setLongRB | 29901.778 | 29909.501 | -0.03% |
| MergeStoreBench.setLongRBU | 24945.914 | 25005.171 | -0.24% |
| MergeStoreBench.setLongRL | 27232.898 | 4523.658 | 502.01% |
| MergeStoreBench.setLongRLU | 26196.973 | 4798.177 | 445.98% |
| MergeStoreBench.setLongRU | 4797.817 | 4795.018 | 0.06% |
| MergeStoreBench.setLongU | 4500.659 | 4271.225 | 5.37% |
@eme64 Are there plans to support MergeLoad, and big-endian MergeStore on little-endian machines?
@wenshao I'll look at your results later.
@eme64 Are there plans to support MergeLoad, and big-endian MergeStore on little-endian machines?
These are all good ideas, and I already discussed it offline with @cl4es . I have lots of tasks I'm working on, and this is on the lowest tier of priorities for me personally. But if someone else wants to jump on that, then I can coach and review.
We could also be interested in "MergeCopy", i.e. load->store patterns. Maybe this just ends up being SuperWord again, but this time for straight line code.
@wenshao Ah. I only just realized it: you have a lot of get benchmarks... they don't really belong to MergeStores... if anything you could put them in a separate MergeLoads benchmark!
Also: now we have lots of data here. But data alone is kind of pointless. We need analysis to see what patterns and why they get speedups.
@eme64 I have updated the benchmark numbers above, removing the getXXX part. I have also added some analysis. I found that the test using VarHandle is the fastest, whether it is BigEndian or LittleEndian. All the BigEndian tests are not MergeStored, including the Reverse combination scenario.
@wenshao This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
/integrate
Going to push as commit b741f3fe5b54755d19c5abeca76fdceeccafd448.
Since your change was applied there have been 899 commits pushed to the master branch:
- b0c935c03ebb34f20f15dd8c7616c6c4526073cd: 8347047: Cleanup action passed to MemorySegment::reinterpret keeps old segment alive
- bcefab5e55d4527a38dcab550581a734c1564608: 8342468: Improve API documentation for java.lang.classfile.constantpool
- 40f0a398fa9b1b39a43640973eaffb041bb7b63d: 8343342: java/io/File/GetXSpace.java fails on Windows with CD-ROM drive
- 021c476409c52c65cc7b40516d81dedef040fe83: 8347148: [BACKOUT] AccessFlags can be u2 in metadata
- ddb58819640dc8f1930d243d6eb07ce88ef79b22: 8329549: Remove FORMAT64_MODIFIER
- 098afc8b7d0e7caa82999fb9d4e319ea8aed09a1: 8339113: AccessFlags can be u2 in metadata
- e413fc643c4a58e3c46d81025c3ac9fbf89db4b9: 8347127: CTW fails to build after JDK-8334733
- 9702accdd9a25e05628d470bf248edd5d80c0c4d: 8175709: DateTimeFormatterBuilder.appendZoneId() has misleading JavaDoc
- 030149fec4f175e5571e053fa56d2921d95c6b13: 8334644: Automate javax/print/attribute/PageRangesException.java
- c8a9dd3a027781d006850c028714a62903c487d5: 8346609: Improve MemorySegment.toString
- ... and 889 more: https://git.openjdk.org/jdk/compare/8cb122119409fb13b4b9b2e74851207734d5c198...master
Your commit was automatically rebased without conflicts.
@wenshao Pushed as commit b741f3fe5b54755d19c5abeca76fdceeccafd448.
:bulb: You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
Thanks, Emanuel Peter.
These are all good ideas, and I already discussed it offline with @cl4es . I have lots of tasks I'm working on, and this is on the lowest tier of priorities for me personally. But if someone else wants to jump on that, then I can coach and review.
We could also be interested in "MergeCopy", i.e. load->store patterns. Maybe this just ends up being SuperWord again, but this time for straight line code.
PR #23030 has been submitted to add support for BigEndian MergeStore