jdk More MergeStore benchmark

Added the putBytes4 benchmark, which corresponds to StringBuilder appendNull
Optimized the putChars4/setInt/setLong series of benchmarks to reduce extra overhead and more accurately reflect performance differences.

Progress

[ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
[x] Change must not contain extraneous whitespace
[ ] Commit message must refer to an issue

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/21659/head:pull/21659
$ git checkout pull/21659

Update a local copy of the PR:
$ git checkout pull/21659
$ git pull https://git.openjdk.org/jdk.git pull/21659/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 21659

View PR using the GUI difftool:
$ git pr show -t 21659

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/21659.diff

Oct 23 '24 07:10 wenshao

:wave: Welcome back swen! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

Oct 23 '24 07:10 bridgekeeper[bot]

@wenshao This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8343629: More MergeStore benchmark

Reviewed-by: epeter

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 899 new commits pushed to the master branch:

b0c935c03ebb34f20f15dd8c7616c6c4526073cd: 8347047: Cleanup action passed to MemorySegment::reinterpret keeps old segment alive
bcefab5e55d4527a38dcab550581a734c1564608: 8342468: Improve API documentation for java.lang.classfile.constantpool
40f0a398fa9b1b39a43640973eaffb041bb7b63d: 8343342: java/io/File/GetXSpace.java fails on Windows with CD-ROM drive
021c476409c52c65cc7b40516d81dedef040fe83: 8347148: [BACKOUT] AccessFlags can be u2 in metadata
ddb58819640dc8f1930d243d6eb07ce88ef79b22: 8329549: Remove FORMAT64_MODIFIER
098afc8b7d0e7caa82999fb9d4e319ea8aed09a1: 8339113: AccessFlags can be u2 in metadata
e413fc643c4a58e3c46d81025c3ac9fbf89db4b9: 8347127: CTW fails to build after JDK-8334733
9702accdd9a25e05628d470bf248edd5d80c0c4d: 8175709: DateTimeFormatterBuilder.appendZoneId() has misleading JavaDoc
030149fec4f175e5571e053fa56d2921d95c6b13: 8334644: Automate javax/print/attribute/PageRangesException.java
c8a9dd3a027781d006850c028714a62903c487d5: 8346609: Improve MemorySegment.toString
... and 889 more: https://git.openjdk.org/jdk/compare/8cb122119409fb13b4b9b2e74851207734d5c198...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

Oct 23 '24 07:10 openjdk[bot]

@wenshao The following label will be automatically applied to this pull request:

hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

Oct 23 '24 07:10 openjdk[bot]

Webrevs

Nov 05 '24 15:11 mlbridge[bot]

@eme64

Below are the performance numbers running under AMD EPYC™ Genoa (x64), where the scenario of putBytes4GetBytes is

"null".getBytes(0, 4, bytes4, off);

Is it possible to do MergeStore in this scenario?

Benchmark                          Mode  Cnt      Score       Error  Units
MergeStoreBench.getCharB           avgt    5   6038.532 ±   533.982  ns/op
MergeStoreBench.getCharBU          avgt    5   4923.182 ±   163.872  ns/op
MergeStoreBench.getCharBV          avgt    5   3111.268 ±    84.077  ns/op
MergeStoreBench.getCharC           avgt    5   2245.270 ±    33.559  ns/op
MergeStoreBench.getCharL           avgt    5   6109.519 ±   249.512  ns/op
MergeStoreBench.getCharLU          avgt    5   4552.425 ±   161.933  ns/op
MergeStoreBench.getCharLV          avgt    5   2239.866 ±    91.853  ns/op
MergeStoreBench.getIntB            avgt    5   8163.035 ±   137.565  ns/op
MergeStoreBench.getIntBU           avgt    5   9136.199 ±   259.491  ns/op
MergeStoreBench.getIntBV           avgt    5    314.123 ±     4.510  ns/op
MergeStoreBench.getIntL            avgt    5   7879.011 ±    10.759  ns/op
MergeStoreBench.getIntLU           avgt    5   8968.715 ±   268.414  ns/op
MergeStoreBench.getIntLV           avgt    5   2228.228 ±     1.510  ns/op
MergeStoreBench.getIntRB           avgt    5   8618.141 ±    22.545  ns/op
MergeStoreBench.getIntRBU          avgt    5  11239.977 ±   447.754  ns/op
MergeStoreBench.getIntRL           avgt    5   9060.754 ±   236.147  ns/op
MergeStoreBench.getIntRLU          avgt    5   9365.050 ±   154.357  ns/op
MergeStoreBench.getIntRU           avgt    5   2540.704 ±    75.198  ns/op
MergeStoreBench.getIntU            avgt    5   2508.954 ±    74.999  ns/op
MergeStoreBench.getLongB           avgt    5  24940.668 ± 16857.311  ns/op
MergeStoreBench.getLongBU          avgt    5  14126.468 ±   329.241  ns/op
MergeStoreBench.getLongBV          avgt    5    607.128 ±    23.775  ns/op
MergeStoreBench.getLongL           avgt    5  25519.679 ± 15393.727  ns/op
MergeStoreBench.getLongLU          avgt    5  14598.271 ±   481.158  ns/op
MergeStoreBench.getLongLV          avgt    5   2227.659 ±    16.334  ns/op
MergeStoreBench.getLongRB          avgt    5  25158.839 ± 18209.451  ns/op
MergeStoreBench.getLongRBU         avgt    5  14005.082 ±   208.154  ns/op
MergeStoreBench.getLongRL          avgt    5  25303.319 ± 14775.524  ns/op
MergeStoreBench.getLongRLU         avgt    5  14481.847 ±   309.623  ns/op
MergeStoreBench.getLongRU          avgt    5   3065.744 ±    15.405  ns/op
MergeStoreBench.getLongU           avgt    5   3048.522 ±     0.704  ns/op
MergeStoreBench.putBytes4          avgt    5    933.283 ±     6.197  ns/op
MergeStoreBench.putBytes4GetBytes  avgt    5   5917.932 ±   199.901  ns/op
MergeStoreBench.putBytes4U         avgt    5    944.097 ±    25.902  ns/op
MergeStoreBench.putBytes4X         avgt    5    944.714 ±    18.924  ns/op
MergeStoreBench.putChars4B         avgt    5   5679.262 ±   154.030  ns/op
MergeStoreBench.putChars4BU        avgt    5   1143.133 ±     4.250  ns/op
MergeStoreBench.putChars4BV        avgt    5   4530.941 ±   124.318  ns/op
MergeStoreBench.putChars4C         avgt    5   1138.541 ±    27.843  ns/op
MergeStoreBench.putChars4L         avgt    5   5647.885 ±   112.363  ns/op
MergeStoreBench.putChars4LU        avgt    5   1142.501 ±     4.400  ns/op
MergeStoreBench.putChars4LV        avgt    5   1143.770 ±     3.435  ns/op
MergeStoreBench.putChars4S         avgt    5   1141.919 ±    36.528  ns/op
MergeStoreBench.setCharBS          avgt    5   6114.143 ±   144.826  ns/op
MergeStoreBench.setCharBV          avgt    5   3607.599 ±    87.720  ns/op
MergeStoreBench.setCharC           avgt    5   4510.196 ±     5.445  ns/op
MergeStoreBench.setCharLS          avgt    5   5641.424 ±   195.167  ns/op
MergeStoreBench.setCharLV          avgt    5   2267.712 ±    40.752  ns/op
MergeStoreBench.setIntB            avgt    5   8049.368 ±   233.618  ns/op
MergeStoreBench.setIntBU           avgt    5  18052.279 ±  2428.567  ns/op
MergeStoreBench.setIntBV           avgt    5   3287.905 ±    63.375  ns/op
MergeStoreBench.setIntL            avgt    5   2135.887 ±    62.601  ns/op
MergeStoreBench.setIntLU           avgt    5   4795.636 ±    74.974  ns/op
MergeStoreBench.setIntLV           avgt    5   2154.363 ±    81.324  ns/op
MergeStoreBench.setIntRB           avgt    5  13895.941 ±  7981.782  ns/op
MergeStoreBench.setIntRBU          avgt    5  14756.267 ±  1585.571  ns/op
MergeStoreBench.setIntRL           avgt    5   3284.792 ±    37.939  ns/op
MergeStoreBench.setIntRLU          avgt    5   5958.555 ±    27.404  ns/op
MergeStoreBench.setIntRU           avgt    5   5983.119 ±    79.627  ns/op
MergeStoreBench.setIntU            avgt    5   4848.655 ±   168.466  ns/op
MergeStoreBench.setLongB           avgt    5  31871.401 ±  1233.822  ns/op
MergeStoreBench.setLongBU          avgt    5  25704.975 ±  5105.792  ns/op
MergeStoreBench.setLongBV          avgt    5   2199.367 ±    69.511  ns/op
MergeStoreBench.setLongL           avgt    5   5486.926 ±    30.874  ns/op
MergeStoreBench.setLongLU          avgt    5   4503.212 ±    81.635  ns/op
MergeStoreBench.setLongLV          avgt    5   2144.943 ±    38.944  ns/op
MergeStoreBench.setLongRB          avgt    5  30338.353 ±  1631.512  ns/op
MergeStoreBench.setLongRBU         avgt    5  25025.442 ±  2690.138  ns/op
MergeStoreBench.setLongRL          avgt    5   4553.245 ±   128.721  ns/op
MergeStoreBench.setLongRLU         avgt    5   4793.427 ±     1.474  ns/op
MergeStoreBench.setLongRU          avgt    5   4803.963 ±    74.017  ns/op
MergeStoreBench.setLongU           avgt    5   4564.326 ±   146.283  ns/op

Nov 06 '24 00:11 wenshao

"null".getBytes(0, 4, bytes4, off);
Is it possible to do MergeStore in this scenario?

I don't know. What do the logs say? And what does it currently compile down to, i.e. what assembly instructions?

Otherwise I think this update seems reasonable.

Nov 06 '24 07:11 eme64

It would be nice if you could do some summary / explanation: which cases do still not optimize, and why?

Nov 06 '24 07:11 eme64

For that, it would be helpful if you had a run with, and one without MergeStores enabled - then we can easily compare the performance!

Nov 06 '24 07:11 eme64

You can find an example of how to do that easily here: https://github.com/openjdk/jdk/pull/19970/files#diff-9072c369f5b541ef9fca3ad8320aa59e88cc72f203c03da58100b1d111ffc324R746-R749

Nov 06 '24 07:11 eme64

"null".getBytes(0, 4, bytes4, off);
Is it possible to do MergeStore in this scenario?
I don't know. What do the logs say? And what does it currently compile down to, i.e. what assembly instructions?

Otherwise I think this update seems reasonable.

My thinking is this:

StringBuilder buf = new StringBuilder();
// ...
buf.append("null");

The calling path is as follows:

AbstractStringBuilder::append -> AbstractStringBuilder::putStringAt -> String::getBytes(byte[], int, byte) -> System::arraycopy

In this scenario, if System::arraycopy can be optimized to use putInt or putLong, performance can be improved.

It is similar in the String concatenation scenario

String f(int i) {
	return "abcd" + i;
}

Here StringConcatHelper::prepend(int, byte, byte[], String, String) is called, and then String::getBytes(byte[], int, byte) -> System::arraycopy

package java.lang;
class StringConcatHelper {
    static int prepend(int index, byte coder, byte[] buf, String value, String prefix) {
        index -= value.length();
        if (coder == String.LATIN1) {
            value.getBytes(buf, index, String.LATIN1);
            index -= prefix.length();
            prefix.getBytes(buf, index, String.LATIN1);
        } else {
            value.getBytes(buf, index, String.UTF16);
            index -= prefix.length();
            prefix.getBytes(buf, index, String.UTF16);
        }
        return index;
    }
}

Here is similar to the above, can we optimize "abcd".getBytes to putInt or putLong?

In summary, can we optimize the System::arraycopy of a stable byte[] with a length of 4 to putInt?

Nov 07 '24 01:11 wenshao

<<Hide useless information>>

Nov 09 '24 03:11 wenshao

@eme64 Why is there no noticeable difference in the performance of +/-MergeStores

What did you do to find out yourself? Did you use the trace flags to see if there is a difference in what is optimized / the output assembly code?

Nov 11 '24 07:11 eme64

After jvm_args is set correctly, +/-MergeStores has a significant performance difference.

List of tests with significant performance improvements

The putBytes series and LittleEndian-based tests show that +MergeStores has a significant performance improvement under x64.

	-MergeStores	+MergeStores	delta
MergeStoreBench.putBytes4	4475.891	929.327	381.63%
MergeStoreBench.putBytes4U	4479.133	928.502	382.40%
MergeStoreBench.putBytes4X	4477.133	929.183	381.84%
MergeStoreBench.putChars4B	9008.350	5638.550	59.76%
MergeStoreBench.putChars4BU	8961.671	1144.479	683.03%
MergeStoreBench.putChars4C	4485.308	1133.473	295.71%
MergeStoreBench.putChars4L	9013.570	5640.893	59.79%
MergeStoreBench.putChars4LU	8957.625	1142.796	683.83%
MergeStoreBench.putChars4LV	4488.698	1134.303	295.72%
MergeStoreBench.putChars4S	4485.836	1133.430	295.78%
MergeStoreBench.setIntL	15430.183	2113.544	630.06%
MergeStoreBench.setIntLU	17361.730	4783.040	262.99%
MergeStoreBench.setIntRL	16525.068	3244.126	409.38%
MergeStoreBench.setIntRLU	14401.071	5930.149	142.85%
MergeStoreBench.setLongL	31353.713	5405.189	480.07%
MergeStoreBench.setLongLU	26113.756	4287.166	509.11%
MergeStoreBench.setLongRL	27232.898	4523.658	502.01%
MergeStoreBench.setLongRLU	26196.973	4798.177	445.98%
MergeStoreBench.setLongU	4500.659	4271.225	5.37%

VarHandle is the fastest

I found that the test using VarHandler is faster than the others, whether it is BigEndian or LittleEndian VarHandler is the fastest.

	-MergeStores	+MergeStores	delta
MergeStoreBench.setCharBS	6088.826	6085.857	0.05%
MergeStoreBench.setCharBV	3596.210	3595.236	0.03%
MergeStoreBench.setCharLV	2248.493	2245.939	0.11%
MergeStoreBench.setIntBV	3239.985	3227.997	0.37%
MergeStoreBench.setIntLV	2128.975	2126.187	0.13%
MergeStoreBench.setLongBV	2168.387	2165.313	0.14%
MergeStoreBench.setLongLV	2048.737	2116.054	-3.18%

List of tests that were not improved

The speed of using BigEndian and System.arrayCopy is not improved, as shown below

	-MergeStores	+MergeStores	delta	comment
MergeStoreBench.putBytes4GetBytes	5880.164	5883.995	-0.07%	Merge Copy
MergeStoreBench.putChars4BV	4488.270	4486.457	0.04%	BigEndian
MergeStoreBench.setCharC	4519.981	4471.174	1.09%	BigEndian
MergeStoreBench.setCharLS	5619.414	5618.239	0.02%	BigEndian
MergeStoreBench.setIntB	8039.705	8045.113	-0.07%	BigEndian
MergeStoreBench.setIntBU	17884.223	17764.347	0.67%	BigEndian
MergeStoreBench.setIntRB	13786.186	13815.759	-0.21%	Reverse BigEndian
MergeStoreBench.setIntRBU	14747.463	14771.017	-0.16%	Reverse BigEndian
MergeStoreBench.setIntRU	5898.169	5875.589	0.38%	Reverse BigEndian
MergeStoreBench.setIntU	4805.170	4784.162	0.44%	BigEndian
MergeStoreBench.setLongB	31674.058	31662.483	0.04%	BigEndian
MergeStoreBench.setLongBU	25696.702	25674.394	0.09%	BigEndian
MergeStoreBench.setLongRB	29901.778	29909.501	-0.03%	Reverse BigEndian
MergeStoreBench.setLongRBU	24945.914	25005.171	-0.24%	Reverse BigEndian
MergeStoreBench.setLongRU	4797.817	4795.018	0.06%	Reverse BigEndian

Full tests

	-MergeStores	+MergeStores	delta
MergeStoreBench.putBytes4	4475.891	929.327	381.63%
MergeStoreBench.putBytes4GetBytes	5880.164	5883.995	-0.07%
MergeStoreBench.putBytes4U	4479.133	928.502	382.40%
MergeStoreBench.putBytes4X	4477.133	929.183	381.84%
MergeStoreBench.putChars4B	9008.350	5638.550	59.76%
MergeStoreBench.putChars4BU	8961.671	1144.479	683.03%
MergeStoreBench.putChars4BV	4488.270	4486.457	0.04%
MergeStoreBench.putChars4C	4485.308	1133.473	295.71%
MergeStoreBench.putChars4L	9013.570	5640.893	59.79%
MergeStoreBench.putChars4LU	8957.625	1142.796	683.83%
MergeStoreBench.putChars4LV	4488.698	1134.303	295.72%
MergeStoreBench.putChars4S	4485.836	1133.430	295.78%
MergeStoreBench.setCharBS	6088.826	6085.857	0.05%
MergeStoreBench.setCharBV	3596.210	3595.236	0.03%
MergeStoreBench.setCharC	4519.981	4471.174	1.09%
MergeStoreBench.setCharLS	5619.414	5618.239	0.02%
MergeStoreBench.setCharLV	2248.493	2245.939	0.11%
MergeStoreBench.setIntB	8039.705	8045.113	-0.07%
MergeStoreBench.setIntBU	17884.223	17764.347	0.67%
MergeStoreBench.setIntBV	3239.985	3227.997	0.37%
MergeStoreBench.setIntL	15430.183	2113.544	630.06%
MergeStoreBench.setIntLU	17361.730	4783.040	262.99%
MergeStoreBench.setIntLV	2128.975	2126.187	0.13%
MergeStoreBench.setIntRB	13786.186	13815.759	-0.21%
MergeStoreBench.setIntRBU	14747.463	14771.017	-0.16%
MergeStoreBench.setIntRL	16525.068	3244.126	409.38%
MergeStoreBench.setIntRLU	14401.071	5930.149	142.85%
MergeStoreBench.setIntRU	5898.169	5875.589	0.38%
MergeStoreBench.setIntU	4805.170	4784.162	0.44%
MergeStoreBench.setLongB	31674.058	31662.483	0.04%
MergeStoreBench.setLongBU	25696.702	25674.394	0.09%
MergeStoreBench.setLongBV	2168.387	2165.313	0.14%
MergeStoreBench.setLongL	31353.713	5405.189	480.07%
MergeStoreBench.setLongLU	26113.756	4287.166	509.11%
MergeStoreBench.setLongLV	2048.737	2116.054	-3.18%
MergeStoreBench.setLongRB	29901.778	29909.501	-0.03%
MergeStoreBench.setLongRBU	24945.914	25005.171	-0.24%
MergeStoreBench.setLongRL	27232.898	4523.658	502.01%
MergeStoreBench.setLongRLU	26196.973	4798.177	445.98%
MergeStoreBench.setLongRU	4797.817	4795.018	0.06%
MergeStoreBench.setLongU	4500.659	4271.225	5.37%

Nov 14 '24 00:11 wenshao

@eme64 Are there plans to support MergeLoad, and big-endian MergeStore on little-endian machines?

Nov 14 '24 05:11 wenshao

@wenshao I'll look at your results later.

@eme64 Are there plans to support MergeLoad, and big-endian MergeStore on little-endian machines?

These are all good ideas, and I already discussed it offline with @cl4es . I have lots of tasks I'm working on, and this is on the lowest tier of priorities for me personally. But if someone else wants to jump on that, then I can coach and review.

We could also be interested in "MergeCopy", i.e. load->store patterns. Maybe this just ends up being SuperWord again, but this time for straight line code.

Nov 14 '24 06:11 eme64

@wenshao Ah. I only just realized it: you have a lot of get benchmarks... they don't really belong to MergeStores... if anything you could put them in a separate MergeLoads benchmark!

Also: now we have lots of data here. But data alone is kind of pointless. We need analysis to see what patterns and why they get speedups.

Nov 14 '24 07:11 eme64

@eme64 I have updated the benchmark numbers above, removing the getXXX part. I have also added some analysis. I found that the test using VarHandle is the fastest, whether it is BigEndian or LittleEndian. All the BigEndian tests are not MergeStored, including the Reverse combination scenario.

Nov 18 '24 11:11 wenshao

@wenshao This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

Jan 06 '25 20:01 bridgekeeper[bot]

/integrate

Jan 08 '25 09:01 wenshao

Going to push as commit b741f3fe5b54755d19c5abeca76fdceeccafd448. Since your change was applied there have been 899 commits pushed to the master branch:

b0c935c03ebb34f20f15dd8c7616c6c4526073cd: 8347047: Cleanup action passed to MemorySegment::reinterpret keeps old segment alive
bcefab5e55d4527a38dcab550581a734c1564608: 8342468: Improve API documentation for java.lang.classfile.constantpool
40f0a398fa9b1b39a43640973eaffb041bb7b63d: 8343342: java/io/File/GetXSpace.java fails on Windows with CD-ROM drive
021c476409c52c65cc7b40516d81dedef040fe83: 8347148: [BACKOUT] AccessFlags can be u2 in metadata
ddb58819640dc8f1930d243d6eb07ce88ef79b22: 8329549: Remove FORMAT64_MODIFIER
098afc8b7d0e7caa82999fb9d4e319ea8aed09a1: 8339113: AccessFlags can be u2 in metadata
e413fc643c4a58e3c46d81025c3ac9fbf89db4b9: 8347127: CTW fails to build after JDK-8334733
9702accdd9a25e05628d470bf248edd5d80c0c4d: 8175709: DateTimeFormatterBuilder.appendZoneId() has misleading JavaDoc
030149fec4f175e5571e053fa56d2921d95c6b13: 8334644: Automate javax/print/attribute/PageRangesException.java
c8a9dd3a027781d006850c028714a62903c487d5: 8346609: Improve MemorySegment.toString
... and 889 more: https://git.openjdk.org/jdk/compare/8cb122119409fb13b4b9b2e74851207734d5c198...master

Your commit was automatically rebased without conflicts.

Jan 08 '25 09:01 openjdk[bot]

@wenshao Pushed as commit b741f3fe5b54755d19c5abeca76fdceeccafd448.

:bulb: You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Jan 08 '25 09:01 openjdk[bot]

Thanks, Emanuel Peter.

Jan 08 '25 09:01 wenshao

These are all good ideas, and I already discussed it offline with @cl4es . I have lots of tasks I'm working on, and this is on the lowest tier of priorities for me personally. But if someone else wants to jump on that, then I can coach and review.

We could also be interested in "MergeCopy", i.e. load->store patterns. Maybe this just ends up being SuperWord again, but this time for straight line code.

PR #23030 has been submitted to add support for BigEndian MergeStore

Jan 10 '25 10:01 wenshao