jdk 8331311: C2: Big Endian Port of 8318446: optimize stores into primitive arrays by combining values into larger store

This pr adds a few tweaks to JDK-8318446 which allows enabling it also on big endian platforms (e.g. AIX, S390). JDK-8318446 introduced a C2 optimization to replace consecutive stores to a primitive array with just one store.

By example (from TestMergeStores.java):

    static Object[] test2a(byte[] a, int offset, long v) {
        if (IS_BIG_ENDIAN) {
            a[offset + 0] = (byte)(v >> 56);
            a[offset + 1] = (byte)(v >> 48);
            a[offset + 2] = (byte)(v >> 40);
            a[offset + 3] = (byte)(v >> 32);
            a[offset + 4] = (byte)(v >> 24);
            a[offset + 5] = (byte)(v >> 16);
            a[offset + 6] = (byte)(v >> 8);
            a[offset + 7] = (byte)(v >> 0);
        } else {
            a[offset + 0] = (byte)(v >> 0);
            a[offset + 1] = (byte)(v >> 8);
            a[offset + 2] = (byte)(v >> 16);
            a[offset + 3] = (byte)(v >> 24);
            a[offset + 4] = (byte)(v >> 32);
            a[offset + 5] = (byte)(v >> 40);
            a[offset + 6] = (byte)(v >> 48);
            a[offset + 7] = (byte)(v >> 56);
        }
        return new Object[]{ a };
    }

Depending on the endianess 8 bytes are stored into an array. The order of the stores is the same as the order of an 8-byte-store therefore 8 1-byte-stores can be replaced with just one 8-byte-store (if there aren't too many range checks).

Additionally I've fixed a few comments and a test bug.

The optimization seems to be a little bit more effective on big endian platforms.

Again by example:

    static Object[] test800a(byte[] a, int offset, long v) {
        if (IS_BIG_ENDIAN) {
            a[offset + 0] = (byte)(v >> 40); // Removed from candidate list
            a[offset + 1] = (byte)(v >> 32); // Removed from candidate list
            a[offset + 2] = (byte)(v >> 24); // Merged
            a[offset + 3] = (byte)(v >> 16); // Merged
            a[offset + 4] = (byte)(v >> 8);  // Merged
            a[offset + 5] = (byte)(v >> 0);  // Merged
        } else {
            a[offset + 0] = (byte)(v >> 0);  // Removed from candidate list
            a[offset + 1] = (byte)(v >> 8);  // Removed from candidate list
            a[offset + 2] = (byte)(v >> 16); // Not merged
            a[offset + 3] = (byte)(v >> 24); // Not merged
            a[offset + 4] = (byte)(v >> 32); // Not merged
            a[offset + 5] = (byte)(v >> 40); // Not merged
        }
        return new Object[]{ a };
    }

The sequence of candidate stores begins at the lowest store (in Memory def-use order) and is trimmed to a power of 2 removing higher stores if necessary. On little endian platforms this removes the least significant bytes to be stored. Merging would require a right shift of the input value. While possible this is currently not done. With big endian order the stores of the more significant bytes are removed and the merge succeeds because no shift is needed.

I introduced new platform attributes little-endian, big-endian to the IR testing framework to be able to adapt IR matching rules to this difference.

Testing:

TestMergeStores.java on AIX and S390.

JTReg tests: tier1-4 of hotspot and jdk. All of Langtools and jaxp. JCK, SPECjvm2008, SPECjbb2015, Renaissance Suite, and SAP specific tests. Testing was done with fastdebug builds on the main platforms and also on Linux/PPC64le and AIX.

Progress

[ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
[x] Change must not contain extraneous whitespace
[x] Commit message must refer to an issue

Issue

JDK-8331311: C2: Big Endian Port of 8318446: optimize stores into primitive arrays by combining values into larger store (Bug - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/19218/head:pull/19218
$ git checkout pull/19218

Update a local copy of the PR:
$ git checkout pull/19218
$ git pull https://git.openjdk.org/jdk.git pull/19218/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 19218

View PR using the GUI difftool:
$ git pr show -t 19218

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/19218.diff

Webrev

Link to Webrev Comment

May 13 '24 15:05 reinrich

:wave: Welcome back rrich! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

May 13 '24 15:05 bridgekeeper[bot]

@reinrich This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8331311: C2: Big Endian Port of 8318446: optimize stores into primitive arrays by combining values into larger store

Reviewed-by: epeter, kvn

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 33 new commits pushed to the master branch:

2a37764e7428d579a3080e62681f1c9c9f816c1e: 8333743: Change .jcheck/conf branches property to match valid branches
75dc2f8518d0adea30f7065d6732b807c0220756: 8330182: Start of release updates for JDK 24
054362abe040938b87eb1a1cab8a0a94540e0667: 8332550: [macos] Voice Over: java.awt.IllegalComponentStateException: component must be showing on the screen to determine its location
9b436d048ec92f74ec6812ae20fde21751927d4b: 8333674: Disable CollectorPolicy.young_min_ergo_vm for PPC64
487c4771818999749bfd507ab85777795bba0832: 8333647: C2 SuperWord: some additional PopulateIndex tests
d02cb742f79e88c6438ca58a6357fe432fb286cb: 8333270: HandlersOnComplexResetUpdate and HandlersOnComplexUpdate tests fail with "Unexpected reference" if timeoutFactor is less than 1/3
02f240415cbda5f67a91af50d5974fb001104170: 8333560: -Xlint:restricted does not work with --release
606df441410a69034b4c113e85ce21937d1a0808: 8332670: C1 clone intrinsic needs memory barriers
33fd6ae98638d2a4b33d18cc4acee4f0daaa9b35: 8333622: ubsan: relocInfo_x86.cpp:101:56: runtime error: pointer index expression with base (-1) overflowed
8de5d2014a87d58d389eb8400f619d1b1fa3abe7: 8332865: ubsan: os::attempt_reserve_memory_between reports overflow
... and 23 more: https://git.openjdk.org/jdk/compare/326dbb1b139dd1ec1b8605339b91697cdf49da9a...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

May 13 '24 15:05 openjdk[bot]

@reinrich The following label will be automatically applied to this pull request:

hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

May 13 '24 15:05 openjdk[bot]

@offamitkumar you can put this through your testing if you like. It should solve the issues with test/hotspot/jtreg/compiler/c2/TestMergeStores.java also for s390.

May 13 '24 15:05 reinrich

@reinrich test is passing on s390x with your change. tier1 test are in progress.

Update: tier1 test are also clean on s390x;

May 13 '24 16:05 offamitkumar

Webrevs

May 14 '24 13:05 mlbridge[bot]

@reinrich thanks for taking this up! Just did a quick scan of the tests. I think it could be good to have both big/small endian tests run on both big/small endian machines, but only expect IR rules to pass if the test and platform are expected to optimize. This just makes sure that the logic is correct, and does not optimize the wrong cases, producing wrong results.

May 14 '24 16:05 eme64

It's not obvious to me why something like

            a[offset + 2] = (byte)(v >> 16); // Not merged
            a[offset + 3] = (byte)(v >> 24); // Not merged
            a[offset + 4] = (byte)(v >> 32); // Not merged
            a[offset + 5] = (byte)(v >> 40); // Not merged

can't be merged. Is it because you only use vas a possible 32-bit value? Why not use something like the following pseudo-code?

int bytes2word(byte b1, byte b2, byte b3, byte b4) {
  return (b1 & 0xff) << 24 | (b2 & 0xff) << 16 | (b3 & 0xff) << 8 | (b4 & 0xff);
}
// Substituting in the values from the example:
int big_endian = bytes2word((byte)(v >> 40), (byte)(v >> 32), (byte)(v >> 24), (byte)(v >> 16));
int little_endian = bytes2word((byte)(v >> 16), (byte)(v >> 24), (byte)(v >> 32), (byte)(v >> 40));

May 14 '24 21:05 dean-long

In other words, it seems like it could work for arbitrary byte values if the merged value was computed from those individual values. They wouldn't need to be shifted values.

            a[offset + 0] = (byte)0x1;
            a[offset + 1] = (byte)(0x2;
            a[offset + 2] = (byte)0x3;
            a[offset + 3] = (byte)(0x4;

The example above would either write 0x01020304 or 0x04030201 depending on the endianness.

May 14 '24 21:05 dean-long

It's not obvious to me why something like

            a[offset + 2] = (byte)(v >> 16); // Not merged
            a[offset + 3] = (byte)(v >> 24); // Not merged
            a[offset + 4] = (byte)(v >> 32); // Not merged
            a[offset + 5] = (byte)(v >> 40); // Not merged

can't be merged.

The stores could be merged to the following pseudo code:

  *(int*)&a[offset + 2] = (int)(v >> 16); // Merged

The current logic doesn't accept the right shift here. I think at that location we can always accept merged_input_value asserting that it is a right shift of base_last since the is_adjacent_input_pair checks succeeded before. I haven't tried it though.

I'll clarify the synopsis of this pr and the comment in TestMergeStores.java.

May 15 '24 07:05 reinrich

@dean-long @reinrich Yes, I guess that is a generalization that could be made. It would require a lot more tests to make sure all combinations are checked. I would suggest doing that in a separate RFE to keep things simple and reviewable.

May 15 '24 07:05 eme64

Thanks for looking at the pr.

Just did a quick scan of the tests. I think it could be good to have both big/small endian tests run on both big/small endian machines, but only expect IR rules to pass if the test and platform are expected to optimize. This just makes sure that the logic is correct, and does not optimize the wrong cases, producing wrong results.

I've done that for test2 and introduced test2BE. Is that want you mean?

May 15 '24 14:05 reinrich

Test error is unrelated to the changes. Upload of test results failed: Error: Failed to CreateArtifact: Failed to make request after 5 attempts: Request timeout: /twirp/github.actions.results.api.v1.ArtifactService/CreateArtifact

May 16 '24 05:05 reinrich

@reinrich please ping me again to ask if testing is ok before you integrate ;)

May 24 '24 08:05 eme64

@reinrich please ping me again to ask if testing is ok before you integrate ;)

Thanks for picking this up again. I quickly wanted to let you know that I'm out of office. I will be back in a week.

May 24 '24 15:05 reinrich

@reinrich please still wait until the JDK24 fork on Thrusday to integrate, so that we do not have to backport possible regression fixes - I had 3 or 4 with my original patch ;)

Jun 04 '24 16:06 eme64

I'm running testing again, but the code looks good now!

I just had another idea: Could we use some sort of "byte reverse / shuffle" operation to do these use cases for both big/little-endian?
        storeBytes(bytes, offset, (byte)(value >> 8),
                                  (byte)(value >> 0));

        storeBytes(bytes, offset, (byte)(value >> 0),
                                  (byte)(value >> 8));
Not sure if that would be profitable or even available on all platforms. Could be a future RFE someone can work on after this. What do you think? It might make performance more predictable across platforms.

You mean to combine the stores even if the explicit ordering does not match the ordering of the store instruction, adding a ReverseBytes[SIL]Node iff supported in that case, right? I've been thinking about this, too. In my opinion it would be worthwhile.

Jun 05 '24 07:06 reinrich

@reinrich please still wait until the JDK24 fork on Thrusday to integrate, so that we do not have to backport possible regression fixes - I had 3 or 4 with my original patch ;)

Thanks for the reviews @eme64 and @vnkozlov! I'll integrate after the code split if more local testing is successful.

Jun 05 '24 07:06 reinrich

I did another round of testing on s390x. looks good.

Jun 06 '24 07:06 offamitkumar

I did another round of testing on s390x. looks good.

Thanks Amit.

Jun 06 '24 08:06 reinrich

/integrate

Jun 07 '24 06:06 reinrich

Going to push as commit f7862bd6b9994814c6dfd43d471122408601f288. Since your change was applied there have been 38 commits pushed to the master branch:

b4beda21b487886b022e04766e140e6d1df1038a: 8332537: C2: High memory usage reported for compiler/loopopts/superword/TestAlignVectorFuzzer.java
e5383d710c0727181a2f0b569a881de2492e3683: 8333713: C2 SuperWord: cleanup in vectornode.cpp/hpp
944aeb81b16e3e7a3019cafdefe67b797fa6be96: 8325155: C2 SuperWord: remove alignment boundaries
d8af58941b5dedb9774c0971895c4924e57ac28b: 8026127: Deflater/Inflater documentation incomplete/misleading
6238bc8da2abe7a1f0cdd98c0af01e9ba1869ec3: 8333456: CompactNumberFormat integer parsing fails when string has no suffix
2a37764e7428d579a3080e62681f1c9c9f816c1e: 8333743: Change .jcheck/conf branches property to match valid branches
75dc2f8518d0adea30f7065d6732b807c0220756: 8330182: Start of release updates for JDK 24
054362abe040938b87eb1a1cab8a0a94540e0667: 8332550: [macos] Voice Over: java.awt.IllegalComponentStateException: component must be showing on the screen to determine its location
9b436d048ec92f74ec6812ae20fde21751927d4b: 8333674: Disable CollectorPolicy.young_min_ergo_vm for PPC64
487c4771818999749bfd507ab85777795bba0832: 8333647: C2 SuperWord: some additional PopulateIndex tests
... and 28 more: https://git.openjdk.org/jdk/compare/326dbb1b139dd1ec1b8605339b91697cdf49da9a...master

Your commit was automatically rebased without conflicts.

Jun 07 '24 06:06 openjdk[bot]

@reinrich Pushed as commit f7862bd6b9994814c6dfd43d471122408601f288.

:bulb: You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Jun 07 '24 06:06 openjdk[bot]

jdk jdk copied to clipboard

8331311: C2: Big Endian Port of 8318446: optimize stores into primitive arrays by combining values into larger store

Progress

Issue

Reviewing

Webrev

Webrevs

jdk
jdk copied to clipboard