jdk icon indicating copy to clipboard operation
jdk copied to clipboard

8331311: C2: Big Endian Port of 8318446: optimize stores into primitive arrays by combining values into larger store

Open reinrich opened this issue 9 months ago • 13 comments

This pr adds a few tweaks to JDK-8318446 which allows enabling it also on big endian platforms (e.g. AIX, S390). JDK-8318446 introduced a C2 optimization to replace consecutive stores to a primitive array with just one store.

By example (from TestMergeStores.java):

    static Object[] test2a(byte[] a, int offset, long v) {
        if (IS_BIG_ENDIAN) {
            a[offset + 0] = (byte)(v >> 56);
            a[offset + 1] = (byte)(v >> 48);
            a[offset + 2] = (byte)(v >> 40);
            a[offset + 3] = (byte)(v >> 32);
            a[offset + 4] = (byte)(v >> 24);
            a[offset + 5] = (byte)(v >> 16);
            a[offset + 6] = (byte)(v >> 8);
            a[offset + 7] = (byte)(v >> 0);
        } else {
            a[offset + 0] = (byte)(v >> 0);
            a[offset + 1] = (byte)(v >> 8);
            a[offset + 2] = (byte)(v >> 16);
            a[offset + 3] = (byte)(v >> 24);
            a[offset + 4] = (byte)(v >> 32);
            a[offset + 5] = (byte)(v >> 40);
            a[offset + 6] = (byte)(v >> 48);
            a[offset + 7] = (byte)(v >> 56);
        }
        return new Object[]{ a };
    }

Depending on the endianess 8 bytes are stored into an array. The order of the stores is the same as the order of an 8-byte-store therefore 8 1-byte-stores can be replaced with just one 8-byte-store (if there aren't too many range checks).

Additionally I've fixed a few comments and a test bug.

The optimization seems to be a little bit more effective on big endian platforms.

Again by example:

    static Object[] test800a(byte[] a, int offset, long v) {
        if (IS_BIG_ENDIAN) {
            a[offset + 0] = (byte)(v >> 40); // Removed from candidate list
            a[offset + 1] = (byte)(v >> 32); // Removed from candidate list
            a[offset + 2] = (byte)(v >> 24); // Merged
            a[offset + 3] = (byte)(v >> 16); // Merged
            a[offset + 4] = (byte)(v >> 8);  // Merged
            a[offset + 5] = (byte)(v >> 0);  // Merged
        } else {
            a[offset + 0] = (byte)(v >> 0);  // Removed from candidate list
            a[offset + 1] = (byte)(v >> 8);  // Removed from candidate list
            a[offset + 2] = (byte)(v >> 16); // Not merged
            a[offset + 3] = (byte)(v >> 24); // Not merged
            a[offset + 4] = (byte)(v >> 32); // Not merged
            a[offset + 5] = (byte)(v >> 40); // Not merged
        }
        return new Object[]{ a };
    }

The sequence of candidate stores begins at the lowest store (in Memory def-use order) and is trimmed to a power of 2 removing higher stores if necessary. On little endian platforms this removes the least significant bytes to be stored. Merging would require a right shift of the input value. While possible this is currently not done. With big endian order the stores of the more significant bytes are removed and the merge succeeds because no shift is needed.

I introduced new platform attributes little-endian, big-endian to the IR testing framework to be able to adapt IR matching rules to this difference.

Testing:

TestMergeStores.java on AIX and S390.

JTReg tests: tier1-4 of hotspot and jdk. All of Langtools and jaxp. JCK, SPECjvm2008, SPECjbb2015, Renaissance Suite, and SAP specific tests. Testing was done with fastdebug builds on the main platforms and also on Linux/PPC64le and AIX.


Progress

  • [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue

Issue

  • JDK-8331311: C2: Big Endian Port of 8318446: optimize stores into primitive arrays by combining values into larger store (Bug - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/19218/head:pull/19218
$ git checkout pull/19218

Update a local copy of the PR:
$ git checkout pull/19218
$ git pull https://git.openjdk.org/jdk.git pull/19218/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 19218

View PR using the GUI difftool:
$ git pr show -t 19218

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/19218.diff

Webrev

Link to Webrev Comment

reinrich avatar May 13 '24 15:05 reinrich

:wave: Welcome back rrich! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar May 13 '24 15:05 bridgekeeper[bot]

@reinrich This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8331311: C2: Big Endian Port of 8318446: optimize stores into primitive arrays by combining values into larger store

Reviewed-by: epeter, kvn

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 33 new commits pushed to the master branch:

  • 2a37764e7428d579a3080e62681f1c9c9f816c1e: 8333743: Change .jcheck/conf branches property to match valid branches
  • 75dc2f8518d0adea30f7065d6732b807c0220756: 8330182: Start of release updates for JDK 24
  • 054362abe040938b87eb1a1cab8a0a94540e0667: 8332550: [macos] Voice Over: java.awt.IllegalComponentStateException: component must be showing on the screen to determine its location
  • 9b436d048ec92f74ec6812ae20fde21751927d4b: 8333674: Disable CollectorPolicy.young_min_ergo_vm for PPC64
  • 487c4771818999749bfd507ab85777795bba0832: 8333647: C2 SuperWord: some additional PopulateIndex tests
  • d02cb742f79e88c6438ca58a6357fe432fb286cb: 8333270: HandlersOnComplexResetUpdate and HandlersOnComplexUpdate tests fail with "Unexpected reference" if timeoutFactor is less than 1/3
  • 02f240415cbda5f67a91af50d5974fb001104170: 8333560: -Xlint:restricted does not work with --release
  • 606df441410a69034b4c113e85ce21937d1a0808: 8332670: C1 clone intrinsic needs memory barriers
  • 33fd6ae98638d2a4b33d18cc4acee4f0daaa9b35: 8333622: ubsan: relocInfo_x86.cpp:101:56: runtime error: pointer index expression with base (-1) overflowed
  • 8de5d2014a87d58d389eb8400f619d1b1fa3abe7: 8332865: ubsan: os::attempt_reserve_memory_between reports overflow
  • ... and 23 more: https://git.openjdk.org/jdk/compare/326dbb1b139dd1ec1b8605339b91697cdf49da9a...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

openjdk[bot] avatar May 13 '24 15:05 openjdk[bot]

@reinrich The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar May 13 '24 15:05 openjdk[bot]

@offamitkumar you can put this through your testing if you like. It should solve the issues with test/hotspot/jtreg/compiler/c2/TestMergeStores.java also for s390.

reinrich avatar May 13 '24 15:05 reinrich

@reinrich test is passing on s390x with your change. tier1 test are in progress.

Update: tier1 test are also clean on s390x;

offamitkumar avatar May 13 '24 16:05 offamitkumar

@reinrich thanks for taking this up! Just did a quick scan of the tests. I think it could be good to have both big/small endian tests run on both big/small endian machines, but only expect IR rules to pass if the test and platform are expected to optimize. This just makes sure that the logic is correct, and does not optimize the wrong cases, producing wrong results.

eme64 avatar May 14 '24 16:05 eme64

It's not obvious to me why something like

            a[offset + 2] = (byte)(v >> 16); // Not merged
            a[offset + 3] = (byte)(v >> 24); // Not merged
            a[offset + 4] = (byte)(v >> 32); // Not merged
            a[offset + 5] = (byte)(v >> 40); // Not merged

can't be merged. Is it because you only use vas a possible 32-bit value? Why not use something like the following pseudo-code?

int bytes2word(byte b1, byte b2, byte b3, byte b4) {
  return (b1 & 0xff) << 24 | (b2 & 0xff) << 16 | (b3 & 0xff) << 8 | (b4 & 0xff);
}
// Substituting in the values from the example:
int big_endian = bytes2word((byte)(v >> 40), (byte)(v >> 32), (byte)(v >> 24), (byte)(v >> 16));
int little_endian = bytes2word((byte)(v >> 16), (byte)(v >> 24), (byte)(v >> 32), (byte)(v >> 40));

dean-long avatar May 14 '24 21:05 dean-long

In other words, it seems like it could work for arbitrary byte values if the merged value was computed from those individual values. They wouldn't need to be shifted values.

            a[offset + 0] = (byte)0x1;
            a[offset + 1] = (byte)(0x2;
            a[offset + 2] = (byte)0x3;
            a[offset + 3] = (byte)(0x4;

The example above would either write 0x01020304 or 0x04030201 depending on the endianness.

dean-long avatar May 14 '24 21:05 dean-long

It's not obvious to me why something like

            a[offset + 2] = (byte)(v >> 16); // Not merged
            a[offset + 3] = (byte)(v >> 24); // Not merged
            a[offset + 4] = (byte)(v >> 32); // Not merged
            a[offset + 5] = (byte)(v >> 40); // Not merged

can't be merged.

The stores could be merged to the following pseudo code:

  *(int*)&a[offset + 2] = (int)(v >> 16); // Merged

The current logic doesn't accept the right shift here. I think at that location we can always accept merged_input_value asserting that it is a right shift of base_last since the is_adjacent_input_pair checks succeeded before. I haven't tried it though.

I'll clarify the synopsis of this pr and the comment in TestMergeStores.java.

reinrich avatar May 15 '24 07:05 reinrich

@dean-long @reinrich Yes, I guess that is a generalization that could be made. It would require a lot more tests to make sure all combinations are checked. I would suggest doing that in a separate RFE to keep things simple and reviewable.

eme64 avatar May 15 '24 07:05 eme64

Thanks for looking at the pr.

Just did a quick scan of the tests. I think it could be good to have both big/small endian tests run on both big/small endian machines, but only expect IR rules to pass if the test and platform are expected to optimize. This just makes sure that the logic is correct, and does not optimize the wrong cases, producing wrong results.

I've done that for test2 and introduced test2BE. Is that want you mean?

reinrich avatar May 15 '24 14:05 reinrich

Test error is unrelated to the changes. Upload of test results failed: Error: Failed to CreateArtifact: Failed to make request after 5 attempts: Request timeout: /twirp/github.actions.results.api.v1.ArtifactService/CreateArtifact

reinrich avatar May 16 '24 05:05 reinrich

@reinrich please ping me again to ask if testing is ok before you integrate ;)

eme64 avatar May 24 '24 08:05 eme64

@reinrich please ping me again to ask if testing is ok before you integrate ;)

Thanks for picking this up again. I quickly wanted to let you know that I'm out of office. I will be back in a week.

reinrich avatar May 24 '24 15:05 reinrich

@reinrich please still wait until the JDK24 fork on Thrusday to integrate, so that we do not have to backport possible regression fixes - I had 3 or 4 with my original patch ;)

eme64 avatar Jun 04 '24 16:06 eme64

I'm running testing again, but the code looks good now!

I just had another idea: Could we use some sort of "byte reverse / shuffle" operation to do these use cases for both big/little-endian?

        storeBytes(bytes, offset, (byte)(value >> 8),
                                  (byte)(value >> 0));

        storeBytes(bytes, offset, (byte)(value >> 0),
                                  (byte)(value >> 8));

Not sure if that would be profitable or even available on all platforms. Could be a future RFE someone can work on after this. What do you think? It might make performance more predictable across platforms.

You mean to combine the stores even if the explicit ordering does not match the ordering of the store instruction, adding a ReverseBytes[SIL]Node iff supported in that case, right? I've been thinking about this, too. In my opinion it would be worthwhile.

reinrich avatar Jun 05 '24 07:06 reinrich

@reinrich please still wait until the JDK24 fork on Thrusday to integrate, so that we do not have to backport possible regression fixes - I had 3 or 4 with my original patch ;)

Thanks for the reviews @eme64 and @vnkozlov! I'll integrate after the code split if more local testing is successful.

reinrich avatar Jun 05 '24 07:06 reinrich

I did another round of testing on s390x. looks good.

offamitkumar avatar Jun 06 '24 07:06 offamitkumar

I did another round of testing on s390x. looks good.

Thanks Amit.

reinrich avatar Jun 06 '24 08:06 reinrich

/integrate

reinrich avatar Jun 07 '24 06:06 reinrich

Going to push as commit f7862bd6b9994814c6dfd43d471122408601f288. Since your change was applied there have been 38 commits pushed to the master branch:

  • b4beda21b487886b022e04766e140e6d1df1038a: 8332537: C2: High memory usage reported for compiler/loopopts/superword/TestAlignVectorFuzzer.java
  • e5383d710c0727181a2f0b569a881de2492e3683: 8333713: C2 SuperWord: cleanup in vectornode.cpp/hpp
  • 944aeb81b16e3e7a3019cafdefe67b797fa6be96: 8325155: C2 SuperWord: remove alignment boundaries
  • d8af58941b5dedb9774c0971895c4924e57ac28b: 8026127: Deflater/Inflater documentation incomplete/misleading
  • 6238bc8da2abe7a1f0cdd98c0af01e9ba1869ec3: 8333456: CompactNumberFormat integer parsing fails when string has no suffix
  • 2a37764e7428d579a3080e62681f1c9c9f816c1e: 8333743: Change .jcheck/conf branches property to match valid branches
  • 75dc2f8518d0adea30f7065d6732b807c0220756: 8330182: Start of release updates for JDK 24
  • 054362abe040938b87eb1a1cab8a0a94540e0667: 8332550: [macos] Voice Over: java.awt.IllegalComponentStateException: component must be showing on the screen to determine its location
  • 9b436d048ec92f74ec6812ae20fde21751927d4b: 8333674: Disable CollectorPolicy.young_min_ergo_vm for PPC64
  • 487c4771818999749bfd507ab85777795bba0832: 8333647: C2 SuperWord: some additional PopulateIndex tests
  • ... and 28 more: https://git.openjdk.org/jdk/compare/326dbb1b139dd1ec1b8605339b91697cdf49da9a...master

Your commit was automatically rebased without conflicts.

openjdk[bot] avatar Jun 07 '24 06:06 openjdk[bot]

@reinrich Pushed as commit f7862bd6b9994814c6dfd43d471122408601f288.

:bulb: You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

openjdk[bot] avatar Jun 07 '24 06:06 openjdk[bot]