jdk icon indicating copy to clipboard operation
jdk copied to clipboard

8325520: Vector loads with offsets incorrectly compiled

Open dafedafe opened this issue 11 months ago • 22 comments

Issue

When loading multiple vectors using offsets or masks (e.g. LongVector::fromArray(LongVector.SPECIES_256, storage, 0, offsets, 0 or LongVector::fromArray(LongVector.SPECIES_256, storage, 0, longMask)) there is an error in the C2 compiled code that makes different vectors be treated as equal even though they are not.

Causes

On vector-capable platforms, vector loads with masks and offsets (for Long, Integer, Float and Double) create specific nodes in the ideal graph (i.e. LoadVectorGather, LoadVectorMasked, LoadVectorGatherMasked). Vector loads without mask or offsets are mapped as LoadVector nodes instead. The same is true for StoreVectors. When running GVN loops we can get to the situation where we check if a Load node is preceded by a Store of the same address to be able to replace the Load with the input of the Store (in LoadNode::Identity). Here we call

https://github.com/openjdk/jdk/blob/87e864bf21d71daae4e001ec4edbb4ef1f60c36d/src/hotspot/share/opto/memnode.cpp#L1258

where we do an extra check for types if we deal with vectors but we don’t make sure that neither masks nor offsets interfere. Similarly, in StoreNode::Identity we first check if there is a Load and then a Store:

https://github.com/openjdk/jdk/blob/87e864bf21d71daae4e001ec4edbb4ef1f60c36d/src/hotspot/share/opto/memnode.cpp#L3509-L3515

but we don’t make sure that there are no masks or offsets. A few lines below, we check if there are 2 stores for the same value in a row. We need to check for masks and offsets here too but in this case we can include these cases if the masks and offsets of the vector stores are equivalent.

Solution

To avoid folding Load- and StoreVectors with masks and offsets we add a specific store_Opcode method to LoadVectorGatherNode, LoadVectorMaskedNode and LoadVectorGatherMaskedNode that doesn’t return a store opcode but instead returns its own (to avoid ever being the same as a store node). In this way, the checks in MemNode::can_see_stored_value

https://github.com/openjdk/jdk/blob/87e864bf21d71daae4e001ec4edbb4ef1f60c36d/src/hotspot/share/opto/memnode.cpp#L1164-L1166

and StoreNode::Identity

https://github.com/openjdk/jdk/blob/87e864bf21d71daae4e001ec4edbb4ef1f60c36d/src/hotspot/share/opto/memnode.cpp#L3509-L3515

will fail if masks or offsets are used. For 2 stores of the same value we instead check for mask and offset equality.

Regression tests for all versions of Load/StoreVectorGather/Masked have been added too.


Progress

  • [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue

Issue

  • JDK-8325520: Vector loads with offsets incorrectly compiled (Bug - P3)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18347/head:pull/18347
$ git checkout pull/18347

Update a local copy of the PR:
$ git checkout pull/18347
$ git pull https://git.openjdk.org/jdk.git pull/18347/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 18347

View PR using the GUI difftool:
$ git pr show -t 18347

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18347.diff

Webrev

Link to Webrev Comment

dafedafe avatar Mar 18 '24 12:03 dafedafe

:wave: Welcome back dfenacci! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Mar 18 '24 12:03 bridgekeeper[bot]

@dafedafe This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8325520: Vector loads and stores with indices and masks incorrectly compiled

Reviewed-by: epeter, thartmann

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 359 new commits pushed to the master branch:

  • 612ae9289a130b8701f74253fe5499358a2e2b5b: 8332735: [JVMCI] Add extra JVMCI events for exception translation
  • 1ea76d338b99900089277b7a2da82c24382a6ce1: 8332675: test/hotspot/jtreg/gc/testlibrary/Helpers.java compileClass javadoc does not match after 8321812
  • 94af3c23ea09ef2869cdc666d8170a655a0b3602: 8329203: Parallel: Investigate Mark-Compact for Full GC to decrease memory usage
  • 1e5a2780d9cc8e73ce65bdccb98c1808aadd0784: 8332676: Remove unused BarrierSetAssembler::incr_allocated_bytes
  • c2180d141ccca0e396ee9a0cd3044c4428b963d5: 8315767: InetAddress: constructing objects from BSD literal addresses
  • 2a11e0da026066191e4d4f30b9daca986c484630: 8332743: Update comment related to JDK-8320522
  • 6829d9ac67fb131462d3ef1c4bdfaa07df5d6be6: 8332122: [nmt] Totals for malloc should show total peak
  • 9d332e6591334a71335da65a4dd7b2ed0482b6cb: 8307193: Several Swing jtreg tests use class.forName on L&F classes
  • 98f6a80852383dcbdad7292b7d269a8547d54d45: 8332490: JMH org.openjdk.bench.java.util.zip.InflaterInputStreams.inflaterInputStreamRead OOM
  • 3d4185a9ce482cc655a4c67f39cb2682b02ae4fe: 8332739: Problemlist compiler/codecache/CheckLargePages until JDK-8332654 is fixed
  • ... and 349 more: https://git.openjdk.org/jdk/compare/174d62652c69e811cf44ab64db575b13a848a728...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

openjdk[bot] avatar Mar 18 '24 12:03 openjdk[bot]

@dafedafe The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Mar 18 '24 12:03 openjdk[bot]

I see that LoadVectorNode has no additional slots, but then LoadVectorMaskedNode and LoadVectorGatherMaskedNode simply do add_req. Why not just compare these extra slots (as many as there are) in LoadVectorNode::Identity, as well as checking that their Opcode is the same? Would that not be simpler and accomplish the same?

eme64 avatar Mar 21 '24 08:03 eme64

Can you please fix the whitespace issues, it is a bit difficult to read now on GitHub ;)

Sorry, I let them slip in. Fixed.

dafedafe avatar Mar 21 '24 09:03 dafedafe

It seems that your code would now disallow such a case, because you always check that the Ideal node you get back is of the same type as the this node. Am I right about that? Is that intended?

If I am right that you would have made such a optimization impossible, that probably means that our tests don't have an IR test that cover this case. You would definitely need to add such IR tests, otherwise we don't know if we are getting regressions. You will probably also have to run this patch through performance testing eventually.

I've tried to use the IR framework to check for folded/non folded nodes but couldn't make it reliable enough (in the end it didn't test more than the current test). So I decided to go back to the actual "regression" test which reproduces the original issue.

Do you know what is the issue with reliability for the IR rules? Why did it not always work?

eme64 avatar Mar 27 '24 09:03 eme64

I think you checks should also not be done after we already create a new node, but ideally before we create the new node. That is the normal pattern I see everywhere. So then you would probably have to dig into MemNode::can_see_stored_value and other places, to see where you would need to do your additional checks. There are already some checks for vector-type there, so that would be one intuitive starting-point.

I've removed the Identity overrides and changed the checks in MemNode::can_see_stored_value as you suggested (Identity methods were also a bit too restrictive). I also had to add checks to StoreNode::Identity here

https://github.com/openjdk/jdk/blob/7eb78e332080df3890b66ca04338a4ba69af45eb/src/hotspot/share/opto/memnode.cpp#L2799-L2814

dafedafe avatar Apr 09 '24 07:04 dafedafe

If I am right that you would have made such a optimization impossible, that probably means that our tests don't have an IR test that cover this case. You would definitely need to add such IR tests, otherwise we don't know if we are getting regressions. You will probably also have to run this patch through performance testing eventually.

I've transformed the tests to add IR tests as well. The issue with them seems to be related with JDK-8302459 (basically there are some missing cleanups when performing late inlining). So, for now the tests force a cleanup at every step (-XX:+IncrementalInlineForceCleanup).

dafedafe avatar Apr 09 '24 07:04 dafedafe

@dafedafe Nice, I think this already looks much better. Let me know if/when you want me to look at it again ;)

eme64 avatar Apr 15 '24 06:04 eme64

@dafedafe Nice, I think this already looks much better. Let me know if/when you want me to look at it again ;)

Thanks a lot @eme64!

dafedafe avatar Apr 15 '24 07:04 dafedafe

I've randomly thought about this change again (while trying to sleep yesterday...). And started worrying about this kind of case:

import jdk.incubator.vector.*;

public class Test {
    private static final VectorSpecies<Integer> I_SPECIES = IntVector.SPECIES_MAX;

    public static void main(String[] args) {
        // create mask
        boolean[] intMask = new boolean[I_SPECIES.length()];
        for (int i = 0; i < intMask.length; i++) {
            intMask[i] = (i % 2 == 0);
        }
        int[] intArray1 = new int[I_SPECIES.length()];
        int[] intArray2 = new int[I_SPECIES.length()];
        int[] intArray3 = new int[I_SPECIES.length()];
        for (int i = 0; i < intArray1.length; i++) {
            intArray1[i] = i;
            intArray2[i] = -2 * i;
            intArray3[i] = 0;
        }

        for (int i = 0; i < 10_000; i++) {
            test1(intMask, intArray1, intArray2, intArray3);
        }

        for (int i = 0; i < intArray1.length; i++) {
            System.out.println("i: " + i + " " + intArray1[i] + " " +intArray2[i] + " " + intArray3[i]);
        }
    }

    static void test1(boolean[] intMask, int[] intArray1, int[] intArray2, int[] intArray3) {
        VectorMask<Integer> intVectorMask = VectorMask.fromArray(I_SPECIES, intMask, 0);

        // Load values:    0   1   2   3   4   5 ...
        IntVector a = IntVector.fromArray(I_SPECIES, intArray1, 0);

        // Store, but only every second value:
	// Store:          0   x   2   x   4   x ...
	// Already there:  0  -2  -4  -6  -8 -10 ...
	// Result:         0  -2   2  -6   4 -10 ...
        a.intoArray(intArray2, 0, intVectorMask);

        // Load, but we cannot just take the value from vector a, since we mask it, and where the mask is
	// off, we must have zero.
	// Load:           0   0   2   0   4   0 ...
        IntVector b = IntVector.fromArray(I_SPECIES, intArray2, 0, intVectorMask);
        b.intoArray(intArray3, 0);
    }
}

This is not a bug yet, I think, but it could be if you do the change I suggested with store_Opcode(). Currently, if we have:

 2035  StoreVectorMasked  === 934 7 2031 1979 1956  |893  [[ 2034 2041 2066 ]]  @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact[0] *, idx=7; mismatched  Memory: @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact[0] *, idx=7; !jvms: IntVector::intoArray0Template @ bci:50 (line 3556) IntMaxVector::intoArray0 @ bci:9 (line 911) IntVector::intoArray @ bci:53 (line 3269) Test::test1 @ bci:26 (line 34)
 2041  LoadVectorMasked  === 934 2035 2031 1956  |893  [[ 1767 2066 1750 ]]  @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact[0] *, idx=7; mismatched #vectorz[16]:{int} !jvms: IntVector::fromArray0Template @ bci:53 (line 3455) IntMaxVector::fromArray0 @ bci:11 (line 874) IntVector::fromArray @ bci:32 (line 2986) Test::test1 @ bci:36 (line 35)

The load does not succeed in MemNode::can_see_stored_value, because the load's store_Opcode() == Op_StoreVector, and not Op_StoreVectorMasked. But if you were to implement LoadVectorMasked::store_Opcode() const { return Op_StoreVectorMasked; }, then you have to be careful: A masked load does not necessarily return the same as the masked store's input value. That input value is not yet masked, but the loaded value needs to be masked.

But it seems to me that you can actually never have a successful MemNode::can_see_stored_value case for masked operations, with your current code. It would always fail the store_Opcode() == st->Opcode() check. And for that gives the correct result, but it is still a bit strange that we don't override the store_Opcode for the masked/offset vector stores.

I don't know which way you want to go now. I these options:

  • Keep disallowing masked load/store "look-throughs".
    • Do that by having the "incorrect" store_Opcode as now. The downside is that the "offset only" case does not manage to do the look-through, even though that would be correct.
    • OR: have the correct store_Opcode, which allows the look-through for the "offset only" case. But then explicitly check for the masked cases, and disallow those.
  • Implement a special look-through, where you apply the mask with some blend/select/masked operation on the store input value, which simulates the masked load (i.e. you need to put zeros where the mask is off).

Not sure if this is all very clear, feel free to ask.

eme64 avatar Apr 23 '24 10:04 eme64

@dafedafe this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8325520
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

openjdk[bot] avatar Apr 24 '24 09:04 openjdk[bot]

⚠️ @dafedafe This pull request contains merges that bring in commits not present in the target repository. Since this is not a "merge style" pull request, these changes will be squashed when this pull request in integrated. If this is your intention, then please ignore this message. If you want to preserve the commit structure, you must change the title of this pull request to Merge <project>:<branch> where <project> is the name of another project in the OpenJDK organization (for example Merge jdk:master).

openjdk[bot] avatar Apr 25 '24 08:04 openjdk[bot]

The load does not succeed in MemNode::can_see_stored_value, because the load's store_Opcode() == Op_StoreVector, and not Op_StoreVectorMasked. But if you were to implement LoadVectorMasked::store_Opcode() const { return Op_StoreVectorMasked; }, then you have to be careful: A masked load does not necessarily return the same as the masked store's input value. That input value is not yet masked, but the loaded value needs to be masked.

But it seems to me that you can actually never have a successful MemNode::can_see_stored_value case for masked operations, with your current code. It would always fail the store_Opcode() == st->Opcode() check. And for that gives the correct result, but it is still a bit strange that we don't override the store_Opcode for the masked/offset vector stores.

I don't know which way you want to go now. I these options:

  • Keep disallowing masked load/store "look-throughs".

    • Do that by having the "incorrect" store_Opcode as now. The downside is that the "offset only" case does not manage to do the look-through, even though that would be correct.
    • OR: have the correct store_Opcode, which allows the look-through for the "offset only" case. But then explicitly check for the masked cases, and disallow those.
  • Implement a special look-through, where you apply the mask with some blend/select/masked operation on the store input value, which simulates the masked load (i.e. you need to put zeros where the mask is off).

Not sure if this is all very clear, feel free to ask.

Yep it is actually very clear, thanks a lot @eme64! Actually your example is with masks, but there is a similar problem with offsets as well. Think of what happens if the offset array is shorter than the length of the array or if there are duplicate indices in the offset array: the result of a load-store or a store-load sequence doesn’t produce the same original array or vector. So actually to avoid folding in those 2 cases (load->store and store->load) for now I made specific store_Opcode implementations that are never the same as a Store opcode, so that the store_Opcode() == st->Opcode() and val->as_Load()->store_Opcode() == Opcode() is always false (I’ve used the opcode of the LoadNode itself but possibly there is a better solution). As you hint one could also think of a solution involving substituting load/store masked vector nodes with “simple” vector mask nodes and load/store gather/scatter vector nodes with “simple” vector shuffle nodes (if these exist) but I’m not sure how much this would bring and anyway for this issue I would keep the fix simple. What do you think?

dafedafe avatar Apr 25 '24 08:04 dafedafe

Nice, ah you are right, there can be issues with mask-only cases as well!

It would be great if you had tests that exactly exercise these "bad" examples, where it looks like we might optimize, but it would be wrong.

I'll look at your store_Opcode changes now...

eme64 avatar Apr 25 '24 12:04 eme64

About storing a IntVector to memory, and then loading as FloatVector: You can use a MemorySegment:

// Wrap an array into a MemorySegment
MemorySegment ms = MemorySegment.fromArray(new byte[10_000]);

// Create your favourite int vector
IntVector intVector = ...;

// Store that int vector to the memory segment (internally, it does checkIndex and unsafe store to the byte array)
intVector.intoMemorySegment(ms, offset, ByteOrder.nativeOrder());

// Load a float vector from the memory segment (internally, it does checkIndex and unsafe load from the byte array)
FloatVector floatVector = FloatVector.fromMemorySegment(ms, offset, ByteOrder.nativeOrder())

I did not test this, but I think something like this should work.

eme64 avatar Apr 25 '24 12:04 eme64

// Load a float vector from the memory segment (internally, it does checkIndex and unsafe load from the byte array) FloatVector floatVector = FloatVector.fromMemorySegment(ms, offset, ByteOrder.nativeOrder())


I did not test this, but I think something like this should work.

Right! I totally missed that from- intoMmemorySegment methods! I guess that’s probably one reason why vector nodes are not typed. I’ve added checks to StoreNode::Identity as well.

dafedafe avatar May 06 '24 08:05 dafedafe

It would be great if you had tests that exactly exercise these "bad" examples, where it looks like we might optimize, but it would be wrong.

Yep, good idea. I've added a few tests to check for those cases (load-store with duplicate offsets and store-load with masks). Thanks @eme64!

dafedafe avatar May 06 '24 15:05 dafedafe

@dafedafe I also scanned quickly over the regression tests. I see at least two aspects missing:

  • No mixed type test for load-store: Use MemorySegment from/intoMmemorySegment. Try something like store a int-vector, and load a float-vector.
  • Mismatched vector length: store a vector of length 4, and load one of length 8.

I think all of these are currently correctly handled by your vect_type checks in the VM code, but it would be good to see that they are covered by regression tests, in case someone messes this up in the future.

eme64 avatar May 07 '24 12:05 eme64

Ah, some more missing cases:

  • Do some store-store and store-load cases where you the first and second are different loads/stores, i.e. one with and one without mask/offsets. E.g. StoreVectorMasked and StoreVectorScatter in a store-store test. Doing the total cross-product is probably too much, but a few examples would be a good start.

eme64 avatar May 07 '24 13:05 eme64

  • No mixed type test for load-store: Use MemorySegment from/intoMmemorySegment. Try something like store a int-vector, and load a float-vector.

It looks as if load/stores that use from/intoMemorySegment with different types apparently don’t create LoadVector nodes. It seems that fromMemorySegment tries to inline the VectorSupport::load intrinsic, but fails as the type of the vector and the inferred type of the underlying memory segment differ: https://github.com/openjdk/jdk/blob/9b742109b196d79cbf712ffd3f64edd1d6497114/src/hotspot/share/opto/vectorIntrinsics.cpp#L1055-L1064

🤔 I'm just wondering if we could get rid of the type check then...

  • Mismatched vector length: store a vector of length 4, and load one of length 8.

I've added tests tests that store and load with different species (SPECIES_64).

  • Do some store-store and store-load cases where you the first and second are different loads/stores, i.e. one with and one without mask/offsets. E.g. StoreVectorMasked and StoreVectorScatter in a store-store test. Doing the total cross-product is probably too much, but a few examples would be a good start.

You're right, there were just very few of them. Added many more.

dafedafe avatar May 13 '24 21:05 dafedafe

It looks as if load/stores that use from/intoMemorySegment with different types apparently don’t create LoadVector nodes. It seems that fromMemorySegment tries to inline the VectorSupport::load intrinsic, but fails as the type of the vector and the inferred type of the underlying memory segment differ:

Ha, that seems to be a bit of an arbitrary (maybe just conservative) restriction. I hope we can lift that in the future. We do not have such restrictions for scalar array load/store. And we can also Auto-Vectorize those. Would be interesting to ask some VectorAPI folks about that.

Because we could allow those to use vectors in the future: I would leave the type checks in for now.

Looking at the tests now!

eme64 avatar May 21 '24 06:05 eme64

Nice, the tests look good, thanks for adding them!

eme64 avatar May 21 '24 06:05 eme64

Because we could allow those to use vectors in the future: I would leave the type checks in for now.

👍 better safe than sorry

dafedafe avatar May 21 '24 13:05 dafedafe

This part in the PR description could be updated: now we return -1 for those that we think are not "comparable".

You're right. Fixed. Thanks a lot for the review @eme64!!

dafedafe avatar May 21 '24 14:05 dafedafe

@dafedafe I recommend that you update the Bug title. It names "offsets", but really you probably want to name "indices and mask", or just "masked and gather/scatter vector operations". And it also is not just about "loads", but also "stores", correct?

eme64 avatar May 22 '24 14:05 eme64

@dafedafe I recommend that you update the Bug title. It names "offsets", but really you probably want to name "indices and mask", or just "masked and gather/scatter vector operations". And it also is not just about "loads", but also "stores", correct?

Yep. I changed the title... and adapted the description as well (offsets -> indices). Thanks @eme64

dafedafe avatar May 22 '24 14:05 dafedafe

/integrate

dafedafe avatar May 24 '24 13:05 dafedafe

Going to push as commit 0c934ff4e2fb53a72ad25a080d956745a5649f9b. Since your change was applied there have been 382 commits pushed to the master branch:

  • c099f14f07260713229cffbe7d23aa8305415a67: 8305457: Implement java.io.IO
  • 6a35311468222f9335b43d548df2ecb80746b389: 8241550: [macOS] SSLSocketImpl/ReuseAddr.java failed due to "BindException: Address already in use"
  • f16265d69b09640b972b7494ad57158dbdc426bb: 8332226: "Invalid package name:" from source launcher
  • 5a2ba952b120394d7cc0d0890619780c1c27a078: 8325841: Remove unused references to vmSymbols.hpp
  • 239c1b33b47de43369673f33d9449e1904477ce0: 8332807: Parallel: Make some APIs in ParMarkBitMap private
  • 9b61a7608efff13fc3685488f3f54a810ec0ac22: 8332615: RISC-V: Support vector unsigned comparison instructions for machines with RVV
  • a71b40478510db3c69696df608fd1b32f41c57f3: 8331398: G1: G1HeapRegionPrinter reclamation events should print the original region type
  • af056c1676dab3b0b35666a8259db60f9bbf824e: 8332106: VerifyError when using switch pattern in this(...) or super(...)
  • da3001daf79bf943d6194d9fd60250d519b9680d: 8331975: Enable case-insensitive check in ccache and keytab entry lookup
  • 424eb60dedb332237b8ec97e9da6bd95442c0083: 8331683: Clean up GetCarrierThread
  • ... and 372 more: https://git.openjdk.org/jdk/compare/174d62652c69e811cf44ab64db575b13a848a728...master

Your commit was automatically rebased without conflicts.

openjdk[bot] avatar May 24 '24 13:05 openjdk[bot]