8325520: Vector loads with offsets incorrectly compiled
Issue
When loading multiple vectors using offsets or masks (e.g. LongVector::fromArray(LongVector.SPECIES_256, storage, 0, offsets, 0) or LongVector::fromArray(LongVector.SPECIES_256, storage, 0, longMask)) there is an error in the C2-compiled code that makes different vectors be treated as equal even though they are not.
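A minimal sketch of the kind of pattern that goes wrong (hedged: the class, species, array names and values are illustrative assumptions, not the original reproducer from the bug report; it needs --add-modules jdk.incubator.vector):
import jdk.incubator.vector.*;

public class Repro {
    static final VectorSpecies<Long> SPECIES = LongVector.SPECIES_256;

    // A plain store followed by a gather load and a masked load of the same memory.
    // With the bug, C2 could replace the gather/masked loads with the plain stored
    // vector, so the three results would wrongly be treated as equal.
    static void test(long[] storage, int[] indexMap, boolean[] maskBits, long[] out) {
        VectorMask<Long> mask = VectorMask.fromArray(SPECIES, maskBits, 0);
        LongVector full = LongVector.fromArray(SPECIES, storage, 0);
        full.intoArray(storage, 0);                                                   // plain StoreVector
        LongVector gathered = LongVector.fromArray(SPECIES, storage, 0, indexMap, 0); // LoadVectorGather
        LongVector masked = LongVector.fromArray(SPECIES, storage, 0, mask);          // LoadVectorMasked
        gathered.intoArray(out, 0);
        masked.intoArray(out, SPECIES.length());
    }
}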
Causes
On vector-capable platforms, vector loads with masks and offsets (for Long, Integer, Float and Double) create specific nodes in the ideal graph (i.e. LoadVectorGather, LoadVectorMasked, LoadVectorGatherMasked). Vector loads without masks or offsets are mapped to LoadVector nodes instead. The same is true for StoreVectors.
When running GVN loops we can get to the situation where we check if a Load node is preceded by a Store to the same address, so that the Load can be replaced with the input of the Store (in LoadNode::Identity). Here we call
https://github.com/openjdk/jdk/blob/87e864bf21d71daae4e001ec4edbb4ef1f60c36d/src/hotspot/share/opto/memnode.cpp#L1258
where we do an extra type check when dealing with vectors, but we do not check whether masks or offsets interfere.
Similarly, in StoreNode::Identity we first check for a Load followed by a Store of the same address:
https://github.com/openjdk/jdk/blob/87e864bf21d71daae4e001ec4edbb4ef1f60c36d/src/hotspot/share/opto/memnode.cpp#L3509-L3515
but we do not make sure that there are no masks or offsets. A few lines below, we check whether there are two stores of the same value in a row. We need to check for masks and offsets here too, but in this case we can still allow the fold if the masks and offsets of the two vector stores are equivalent.
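As a hedged Java-level illustration of the Load-followed-by-Store pattern (the enclosing class, imports and parameter names are assumed): a plain load followed by a scatter store of the same value to the same base address is not a no-op, because the index map rearranges the lanes.
static void loadThenScatter(int[] a, int[] indexMap) {
    IntVector v = IntVector.fromArray(IntVector.SPECIES_256, a, 0);
    // Lane i is written to a[indexMap[i]], so removing this store just because its
    // value was loaded from the same address would be wrong.
    v.intoArray(a, 0, indexMap, 0);
}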
Solution
To avoid folding Load- and StoreVectors with masks and offsets we add a specific store_Opcode method to LoadVectorGatherNode, LoadVectorMaskedNode and LoadVectorGatherMaskedNode that does not return a store opcode but instead returns the node's own opcode (so that it can never match a store node). In this way, the checks in MemNode::can_see_stored_value
https://github.com/openjdk/jdk/blob/87e864bf21d71daae4e001ec4edbb4ef1f60c36d/src/hotspot/share/opto/memnode.cpp#L1164-L1166
and StoreNode::Identity
https://github.com/openjdk/jdk/blob/87e864bf21d71daae4e001ec4edbb4ef1f60c36d/src/hotspot/share/opto/memnode.cpp#L3509-L3515
will fail if masks or offsets are used. For two stores of the same value we instead check for mask and offset equality.
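As a hedged Java-level illustration of the store-store case that stays foldable (enclosing class and parameter names assumed): two back-to-back vector stores with the same value, address and mask make the first one redundant, which is why the stores' masks and offsets are compared for equality rather than rejecting these cases outright.
static void redundantMaskedStore(int[] a, IntVector v, VectorMask<Integer> m) {
    v.intoArray(a, 0, m);   // made redundant by ...
    v.intoArray(a, 0, m);   // ... this identical masked store (same value, address and mask)
}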
Regression tests for all versions of Load/StoreVectorGather/Masked have been added too.
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Issue
- JDK-8325520: Vector loads with offsets incorrectly compiled (Bug - P3)
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18347/head:pull/18347
$ git checkout pull/18347
Update a local copy of the PR:
$ git checkout pull/18347
$ git pull https://git.openjdk.org/jdk.git pull/18347/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 18347
View PR using the GUI difftool:
$ git pr show -t 18347
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18347.diff
Webrev
👋 Welcome back dfenacci! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
@dafedafe This change now passes all automated pre-integration checks.
ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.
After integration, the commit message for the final commit will be:
8325520: Vector loads and stores with indices and masks incorrectly compiled
Reviewed-by: epeter, thartmann
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.
At the time when this comment was updated there had been 359 new commits pushed to the master branch:
- 612ae9289a130b8701f74253fe5499358a2e2b5b: 8332735: [JVMCI] Add extra JVMCI events for exception translation
- 1ea76d338b99900089277b7a2da82c24382a6ce1: 8332675: test/hotspot/jtreg/gc/testlibrary/Helpers.java compileClass javadoc does not match after 8321812
- 94af3c23ea09ef2869cdc666d8170a655a0b3602: 8329203: Parallel: Investigate Mark-Compact for Full GC to decrease memory usage
- 1e5a2780d9cc8e73ce65bdccb98c1808aadd0784: 8332676: Remove unused BarrierSetAssembler::incr_allocated_bytes
- c2180d141ccca0e396ee9a0cd3044c4428b963d5: 8315767: InetAddress: constructing objects from BSD literal addresses
- 2a11e0da026066191e4d4f30b9daca986c484630: 8332743: Update comment related to JDK-8320522
- 6829d9ac67fb131462d3ef1c4bdfaa07df5d6be6: 8332122: [nmt] Totals for malloc should show total peak
- 9d332e6591334a71335da65a4dd7b2ed0482b6cb: 8307193: Several Swing jtreg tests use class.forName on L&F classes
- 98f6a80852383dcbdad7292b7d269a8547d54d45: 8332490: JMH org.openjdk.bench.java.util.zip.InflaterInputStreams.inflaterInputStreamRead OOM
- 3d4185a9ce482cc655a4c67f39cb2682b02ae4fe: 8332739: Problemlist compiler/codecache/CheckLargePages until JDK-8332654 is fixed
- ... and 349 more: https://git.openjdk.org/jdk/compare/174d62652c69e811cf44ab64db575b13a848a728...master
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.
➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.
@dafedafe The following label will be automatically applied to this pull request:
- hotspot-compiler
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
I see that LoadVectorNode has no additional slots, but then LoadVectorMaskedNode and LoadVectorGatherMaskedNode simply do add_req. Why not just compare these extra slots (as many as there are) in LoadVectorNode::Identity, as well as checking that their Opcode is the same? Would that not be simpler and accomplish the same?
Can you please fix the whitespace issues, it is a bit difficult to read now on GitHub ;)
Sorry, I let them slip in. Fixed.
It seems that your code would now disallow such a case, because you always check that the Ideal node you get back is of the same type as the this node. Am I right about that? Is that intended?
If I am right that you would have made such an optimization impossible, that probably means that our tests don't have an IR test that covers this case. You would definitely need to add such IR tests, otherwise we don't know if we are getting regressions. You will probably also have to run this patch through performance testing eventually.
I've tried to use the IR framework to check for folded/non folded nodes but couldn't make it reliable enough (in the end it didn't test more than the current test). So I decided to go back to the actual "regression" test which reproduces the original issue.
Do you know what is the issue with reliability for the IR rules? Why did it not always work?
I think your checks should also not be done after we already create a new node, but ideally before we create the new node. That is the normal pattern I see everywhere. So then you would probably have to dig into MemNode::can_see_stored_value and other places, to see where you would need to do your additional checks. There are already some checks for vector types there, so that would be one intuitive starting point.
I've removed the Identity overrides and changed the checks in MemNode::can_see_stored_value as you suggested (the Identity methods were also a bit too restrictive). I also had to add checks to StoreNode::Identity here:
https://github.com/openjdk/jdk/blob/7eb78e332080df3890b66ca04338a4ba69af45eb/src/hotspot/share/opto/memnode.cpp#L2799-L2814
If I am right that you would have made such an optimization impossible, that probably means that our tests don't have an IR test that covers this case. You would definitely need to add such IR tests, otherwise we don't know if we are getting regressions. You will probably also have to run this patch through performance testing eventually.
I've transformed the tests to add IR tests as well. The issue with them seems to be related to JDK-8302459 (basically there are some missing cleanups when performing late inlining). So, for now, the tests force a cleanup at every step (-XX:+IncrementalInlineForceCleanup).
@dafedafe Nice, I think this already looks much better. Let me know if/when you want me to look at it again ;)
Thanks a lot @eme64!
Webrevs
- 14: Full - Incremental (0c015fa1)
- 13: Full - Incremental (50a75ffe)
- 12: Full - Incremental (8f341cd6)
- 11: Full - Incremental (df3a49ae)
- 10: Full - Incremental (e676bcb1)
- 09: Full - Incremental (777bf562)
- 08: Full - Incremental (9b742109)
- 07: Full - Incremental (a2cb6a58)
- 06: Full - Incremental (72bf6ca3)
- 05: Full - Incremental (85bb4bef)
- 04: Full - Incremental (524ff888)
- 03: Full - Incremental (d25bcacf)
- 02: Full (b7e5fe02)
- 01: Full - Incremental (2c83876d)
- 00: Full (e331b5c5)
I've randomly thought about this change again (while trying to sleep yesterday...). And started worrying about this kind of case:
import jdk.incubator.vector.*;
public class Test {
private static final VectorSpecies<Integer> I_SPECIES = IntVector.SPECIES_MAX;
public static void main(String[] args) {
// create mask
boolean[] intMask = new boolean[I_SPECIES.length()];
for (int i = 0; i < intMask.length; i++) {
intMask[i] = (i % 2 == 0);
}
int[] intArray1 = new int[I_SPECIES.length()];
int[] intArray2 = new int[I_SPECIES.length()];
int[] intArray3 = new int[I_SPECIES.length()];
for (int i = 0; i < intArray1.length; i++) {
intArray1[i] = i;
intArray2[i] = -2 * i;
intArray3[i] = 0;
}
for (int i = 0; i < 10_000; i++) {
test1(intMask, intArray1, intArray2, intArray3);
}
for (int i = 0; i < intArray1.length; i++) {
System.out.println("i: " + i + " " + intArray1[i] + " " +intArray2[i] + " " + intArray3[i]);
}
}
static void test1(boolean[] intMask, int[] intArray1, int[] intArray2, int[] intArray3) {
VectorMask<Integer> intVectorMask = VectorMask.fromArray(I_SPECIES, intMask, 0);
// Load values: 0 1 2 3 4 5 ...
IntVector a = IntVector.fromArray(I_SPECIES, intArray1, 0);
// Store, but only every second value:
// Store: 0 x 2 x 4 x ...
// Already there: 0 -2 -4 -6 -8 -10 ...
// Result: 0 -2 2 -6 4 -10 ...
a.intoArray(intArray2, 0, intVectorMask);
// Load, but we cannot just take the value from vector a, since we mask it, and where the mask is
// off, we must have zero.
// Load: 0 0 2 0 4 0 ...
IntVector b = IntVector.fromArray(I_SPECIES, intArray2, 0, intVectorMask);
b.intoArray(intArray3, 0);
}
}
This is not a bug yet, I think, but it could be if you do the change I suggested with store_Opcode().
Currently, if we have:
2035 StoreVectorMasked === 934 7 2031 1979 1956 |893 [[ 2034 2041 2066 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact[0] *, idx=7; mismatched Memory: @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact[0] *, idx=7; !jvms: IntVector::intoArray0Template @ bci:50 (line 3556) IntMaxVector::intoArray0 @ bci:9 (line 911) IntVector::intoArray @ bci:53 (line 3269) Test::test1 @ bci:26 (line 34)
2041 LoadVectorMasked === 934 2035 2031 1956 |893 [[ 1767 2066 1750 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact[0] *, idx=7; mismatched #vectorz[16]:{int} !jvms: IntVector::fromArray0Template @ bci:53 (line 3455) IntMaxVector::fromArray0 @ bci:11 (line 874) IntVector::fromArray @ bci:32 (line 2986) Test::test1 @ bci:36 (line 35)
The load does not succeed in MemNode::can_see_stored_value, because the load's store_Opcode() == Op_StoreVector, and not Op_StoreVectorMasked. But if you were to implement LoadVectorMasked::store_Opcode() const { return Op_StoreVectorMasked; }, then you have to be careful:
A masked load does not necessarily return the same as the masked store's input value. That input value is not yet masked, but the loaded value needs to be masked.
But it seems to me that you can actually never have a successful MemNode::can_see_stored_value case for masked operations, with your current code. It would always fail the store_Opcode() == st->Opcode() check. And that gives the correct result, but it is still a bit strange that we don't override the store_Opcode for the masked/offset vector stores.
I don't know which way you want to go now. I see these options:
- Keep disallowing masked load/store "look-throughs".
  - Do that by having the "incorrect" store_Opcode as now. The downside is that the "offset only" case does not manage to do the look-through, even though that would be correct.
  - OR: have the correct store_Opcode, which allows the look-through for the "offset only" case. But then explicitly check for the masked cases, and disallow those.
- Implement a special look-through, where you apply the mask with some blend/select/masked operation on the store input value, which simulates the masked load (i.e. you need to put zeros where the mask is off). A Java-level sketch of this is shown below.
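A hedged Java-level sketch of what the special look-through in the last option amounts to, assuming the load's mask equals the store's mask m, v is the store's (unmasked) input value, and I_SPECIES is the species from the example above:
// The masked load that follows the masked store yields v where m is set and zero
// where it is unset, i.e. a blend of v into a zero vector.
IntVector simulated = IntVector.zero(I_SPECIES).blend(v, m);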
Not sure if this is all very clear, feel free to ask.
@dafedafe this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:
git checkout JDK-8325520
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
⚠️ @dafedafe This pull request contains merges that bring in commits not present in the target repository. Since this is not a "merge style" pull request, these changes will be squashed when this pull request is integrated. If this is your intention, then please ignore this message. If you want to preserve the commit structure, you must change the title of this pull request to Merge <project>:<branch> where <project> is the name of another project in the OpenJDK organization (for example Merge jdk:master).
Yep it is actually very clear, thanks a lot @eme64! Actually your example is with masks, but there is a similar problem with offsets as well. Think of what happens if the offset array is shorter than the length of the array, or if there are duplicate indices in the offset array: a load-store or a store-load sequence does not reproduce the original array or vector (the duplicate-index case is sketched below).
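A hedged sketch of the duplicate-index case (enclosing class, species, arrays and index values are assumed): scattering a vector and gathering it back through the same index map does not reproduce the original vector.
static void duplicateIndices(int[] src, int[] dst) {
    VectorSpecies<Integer> species = IntVector.SPECIES_256;          // 8 int lanes
    int[] indexMap = {0, 0, 1, 2, 3, 4, 5, 6};                       // lanes 0 and 1 both map to dst[0]
    IntVector v = IntVector.fromArray(species, src, 0);
    v.intoArray(dst, 0, indexMap, 0);                                // scatter: dst[0] is written twice
    IntVector w = IntVector.fromArray(species, dst, 0, indexMap, 0); // gather
    // w.lane(0) == w.lane(1), so in general w != v: the gather must not be folded
    // into the value that was scattered.
}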
So actually, to avoid folding in those 2 cases (load->store and store->load), for now I made specific store_Opcode implementations that are never the same as a Store opcode, so that store_Opcode() == st->Opcode() and val->as_Load()->store_Opcode() == Opcode() are always false (I've used the opcode of the LoadNode itself, but possibly there is a better solution).
As you hint, one could also think of a solution involving substituting load/store masked vector nodes with "simple" vector mask nodes and load/store gather/scatter vector nodes with "simple" vector shuffle nodes (if these exist), but I'm not sure how much this would bring, and anyway for this issue I would keep the fix simple. What do you think?
Nice, ah you are right, there can be issues with mask-only cases as well!
It would be great if you had tests that exactly exercise these "bad" examples, where it looks like we might optimize, but it would be wrong.
I'll look at your store_Opcode changes now...
About storing an IntVector to memory, and then loading it as a FloatVector: you can use a MemorySegment:
// Wrap an array into a MemorySegment
MemorySegment ms = MemorySegment.ofArray(new byte[10_000]);
// Create your favourite int vector
IntVector intVector = ...;
// Store that int vector to the memory segment (internally, it does checkIndex and unsafe store to the byte array)
intVector.intoMemorySegment(ms, offset, ByteOrder.nativeOrder());
// Load a float vector from the memory segment (internally, it does checkIndex and unsafe load from the byte array);
// fromMemorySegment needs a species argument, SPECIES_PREFERRED is just an example
FloatVector floatVector = FloatVector.fromMemorySegment(FloatVector.SPECIES_PREFERRED, ms, offset, ByteOrder.nativeOrder());
I did not test this, but I think something like this should work.
Right! I totally missed those from-/intoMemorySegment methods! I guess that's probably one reason why vector nodes are not typed.
I've added checks to StoreNode::Identity as well.
It would be great if you had tests that exactly exercise these "bad" examples, where it looks like we might optimize, but it would be wrong.
Yep, good idea. I've added a few tests to check for those cases (load-store with duplicate offsets and store-load with masks). Thanks @eme64!
@dafedafe I also scanned quickly over the regression tests. I see at least two aspects missing:
- No mixed type test for load-store: use MemorySegment from/intoMemorySegment. Try something like storing an int-vector and loading a float-vector.
- Mismatched vector length: store a vector of length 4, and load one of length 8 (a minimal sketch follows below).
I think all of these are currently correctly handled by your vect_type checks in the VM code, but it would be good to see that they are covered by regression tests, in case someone messes this up in the future.
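For the mismatched-length case, a hedged sketch (enclosing class and array a of sufficient length are assumed): the 8-lane load that follows the 4-lane store of the same memory must not be folded into it.
static void mismatchedLength(int[] a) {
    IntVector four = IntVector.fromArray(IntVector.SPECIES_128, a, 0);
    four.intoArray(a, 0);                                                // store 4 ints
    IntVector eight = IntVector.fromArray(IntVector.SPECIES_256, a, 0); // load 8 ints from the same address
    eight.intoArray(a, 0);
}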
Ah, some more missing cases:
- Do some store-store and store-load cases where the first and second are different loads/stores, i.e. one with and one without mask/offsets. E.g. StoreVectorMasked and StoreVectorScatter in a store-store test. Doing the total cross-product is probably too much, but a few examples would be a good start.
- No mixed type test for load-store: use MemorySegment from/intoMemorySegment. Try something like storing an int-vector and loading a float-vector.
It looks as if load/stores that use from/intoMemorySegment with different types apparently don't create LoadVector nodes. It seems that fromMemorySegment tries to inline the VectorSupport::load intrinsic, but fails as the type of the vector and the inferred type of the underlying memory segment differ:
https://github.com/openjdk/jdk/blob/9b742109b196d79cbf712ffd3f64edd1d6497114/src/hotspot/share/opto/vectorIntrinsics.cpp#L1055-L1064
🤔 I'm just wondering if we could get rid of the type check then...
- Mismatched vector length: store a vector of length 4, and load one of length 8.
I've added tests that store and load with different species (SPECIES_64).
- Do some store-store and store-load cases where the first and second are different loads/stores, i.e. one with and one without mask/offsets. E.g. StoreVectorMasked and StoreVectorScatter in a store-store test. Doing the total cross-product is probably too much, but a few examples would be a good start.
You're right, there were just very few of them. Added many more.
It looks as if load/stores that use from/intoMemorySegment with different types apparently don’t create LoadVector nodes. It seems that fromMemorySegment tries to inline the VectorSupport::load intrinsic, but fails as the type of the vector and the inferred type of the underlying memory segment differ:
Ha, that seems to be a bit of an arbitrary (maybe just conservative) restriction. I hope we can lift that in the future. We do not have such restrictions for scalar array load/store. And we can also Auto-Vectorize those. Would be interesting to ask some VectorAPI folks about that.
Because we could allow those to use vectors in the future: I would leave the type checks in for now.
Looking at the tests now!
Nice, the tests look good, thanks for adding them!
Because we could allow those to use vectors in the future: I would leave the type checks in for now.
👍 better safe than sorry
This part in the PR description could be updated: now we return -1 for those that we think are not "comparable".
You're right. Fixed. Thanks a lot for the review @eme64!!
@dafedafe I recommend that you update the Bug title. It names "offsets", but really you probably want to name "indices and mask", or just "masked and gather/scatter vector operations". And it also is not just about "loads", but also "stores", correct?
Yep. I changed the title... and adapted the description as well (offsets -> indices). Thanks @eme64
/integrate
Going to push as commit 0c934ff4e2fb53a72ad25a080d956745a5649f9b.
Since your change was applied there have been 382 commits pushed to the master branch:
- c099f14f07260713229cffbe7d23aa8305415a67: 8305457: Implement java.io.IO
- 6a35311468222f9335b43d548df2ecb80746b389: 8241550: [macOS] SSLSocketImpl/ReuseAddr.java failed due to "BindException: Address already in use"
- f16265d69b09640b972b7494ad57158dbdc426bb: 8332226: "Invalid package name:" from source launcher
- 5a2ba952b120394d7cc0d0890619780c1c27a078: 8325841: Remove unused references to vmSymbols.hpp
- 239c1b33b47de43369673f33d9449e1904477ce0: 8332807: Parallel: Make some APIs in ParMarkBitMap private
- 9b61a7608efff13fc3685488f3f54a810ec0ac22: 8332615: RISC-V: Support vector unsigned comparison instructions for machines with RVV
- a71b40478510db3c69696df608fd1b32f41c57f3: 8331398: G1: G1HeapRegionPrinter reclamation events should print the original region type
- af056c1676dab3b0b35666a8259db60f9bbf824e: 8332106: VerifyError when using switch pattern in this(...) or super(...)
- da3001daf79bf943d6194d9fd60250d519b9680d: 8331975: Enable case-insensitive check in ccache and keytab entry lookup
- 424eb60dedb332237b8ec97e9da6bd95442c0083: 8331683: Clean up GetCarrierThread
- ... and 372 more: https://git.openjdk.org/jdk/compare/174d62652c69e811cf44ab64db575b13a848a728...master
Your commit was automatically rebased without conflicts.