8291600: [vectorapi] vector cast op check is not always needed for vector mask cast
Recently we found that the performance of "FIRST_NONZERO" for the double type is much worse than for the other types on x86 when `UseAVX=2`. The main reason is that the `VectorCastL2X` op is not supported by the backend when the dst element type is `T_DOUBLE`. This makes the `VectorCast` op check fail before intrinsifying `VectorMask.cast()`, which is used in the "FIRST_NONZERO" Java implementation (see [1]). However, the compiler will not generate the `VectorCast` op for `VectorMask.cast()` if:
- the current platform supports the predicated feature
- the element size (in bytes) of the src and dst types is the same

So the `VectorCast` op check is needless in such cases. To fix this, the patch:
- limits the specified vector cast op check to vectors
- adds the corresponding mask cast op check for `VectorMask.cast()`
- cleans up the unnecessary code

Here is the performance of the "FIRST_NONZERO" benchmark [2] on an x86 machine with `UseAVX=2`:

```
Benchmark                            (size)   Mode  Cnt  Before     After  Units
DoubleMaxVector.FIRST_NONZERO          1024  thrpt   15  49.266  2460.886  ops/ms
DoubleMaxVector.FIRST_NONZEROMasked    1024  thrpt   15  49.554  1892.223  ops/ms
```

[1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/DoubleVector.java#L770
[2] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/DoubleMaxVector.java#L246
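For context, `FIRST_NONZERO` is a lane-wise binary operator: each result lane takes the first operand's lane if it is non-zero, otherwise the second operand's lane. A minimal scalar sketch of that semantics (class and method names here are illustrative, not from the JDK sources):

```java
public class FirstNonzeroSketch {
    // Scalar model of one lane of the Vector API's FIRST_NONZERO operator:
    // keep a if it is non-zero, otherwise fall back to b. In the vectorized
    // Java implementation the "a != 0" comparison produces a long mask that
    // VectorMask.cast() converts to a double mask, which is the cast whose
    // support check this patch relaxes.
    static double firstNonzero(double a, double b) {
        return a != 0.0 ? a : b;
    }

    public static void main(String[] args) {
        System.out.println(firstNonzero(3.0, 5.0)); // 3.0
        System.out.println(firstNonzero(0.0, 5.0)); // 5.0
    }
}
```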
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Issue
- JDK-8291600: [vectorapi] vector cast op check is not always needed for vector mask cast
Reviewing
Using git
Check out this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9737/head:pull/9737
$ git checkout pull/9737
Update a local copy of the PR:
$ git checkout pull/9737
$ git pull https://git.openjdk.org/jdk pull/9737/head
Using Skara CLI tools
Check out this PR locally:
$ git pr checkout 9737
View PR using the GUI difftool:
$ git pr show -t 9737
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9737.diff
👋 Welcome back xgong! A progress list of the required criteria for merging this PR into `master` will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
@XiaohongGong The following label will be automatically applied to this pull request:
- `hotspot-compiler`

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
Webrevs
- 07: Full - Incremental (c3aa861a)
- 06: Full (5a752b82)
- 05: Full - Incremental (cd751795)
- 04: Full - Incremental (d5b1e383)
- 03: Full - Incremental (37455218)
- 02: Full - Incremental (e13c8d57)
- 01: Full - Incremental (349ef765)
- 00: Full (3a06474a)
May I ask if we can use the `VectorMaskCast` nodes for the non-predicated mask casts? As the value of the elements can only be 0 or -1, we can generate better code on AVX, since we don't need to truncate the elements first as in the general cases. Thanks a lot.
Thanks a lot for looking at this PR!
If the element size of the src and dst types is equal, we can use `VectorMaskCast`, which doesn't need to emit any instructions. But for others like `S -> I` or `S -> B`, I'm afraid we need extending or narrowing instructions for non-predicated mask casts.

For example, for a short vector mask A with 128 bits:

```
A: 0000 1111 0000 1111 0000 1111 0000 1111
```

If we want to cast it to an int vector mask B with 256 bits, the result should be extended to:

```
B: 00000000 11111111 00000000 11111111 00000000 11111111 00000000 11111111
```

And if we want to cast it to a byte vector mask C with 64 bits, the result should be narrowed to:

```
C: 00 11 00 11 00 11 00 11
```

And yes, for such cases we can also generate a `VectorMaskCast` node together with the corresponding backend match rules when the element size is different, which I think would be the same code as for `VectorCast`. This makes the mid-end code cleaner, but may need some duplicate match rules. I'm not quite familiar with AVX instructions, so for such cases, does AVX have better instructions than `VectorCast`? If so, I'd like to always use `VectorMaskCast` for all platforms and add the backend rules if needed. Also, considering I'm not familiar with the x86 instructions, the change may need your help. WDYT?
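The widening and narrowing of all-ones mask lanes described above can be sketched with plain Java integer conversions, since a "true" lane is just -1 at every width (the values here are illustrative):

```java
public class MaskLaneSketch {
    public static void main(String[] args) {
        short trueLane = (short) 0xFFFF;  // a 16-bit "true" mask lane
        // Widening S -> I sign-extends, so the lane stays all-ones:
        int widened = trueLane;           // 0xFFFFFFFF
        // Narrowing S -> B keeps the low byte, which is also all-ones:
        byte narrowed = (byte) trueLane;  // 0xFF
        System.out.println(widened);      // -1
        System.out.println(narrowed);     // -1
    }
}
```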
Yes, you are right, the code would be mostly the same, which means we can reuse the existing match rules to additionally match `VectorMaskCast` for those cases. For other cases, in particular narrowing casts to subword types, AVX < 3 does not support truncation casts and only provides saturation cast instructions, so we need to truncate the upper bits ourselves. For example, a cast from `int` to `byte` is done as follows:

```
vpand     dst, src, [external address mask]
vpackusdw dst, dst
vpermq    dst, dst, 0x08
vpackuswb dst, dst
```

For a vector mask cast, we can get rid of the first masking and use the `vpackss` instructions instead, which removes the need to reference memory. Thanks.
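The reason the signed-saturating packs are safe here is that mask lanes are only 0 or -1, and both values are fixed points of signed saturation. A small Java sketch of that invariant, simulating one `vpacksswb`-style lane narrowing (the method name is illustrative):

```java
public class SaturatingPackSketch {
    // Signed-saturating narrow of a 16-bit lane to 8 bits, as vpacksswb
    // does per lane: values are clamped to [-128, 127], not truncated.
    static byte packSS(short v) {
        if (v > Byte.MAX_VALUE) return Byte.MAX_VALUE;
        if (v < Byte.MIN_VALUE) return Byte.MIN_VALUE;
        return (byte) v;
    }

    public static void main(String[] args) {
        // Mask lanes are only 0 or -1, and saturation maps both to
        // themselves, so no pre-masking (vpand) is needed:
        System.out.println(packSS((short) 0));   // 0
        System.out.println(packSS((short) -1));  // -1
    }
}
```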
I see, thanks! So would you like to provide the missing x86 backend implementation for `VectorMaskCast`? If so, we can use `VectorMaskCast` for all cases to simplify the current code. Thanks a lot!
Maybe I can refactor the code in this patch and add the same backend rules as for `VectorCast`? And then you can create a follow-up patch to improve the x86 codegen if you like. WDYT?
@XiaohongGong I have created a PR against your branch, this only contains changes in the x86 backend to avoid any conflicts with the changes you may have made, could you have a look? Thanks
https://github.com/XiaohongGong/jdk/pull/2
Sure, thanks a lot! I will take a look.
Hi @merykitty , thanks for the x86 backend implementation of `VectorMaskCast`. I did a local refactor of the current code and tested it with your backend patch, and no issues were found. I guess there is no regression, and I will run more benchmarks with it. BTW, I'd like to push the new changes to GitHub after https://github.com/openjdk/jdk/pull/9346 is merged, because we have a dependency in the aarch64 backend rules. Thanks!
/contributor add merykitty
@XiaohongGong [merykitty](https://github.com/merykitty) was not found in the census.

Syntax: `/contributor (add|remove) [@user | openjdk-user | Full Name <email@address>]`. For example:
- `/contributor add @openjdk-bot`
- `/contributor add duke`
- `/contributor add J. Duke <[email protected]>`

User names can only be used for users in the census associated with this repository. For other contributors you need to supply the full name and email address.
/contributor add @merykitty
@XiaohongGong @merykitty was not found in the census.
/contributor add qamai
@XiaohongGong Contributor `Quan Anh Mai <[email protected]>` successfully added.
Hi, the new commit refactors the vector mask cast implementation so that we always use the `VectorMaskCast` op instead of the `VectorCast` op on all platforms. The backend implementation of `VectorMaskCast` can be cheaper than `VectorCast` on some architectures such as x86 AVX. Please take a look at the changes. Thanks a lot!
The performance improvement is impressive on x86 with avx2. Can you clarify what tests you did? Thanks.
Hi, thanks for looking at this PR! To test the x86 backend rules, I ran the `test/jdk/jdk/incubator/vector` and `test/hotspot/jtreg/compiler/vectorapi` groups with `-XX:UseAVX=1`, `-XX:UseAVX=2`, and `-XX:UseAVX=3` separately on an AVX-512 machine. For the whole patch, I ran `hotspot:hotspot_all`, `jdk:tier1,tier2,tier3`, and `langtools:tier1` on aarch64 and x86 machines. Thanks!
Hi, the new commit reverted the changes that unified the vector mask cast operation with the `VectorMaskCast` op, so that this patch can focus on fixing the performance issue with `DoubleMaxVector.FIRST_NONZERO`. A follow-up PR will refactor the vector mask casting specifically. Please kindly review and share feedback. Thanks!
Thanks for the update.
May I ask can we do it like this?
```diff
diff --git a/src/hotspot/share/opto/vectorIntrinsics.cpp b/src/hotspot/share/opto/vectorIntrinsics.cpp
index 66bacf0..c4e807a 100644
--- a/src/hotspot/share/opto/vectorIntrinsics.cpp
+++ b/src/hotspot/share/opto/vectorIntrinsics.cpp
@@ -2494,8 +2494,9 @@ bool LibraryCallKit::inline_vector_convert() {
       new_elem_bt_from = elem_bt_from == T_FLOAT ? T_INT : T_LONG;
     }
     int cast_vopc = VectorCastNode::opcode(new_elem_bt_from, !is_ucast);
+    bool no_vec_cast_check = is_mask && (type2aelembytes(elem_bt_from) == type2aelembytes(elem_bt_to));
     // Make sure that cast is implemented to particular type/size combination.
-    if (!arch_supports_vector(cast_vopc, num_elem_to, elem_bt_to, VecMaskNotUsed)) {
+    if (!no_vec_cast_check && !arch_supports_vector(cast_vopc, num_elem_to, elem_bt_to, VecMaskNotUsed)) {
       if (C->print_intrinsics()) {
         tty->print_cr("  ** not supported: arity=1 op=cast#%d/3 vlen2=%d etype2=%s ismask=%d",
                       cast_vopc,
```
Thanks for the advice! The new commit has fixed this issue. Could you please check whether it is ok for you? Thanks!
@XiaohongGong This change now passes all automated pre-integration checks.
ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.
After integration, the commit message for the final commit will be:
8291600: [vectorapi] vector cast op check is not always needed for vector mask cast
Co-authored-by: Quan Anh Mai <[email protected]>
Reviewed-by: jiefu, eliu, jbhateja
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.
At the time when this comment was updated there had been 91 new commits pushed to the `master` branch:
- aff5ff14b208b3c2be93d7b4fab8b07c5be12f3e: 8244681: Add a warning for possibly lossy conversion in compound assignments
- 15cb1fb7885a2fb5d7e51796552bae5ce0708cf5: 8256265: G1: Improve parallelism in regions that failed evacuation
- b31a03c60a14e32304efe15fcd0031a752f4b4ab: 8293695: Implement isInfinite intrinsic for RISC-V
- 8f3bbe950fb5a3d9f6cae122209df01df0f342f0: 8293472: Incorrect container resource limit detection if manual cgroup fs mounts present
- 1caba0f13c42121c9e1c6648715ec7c31349b537: 8292948: JEditorPane ignores font-size styles in external linked css-file
- eeb625e7095e65e64023cbfe05e579af90f4b638: 8290169: adlc: Improve child constraints for vector unary operations
- 2057070eb069ccee20760e47bd5e09590850d2ce: 8293815: P11PSSSignature.engineUpdate should not print debug messages during normal operation
- 7376c55219ce2107afb9197e2452e7122d86ef52: 8293769: RISC-V: Add a second temporary register for BarrierSetAssembler::load_at
- d191e475168f203bb448c4977f8d6d77b8658d25: 8293768: Add links to JLS 19 and 20 from SourceVersion enum constants
- a75ddb836b2de0e75a65dbfa3b2f240db07a7d31: 8293122: (fs) Use file cloning in macOS version of Files::copy method
- ... and 81 more: https://git.openjdk.org/jdk/compare/710a14347344f3cc136f3b7f41aad231fbe43625...master
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.
➡️ To integrate this PR with the above commit message to the `master` branch, type /integrate in a new comment.
LGTM. Thanks for the update.
Thanks so much for the review!
Hi @jatin-bhateja, could you please help to take a look at this PR? Thanks so much for your time!
ping again. Could anyone else please help to take a look at this PR? Thanks a lot!
The GHA failure is not related to this PR. So may I get an approval from someone else? Thanks in advance!