8291600: [vectorapi] vector cast op check is not always needed for vector mask cast
Recently we found that the performance of "FIRST_NONZERO" for the double type is much worse than for the other types on x86 when `UseAVX=2`. The main reason is that the `VectorCastL2X` op is not supported by the backend when the dst element type is `T_DOUBLE`. This makes the `VectorCast` op check fail before intrinsifying `VectorMask.cast()`, which is used in the "FIRST_NONZERO" Java implementation (see [1]). However, the compiler will not generate the `VectorCast` op for `VectorMask.cast()` if:
- the current platform supports the predicated feature
- the element size (in bytes) of the src and dst types is the same

So the `VectorCast` op check is needless in such cases. To fix this, the patch:
- limits the specified vector cast op check to vectors
- adds the corresponding mask cast op check for `VectorMask.cast()`
- cleans up the unnecessary code

Here is the performance of the "FIRST_NONZERO" benchmark [2] on an x86 machine with `UseAVX=2`:

```
Benchmark                            (size)   Mode  Cnt  Before     After  Units
DoubleMaxVector.FIRST_NONZERO          1024  thrpt   15  49.266  2460.886  ops/ms
DoubleMaxVector.FIRST_NONZEROMasked    1024  thrpt   15  49.554  1892.223  ops/ms
```

[1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/DoubleVector.java#L770
[2] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/DoubleMaxVector.java#L246
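For context, `FIRST_NONZERO` is a lane-wise binary operator: each result lane takes the first operand's lane if it is non-zero, otherwise the second operand's lane. A minimal scalar sketch of that semantics (class and method names here are illustrative, not from the JDK sources):

```java
public class FirstNonzeroSketch {
    // Scalar model of one lane of the Vector API's FIRST_NONZERO operator:
    // keep a if it is non-zero, otherwise fall back to b. In the vectorized
    // Java implementation the "a != 0" comparison produces a long mask that
    // VectorMask.cast() converts to a double mask, which is the cast whose
    // support check this patch relaxes.
    static double firstNonzero(double a, double b) {
        return a != 0.0 ? a : b;
    }

    public static void main(String[] args) {
        System.out.println(firstNonzero(3.0, 5.0)); // 3.0
        System.out.println(firstNonzero(0.0, 5.0)); // 5.0
    }
}
```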
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Issue
- JDK-8291600: [vectorapi] vector cast op check is not always needed for vector mask cast
Reviewing
Using git
Check out this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9737/head:pull/9737
$ git checkout pull/9737
Update a local copy of the PR:
$ git checkout pull/9737
$ git pull https://git.openjdk.org/jdk pull/9737/head
Using Skara CLI tools
Check out this PR locally:
$ git pr checkout 9737
View PR using the GUI difftool:
$ git pr show -t 9737
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9737.diff
👋 Welcome back xgong! A progress list of the required criteria for merging this PR into `master` will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
@XiaohongGong The following label will be automatically applied to this pull request:
- `hotspot-compiler`

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
Webrevs
- 07: Full - Incremental (c3aa861a)
- 06: Full (5a752b82)
- 05: Full - Incremental (cd751795)
- 04: Full - Incremental (d5b1e383)
- 03: Full - Incremental (37455218)
- 02: Full - Incremental (e13c8d57)
- 01: Full - Incremental (349ef765)
- 00: Full (3a06474a)
May I ask if we can use the `VectorMaskCast` nodes for the non-predicated mask casts? As the value of the elements can only be 0 or -1, we can generate better code on AVX, since we don't need to truncate the elements first as in the general cases. Thanks a lot.
Thanks a lot for looking at this PR!
If the element size of the src and dst types is equal, we can use `VectorMaskCast`, which doesn't need to emit any instructions. But for others like `S -> I` or `S -> B`, I'm afraid we need extending or narrowing instructions for non-predicated mask casts.

For example, for a short vector mask A with 128 bits:

```
A: 0000 1111 0000 1111 0000 1111 0000 1111
```

If we want to cast it to an int vector mask B with 256 bits, the result should be extended to:

```
B: 00000000 11111111 00000000 11111111 00000000 11111111 00000000 11111111
```

And if we want to cast it to a byte vector mask C with 64 bits, the result should be narrowed to:

```
C: 00 11 00 11 00 11 00 11
```

And yes, for such cases we can also generate a `VectorMaskCast` node together with the corresponding backend match rules when the element size is different, which I think would be the same code as for `VectorCast`. This makes the mid-end code cleaner, but may need some duplicate match rules. I'm not quite familiar with AVX instructions, so for such cases, does AVX have better instructions than `VectorCast`? If so, I'd like to always use `VectorMaskCast` for all platforms and add the backend rules if needed. Also, considering I'm not familiar with the x86 instructions, the change may need your help. WDYT?
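The widening and narrowing of all-ones mask lanes described above can be sketched with plain Java integer conversions, since a "true" lane is just -1 at every width (the values here are illustrative):

```java
public class MaskLaneSketch {
    public static void main(String[] args) {
        short trueLane = (short) 0xFFFF;  // a 16-bit "true" mask lane
        // Widening S -> I sign-extends, so the lane stays all-ones:
        int widened = trueLane;           // 0xFFFFFFFF
        // Narrowing S -> B keeps the low byte, which is also all-ones:
        byte narrowed = (byte) trueLane;  // 0xFF
        System.out.println(widened);      // -1
        System.out.println(narrowed);     // -1
    }
}
```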
Yes, you are right, the code would be mostly the same, which means we can reuse the existing match rules to additionally match `VectorMaskCast` for those cases. For other cases, in particular narrowing casts to subword types, AVX < 3 does not support truncation casts and only provides saturation cast instructions, so we need to truncate the upper bits ourselves. For example, a cast from `int` to `byte` is done as follows:

```
vpand     dst, src, [external address mask]
vpackusdw dst, dst
vpermq    dst, dst, 0x08
vpackuswb dst, dst
```

For a vector mask cast, we can get rid of the first masking and use the `vpackss` instructions instead, which removes the need to reference memory. Thanks.
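The reason the signed-saturating packs are safe here is that mask lanes are only 0 or -1, and both values are fixed points of signed saturation. A small Java sketch of that invariant, simulating one `vpacksswb`-style lane narrowing (the method name is illustrative):

```java
public class SaturatingPackSketch {
    // Signed-saturating narrow of a 16-bit lane to 8 bits, as vpacksswb
    // does per lane: values are clamped to [-128, 127], not truncated.
    static byte packSS(short v) {
        if (v > Byte.MAX_VALUE) return Byte.MAX_VALUE;
        if (v < Byte.MIN_VALUE) return Byte.MIN_VALUE;
        return (byte) v;
    }

    public static void main(String[] args) {
        // Mask lanes are only 0 or -1, and saturation maps both to
        // themselves, so no pre-masking (vpand) is needed:
        System.out.println(packSS((short) 0));   // 0
        System.out.println(packSS((short) -1));  // -1
    }
}
```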
I see, thanks! So would you like to provide the missing x86 backend implementation for `VectorMaskCast`? If so, we can use `VectorMaskCast` for all cases to simplify the current code. Thanks a lot!
Maybe I can refactor the code in this patch and add the same backend rules as for `VectorCast`? And then you can create a follow-up patch to improve the x86 codegen if you like. WDYT?
@XiaohongGong I have created a PR against your branch, this only contains changes in the x86 backend to avoid any conflicts with the changes you may have made, could you have a look? Thanks
https://github.com/XiaohongGong/jdk/pull/2
Sure, thanks a lot! I will take a look.
Hi @merykitty , thanks for the x86 backend implementation of `VectorMaskCast`. I did a local refactor of the current code and tested it with your backend patch, and no issues were found. I guess there is no regression, and I will run more benchmarks with it. BTW, I'd like to push the new changes to GitHub after https://github.com/openjdk/jdk/pull/9346 is merged, because we have a dependency in the aarch64 backend rules. Thanks!
/contributor add merykitty
@XiaohongGong [merykitty](https://github.com/merykitty) was not found in the census.

Syntax: `/contributor (add|remove) [@user | openjdk-user | Full Name <email@address>]`. For example:
- `/contributor add @openjdk-bot`
- `/contributor add duke`
- `/contributor add J. Duke <[email protected]>`

User names can only be used for users in the census associated with this repository. For other contributors you need to supply the full name and email address.
/contributor add @merykitty
@XiaohongGong @merykitty was not found in the census.
/contributor add qamai
@XiaohongGong Contributor `Quan Anh Mai <[email protected]>` successfully added.
Hi, the new commit refactors the vector mask cast implementation so that we always use the `VectorMaskCast` op instead of the `VectorCast` op on all platforms. The backend implementation of `VectorMaskCast` can be cheaper than `VectorCast` on some architectures such as x86 AVX. Please take a look at the changes. Thanks a lot!
The performance improvement is impressive on x86 with avx2. Can you clarify what tests you did? Thanks.
Hi, thanks for looking at this PR! To test the x86 backend rules, I ran the `test/jdk/jdk/incubator/vector` and `test/hotspot/jtreg/compiler/vectorapi` groups with `-XX:UseAVX=1`, `-XX:UseAVX=2`, and `-XX:UseAVX=3` separately on an AVX-512 machine. For the whole patch, I ran `hotspot:hotspot_all`, `jdk:tier1,tier2,tier3`, and `langtools:tier1` on aarch64 and x86 machines. Thanks!
Hi, the new commit reverted the changes that unified the vector mask cast operation with the `VectorMaskCast` op, so that this patch can focus on fixing the performance issue with `DoubleMaxVector.FIRST_NONZERO`. A follow-up PR will refactor the vector mask casting specifically. Please kindly review and share feedback. Thanks!
Thanks for the update.
May I ask can we do it like this?
```diff
diff --git a/src/hotspot/share/opto/vectorIntrinsics.cpp b/src/hotspot/share/opto/vectorIntrinsics.cpp
index 66bacf0..c4e807a 100644
--- a/src/hotspot/share/opto/vectorIntrinsics.cpp
+++ b/src/hotspot/share/opto/vectorIntrinsics.cpp
@@ -2494,8 +2494,9 @@ bool LibraryCallKit::inline_vector_convert() {
       new_elem_bt_from = elem_bt_from == T_FLOAT ? T_INT : T_LONG;
     }
     int cast_vopc = VectorCastNode::opcode(new_elem_bt_from, !is_ucast);
+    bool no_vec_cast_check = is_mask && (type2aelembytes(elem_bt_from) == type2aelembytes(elem_bt_to));
     // Make sure that cast is implemented to particular type/size combination.
-    if (!arch_supports_vector(cast_vopc, num_elem_to, elem_bt_to, VecMaskNotUsed)) {
+    if (!no_vec_cast_check && !arch_supports_vector(cast_vopc, num_elem_to, elem_bt_to, VecMaskNotUsed)) {
       if (C->print_intrinsics()) {
         tty->print_cr("  ** not supported: arity=1 op=cast#%d/3 vlen2=%d etype2=%s ismask=%d",
                       cast_vopc,
```
Thanks for the advice! The new commit has fixed this issue. Could you please check whether it is ok for you? Thanks!
@XiaohongGong This change now passes all automated pre-integration checks.
ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.
After integration, the commit message for the final commit will be:
8291600: [vectorapi] vector cast op check is not always needed for vector mask cast
Co-authored-by: Quan Anh Mai <[email protected]>
Reviewed-by: jiefu, eliu, jbhateja
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.
At the time when this comment was updated there had been 91 new commits pushed to the `master` branch:
- aff5ff14b208b3c2be93d7b4fab8b07c5be12f3e: 8244681: Add a warning for possibly lossy conversion in compound assignments
- 15cb1fb7885a2fb5d7e51796552bae5ce0708cf5: 8256265: G1: Improve parallelism in regions that failed evacuation
- b31a03c60a14e32304efe15fcd0031a752f4b4ab: 8293695: Implement isInfinite intrinsic for RISC-V
- 8f3bbe950fb5a3d9f6cae122209df01df0f342f0: 8293472: Incorrect container resource limit detection if manual cgroup fs mounts present
- 1caba0f13c42121c9e1c6648715ec7c31349b537: 8292948: JEditorPane ignores font-size styles in external linked css-file
- eeb625e7095e65e64023cbfe05e579af90f4b638: 8290169: adlc: Improve child constraints for vector unary operations
- 2057070eb069ccee20760e47bd5e09590850d2ce: 8293815: P11PSSSignature.engineUpdate should not print debug messages during normal operation
- 7376c55219ce2107afb9197e2452e7122d86ef52: 8293769: RISC-V: Add a second temporary register for BarrierSetAssembler::load_at
- d191e475168f203bb448c4977f8d6d77b8658d25: 8293768: Add links to JLS 19 and 20 from SourceVersion enum constants
- a75ddb836b2de0e75a65dbfa3b2f240db07a7d31: 8293122: (fs) Use file cloning in macOS version of Files::copy method
- ... and 81 more: https://git.openjdk.org/jdk/compare/710a14347344f3cc136f3b7f41aad231fbe43625...master
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.
➡️ To integrate this PR with the above commit message to the `master` branch, type /integrate in a new comment.
LGTM. Thanks for the update.
Thanks so much for the review!
Hi @jatin-bhateja, could you please help to take a look at this PR? Thanks so much for your time!
ping again. Could anyone else please help to take a look at this PR? Thanks a lot!
The GHA failure is not related to this PR. So may I get an approval from someone else? Thanks in advance!