jdk 8327652: S390x: Implements SLP support

This PR Adds SIMD support on s390x.

Progress

[ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
[x] Change must not contain extraneous whitespace
[x] Commit message must refer to an issue

Issue

JDK-8327652: S390x: Implements SLP support (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18162/head:pull/18162
$ git checkout pull/18162

Update a local copy of the PR:
$ git checkout pull/18162
$ git pull https://git.openjdk.org/jdk.git pull/18162/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 18162

View PR using the GUI difftool:
$ git pr show -t 18162

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18162.diff

Webrev

Link to Webrev Comment

Mar 08 '24 05:03 sid8606

:wave: Welcome back sjayagond! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

Mar 08 '24 05:03 bridgekeeper[bot]

@sid8606 The following label will be automatically applied to this pull request:

hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

Mar 08 '24 05:03 openjdk[bot]

Webrevs

Mar 08 '24 05:03 mlbridge[bot]

@sid8606 This change is no longer ready for integration - check the PR body for details.

Mar 13 '24 20:03 openjdk[bot]

@RealLucy and @TheRealMDoerr please review this PR.

Mar 15 '24 12:03 sid8606

@TheRealMDoerr The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 1 Reviewer, 1 Author).

Mar 22 '24 10:03 openjdk[bot]

@TheRealMDoerr I have tested with and without SuperwordUseVX enabled, assembler instructions review is being done by @offamitkumar. Thanks.

Mar 22 '24 15:03 sid8606

I think we shouldn't allow MacroAssembler::string_compress(...) and MacroAssembler::string_expand(...) to use vector registers without specifying this effect. That can be solved by adding a KILL effect for all vector registers which are killed. Alternatively, we could revert to the old implementation before https://github.com/openjdk/jdk/commit/d5adf1df921e5ecb8ff4c7e4349a12660069ed28 which doesn't use vector registers. The benefit was not huge if I remember correctly.

Mar 26 '24 20:03 TheRealMDoerr

I think we shouldn't allow MacroAssembler::string_compress(...) and MacroAssembler::string_expand(...) to use vector registers without specifying this effect. That can be solved by adding a KILL effect for all vector registers which are killed. Alternatively, we could revert to the old implementation before d5adf1d which doesn't use vector registers. The benefit was not huge if I remember correctly.

Agreed. My proposed circumvention is a too dirty hack.

I would prefer to add KILL effects to the match rules. I believe the vector implementation had a substantial performance effect. Unfortunately, I can't find any records of performance results from back then.

Reverting the commit @TheRealMDoerr mentioned is not possible. It contains many additions that may have been used by unrelated code. The vector code is well encapsulated and could be removed by deleting the

  if (VM_Version::has_VectorFacility()) {
  }

block. I would not like that, though.

Mar 27 '24 11:03 RealLucy

I think we shouldn't allow MacroAssembler::string_compress(...) and MacroAssembler::string_expand(...) to use vector registers without specifying this effect. That can be solved by adding a KILL effect for all vector registers which are killed. Alternatively, we could revert to the old implementation before d5adf1d which doesn't use vector registers. The benefit was not huge if I remember correctly.

Agreed. My proposed circumvention is a too dirty hack.

I would prefer to add KILL effects to the match rules. I believe the vector implementation had a substantial performance effect. Unfortunately, I can't find any records of performance results from back then.

Reverting the commit @TheRealMDoerr mentioned is not possible. It contains many additions that may have been used by unrelated code. The vector code is well encapsulated and could be removed by deleting the
  if (VM_Version::has_VectorFacility()) {
  } 
block. I would not like that, though.

I didn't mean to back out the whole commit. Only the implementation of string_compress and string_expand. The benefit of the vector version certainly depends on what kind of strings are used. (Effect may also be negative in some cases.) I think that classical benchmarks didn't show a significant performance impact, but I don't remember exactly, either. I'll leave the s390 maintainers free to decide if they want to adapt the vector version or go for the short and simple implementation.

Mar 27 '24 13:03 TheRealMDoerr

... I think that classical benchmarks didn't show a significant performance impact, but I don't remember exactly, either. ...

Yes, you need "long" strings (>= 32 characters) for the vector code to kick in. Once it kicks in, there is a performance improvement. Hard to say which applications might benefit. For too short strings I do not expect a visible performance penalty. It's just a shift and a branch.

Let the maintainers decide.

Mar 27 '24 16:03 RealLucy

@TheRealMDoerr and @RealLucy Just for my understanding why GPR and FPR doesn't get affected in intrinsic code as they are also allocated outside of register allocator? why only vector registers usage in intrinsic code get affected or am I missing anything here?

Apr 01 '24 07:04 sid8606

@TheRealMDoerr and @RealLucy Just for my understanding why GPR and FPR doesn't get affected in intrinsic code as they are also allocated outside of register allocator? why only vector registers usage in intrinsic code get affected or am I missing anything here?

GPRs and FPRs already have an effect specified in the match rules. (If a GPR or FPR is used by a MachNode without proper specification, it is a critical bug.) Before this PR, it is legal to use VRs without effect because they are not used by register allocation. This is exploited in MacroAssembler::string_compress(...) and MacroAssembler::string_expand(...).

With your change, these 2 intrinsics may overwrite live values! Unfortunately, they use many VRs. So, specifying a KILL effect for all of them may cause the register allocation to insert much spill code. May impact performance and code size.

Apr 01 '24 09:04 TheRealMDoerr

@sid8606 This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

Apr 29 '24 10:04 bridgekeeper[bot]

Still working on addressing intrinsic vector register clobbering. Commenting to keep PR Open.

May 07 '24 06:05 sid8606

@sid8606 This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

Jun 04 '24 08:06 bridgekeeper[bot]

Commenting to keep PR Open.

Jun 12 '24 02:06 sid8606

@sid8606 This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

Jul 10 '24 07:07 bridgekeeper[bot]

Keeping it open.

Aug 01 '24 03:08 sid8606

Hi @RealLucy, @TheRealMDoerr

Can we disable the vector part in the string_{compress, inflate, const_inflate} and integrate this for now. I am working on static stub part and need change from s390.ad file. Moreover I need these changes for Late barrier extension changes in the save live register class implementation.

Aug 06 '24 04:08 offamitkumar

I agree to disable the vector part in the above mentioned emitters. Please do so by enclosing the code in a

#if 0
 . . . 
#endif

block. And add a comment why the code was disabled. Add the changed file to this PR. I will then review and approve.

Aug 20 '24 14:08 RealLucy

There is another problem, Build is broken with these changes.

If we just pull the changes from this PR and do a build, this is the error I was getting:

/home/amit/jdk/src/jdk.compiler/share/classes/com/sun/tools/javac/comp/Lower.java warning: [dangling-doc-comments] documentation comment is not attached to any declaration

Then I thought maybe any dependency is causing issue and might have pick the latest changes; So I rebased it with head stream and was welcomed by this error:

* For target hotspot_variant-server_libjvm_objs_ad_s390.o:
/home/amit/jdk/src/hotspot/cpu/s390/s390.ad: In member function 'uint MachSpillCopyNode::implementation(C2_MacroAssembler*, PhaseRegAlloc*, bool, outputStream*) const':
/home/amit/jdk/src/hotspot/cpu/s390/s390.ad:1153:11: error: 'cbuf' was not declared in this scope
 1153 |       if (cbuf) {
      |           ^~~~
/home/amit/jdk/src/hotspot/cpu/s390/s390.ad:1160:11: error: 'cbuf' was not declared in this scope
 1160 |       if (cbuf) {
      |           ^~~~
/home/amit/jdk/src/hotspot/cpu/s390/s390.ad:1167:11: error: 'cbuf' was not declared in this scope
 1167 |       if (cbuf) {
      |           ^~~~
/home/amit/jdk/src/hotspot/cpu/s390/s390.ad:1175:11: error: 'cbuf' was not declared in this scope
 1175 |       if (cbuf) {
      |           ^~~~

Weird thing is that initially it was building fine with all of the VMs (fastdebug, slowdebug, release).

Aug 20 '24 16:08 offamitkumar

Hi Amit,

the function prototype now reads uint MachSpillCopyNode::implementation(C2_MacroAssembler *masm, PhaseRegAlloc *ra_, bool do_size, outputStream *os) const {

There is no cbuf argument anymore.

Side note: I would prefer to see If (masm != nullprt) Instead of If (masm)

Thanks, Lutz

From: Amit Kumar @.> Date: Tuesday, 20. August 2024 at 18:26 To: openjdk/jdk @.> Cc: Schmidt, Lutz @.>, Mention @.> Subject: Re: [openjdk/jdk] 8327652: S390x: Implements SLP support (PR #18162)

There is another problem, Build is broken with these changes.

If we just pull the changes from this PR and do a build, this is the error I was getting:

/home/amit/jdk/src/jdk.compiler/share/classes/com/sun/tools/javac/comp/Lower.java warning: [dangling-doc-comments] documentation comment is not attached to any declaration

Then I thought maybe any dependency is causing issue and might have pick the latest changes; So I rebased it with head stream and was welcomed by this error:

For target hotspot_variant-server_libjvm_objs_ad_s390.o:

/home/amit/jdk/src/hotspot/cpu/s390/s390.ad: In member function 'uint MachSpillCopyNode::implementation(C2_MacroAssembler*, PhaseRegAlloc*, bool, outputStream*) const':

/home/amit/jdk/src/hotspot/cpu/s390/s390.ad:1153:11: error: 'cbuf' was not declared in this scope

1153 | if (cbuf) {

  |           ^~~~

/home/amit/jdk/src/hotspot/cpu/s390/s390.ad:1160:11: error: 'cbuf' was not declared in this scope

1160 | if (cbuf) {

  |           ^~~~

/home/amit/jdk/src/hotspot/cpu/s390/s390.ad:1167:11: error: 'cbuf' was not declared in this scope

1167 | if (cbuf) {

  |           ^~~~

/home/amit/jdk/src/hotspot/cpu/s390/s390.ad:1175:11: error: 'cbuf' was not declared in this scope

1175 | if (cbuf) {

  |           ^~~~

Weird thing is that initially it was building fine with all of the VMs (fastdebug, slowdebug, release).

— Reply to this email directly, view it on GitHubhttps://github.com/openjdk/jdk/pull/18162#issuecomment-2299265908, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AQ6PCMMURK7MZVQYPHXR4STZSNUZ7AVCNFSM6AAAAABEMGVF5CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEOJZGI3DKOJQHA. You are receiving this because you were mentioned.Message ID: @.***>

Aug 20 '24 17:08 RealLucy

Hi @TheRealMDoerr,

I added changes on top of SLP PR: https://github.com/openjdk/jdk/commit/3fd8074a52f3cc8b9c3a8e588dd352f4e6c25a76 , and faced build break with this error:

* For target buildtools_create_symbols_javac__the.COMPILE_CREATE_SYMBOLS_batch:
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/home/amit/jdk/src/hotspot/share/opto/chaitin.cpp:997), pid=3968896, tid=3969099
#  assert(lrgmask.is_aligned_sets(RegMask::SlotsPerVecX)) failed: vector should be aligned
#
# JRE version: OpenJDK Runtime Environment (24.0) (fastdebug build 24-internal-adhoc.amit.jdk)
# Java VM: OpenJDK 64-Bit Server VM (fastdebug 24-internal-adhoc.amit.jdk, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-s390x)
# Problematic frame:
# V  [libjvm.so+0x5a2d8a]  PhaseChaitin::gather_lrg_masks(bool) [clone .constprop.1]+0x175a
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /home/amit/jdk/make/core.3968896)
#
# An error report file with more information is saved as:
# /home/amit/jdk/make/hs_err_pid3968896.log

BT:
Stack: [0x000003ff2f900000,0x000003ff2fd00000],  sp=0x000003ff2fcfb7d8,  free space=4077k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x5a2d8a]  PhaseChaitin::gather_lrg_masks(bool) [clone .constprop.1]+0x175a  (chaitin.cpp:997)
V  [libjvm.so+0x5aa122]  PhaseChaitin::Register_Allocate()+0x1ba  (chaitin.cpp:405)
V  [libjvm.so+0x6f181e]  Compile::Code_Gen()+0x2fe  (compile.cpp:2966)
V  [libjvm.so+0x6f4154]  Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1794  (compile.cpp:885)
V  [libjvm.so+0x52d32a]  C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1b2  (c2compiler.cpp:142)
V  [libjvm.so+0x7011ae]  CompileBroker::invoke_compiler_on_method(CompileTask*)+0xce6  (compileBroker.cpp:2303)
V  [libjvm.so+0x701cb2]  CompileBroker::compiler_thread_loop()+0x512  (compileBroker.cpp:1961)
V  [libjvm.so+0xb649c6]  JavaThread::thread_main_inner()+0xfe  (javaThread.cpp:758)
V  [libjvm.so+0x137066c]  Thread::call_run()+0xc4  (thread.cpp:225)
V  [libjvm.so+0x109470a]  thread_native_entry(Thread*)+0x132  (os_linux.cpp:858)

Do you think using Op_VecX is causing issue here and we should revert https://github.com/openjdk/jdk/pull/18162/commits/3caa470c0f89be306e5b43c5da4ca9e625abfe6b

Aug 26 '24 10:08 offamitkumar

Please don't revert it. The old code wasn't good. The problem should be understood and fixed. Otherwise, this PR should not get integrated. It's a high risk to integrate buggy vector code. We had tons of crashes on PPC64 with the initial version. Also see https://bugs.openjdk.org/browse/JDK-8188802. Does the CPU you were building on support the vector instructions (SuperwordUseVX and UseSFPV)?

Aug 26 '24 11:08 TheRealMDoerr

Please don't revert it. The old code wasn't good. The problem should be understood and fixed. Otherwise, this PR should not get integrated. It's a high risk to integrate buggy vector code. We had tons of crashes on PPC64 with the initial version. Also see https://bugs.openjdk.org/browse/JDK-8188802.

Mmm, but C2 is very much designed around 32-bit slots, and the register allocator assumes that every vec is some multiple of 32-bit slots. Here:

  // SlotsPerLong is 2, since slots are 32 bits and longs are 64 bits.
  // Also, consider the maximum alignment size for a normally allocated
  // value.  Since we allocate register pairs but not register quads (at
  // present), this alignment is SlotsPerLong (== 2).  A normally
  // aligned allocated register is either a single register, or a pair
  // of adjacent registers, the lower-numbered being even.
  // See also is_aligned_Pairs() below, and the padding added before
  // Matcher::_new_SP to keep allocated pairs aligned properly.
  // If we ever go to quad-word allocations, SlotsPerQuad will become
  // the controlling alignment constraint.  Note that this alignment
  // requirement is internal to the allocator, and independent of any
  // particular platform.
  enum { SlotsPerLong = 2,
         SlotsPerVecA = 4,
         SlotsPerVecS = 1,
         SlotsPerVecD = 2,
         SlotsPerVecX = 4,
         SlotsPerVecY = 8,
         SlotsPerVecZ = 16,
         SlotsPerRegVectMask = X86_ONLY(2) NOT_X86(1)
         };

This change that removes the subwords from the reg_def is wrong: https://github.com/openjdk/jdk/pull/18162/commits/3caa470c0f89be306e5b43c5da4ca9e625abfe6b . It is, in essence, lying to the register allocator. Sure, you can get away with this as long as you never have to refer directly to individual registers, which is what Amit's patch is doing.

We must ensure that every vector register, as declared, is a multiple of the correct number of slots in order for all this to work correctly.

Aug 26 '24 13:08 theRealAph

Thanks for the explanation, @theRealAph! I see aarch64 has a similar implementation. So, I'm ok with reverting. But please don't do a complete backout! At least the formatting changes etc. should be kept.

Aug 26 '24 14:08 TheRealMDoerr

@theRealAph: I didn't get the "lying to the register allocator" part. SlotsPerVecX is defined to be 4 slots and a slot has 4 Bytes. 16 Bytes is the correct vector width. Note that VecX is also used on PPC64 for example. Do you mean that we need to multiply the encoding() by 4? Can't that be fixed without reverting most of the new code? (only max_vr = max_fpr + VectorRegister::number_of_registers * VectorRegister::max_slots_per_register and (encoding() * VectorRegister::max_slots_per_register) + ConcreteRegisterImpl::max_fpr)

Aug 26 '24 14:08 TheRealMDoerr

@theRealAph: I didn't get the "lying to the register allocator" part. SlotsPerVecX is defined to be 4 slots and a slot has 4 Bytes. 16 Bytes is the correct vector width. Note that VecX is also used on PPC64 for example. Do you mean that we need to multiply the encoding() by 4? Can't that be fixed without reverting most of the new code? (only max_vr = max_fpr + VectorRegister::number_of_registers * VectorRegister::max_slots_per_register and (encoding() * VectorRegister::max_slots_per_register) + ConcreteRegisterImpl::max_fpr)

No, I mean that we need this:

  reg_def Z_VR16   ( SOC, SOC, Op_RegF, 16, Z_V16->as_VMReg()          );
  reg_def Z_VR16_H ( SOC, SOC, Op_RegF, 16, Z_V16->as_VMReg()->next()  );
  reg_def Z_VR16_J ( SOC, SOC, Op_RegF, 16, Z_V16->as_VMReg()->next(2) );
  reg_def Z_VR16_K ( SOC, SOC, Op_RegF, 16, Z_V16->as_VMReg()->next(3) );

and

reg_class z_v_reg(
  // Attention: Only these ones are saved & restored at safepoint by RegisterSaver.
  //1st 16 VRs overlaps with 1st 16 FPRs.
  Z_VR16, Z_VR16_H, Z_VR16_J, Z_VR16_K,

Because without the extra dummy register declarations, Z_VR16 is only allocated a single VMReg, but it needs to be allocated 4. I have no intention of touching anything else in this commit, just restoring the vector register declarations, which were working just fine.

Aug 26 '24 16:08 theRealAph

Ok, so on PPC64 we only have one VMReg per VectorRegister. Seems to work. I don't know why s390 needs the individual ones, but that may be better. Except in terms of resource usage and register allocation performance I guess.

Aug 26 '24 20:08 TheRealMDoerr

jdk jdk copied to clipboard

8327652: S390x: Implements SLP support

Progress

Issue

Reviewing

Webrev

Webrevs

jdk
jdk copied to clipboard