jdk
jdk copied to clipboard
8327652: S390x: Implements SLP support
This PR Adds SIMD support on s390x.
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Issue
- JDK-8327652: S390x: Implements SLP support (Enhancement - P4)
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18162/head:pull/18162
$ git checkout pull/18162
Update a local copy of the PR:
$ git checkout pull/18162
$ git pull https://git.openjdk.org/jdk.git pull/18162/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 18162
View PR using the GUI difftool:
$ git pr show -t 18162
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18162.diff
Webrev
:wave: Welcome back sjayagond! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
@sid8606 The following label will be automatically applied to this pull request:
hotspot
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
Webrevs
- 08: Full - Incremental (3f2af99e)
- 07: Full - Incremental (20a12fcf)
- 06: Full - Incremental (042c3966)
- 05: Full - Incremental (ad91ab30)
- 04: Full - Incremental (25c06768)
- 03: Full - Incremental (3caa470c)
- 02: Full - Incremental (487c57db)
- 01: Full - Incremental (8bb4bc33)
- 00: Full (00b90ab8)
@sid8606 This change is no longer ready for integration - check the PR body for details.
@RealLucy and @TheRealMDoerr please review this PR.
@TheRealMDoerr The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 1 Reviewer, 1 Author).
@TheRealMDoerr I have tested with and without SuperwordUseVX enabled, assembler instructions review is being done by @offamitkumar. Thanks.
I think we shouldn't allow MacroAssembler::string_compress(...) and MacroAssembler::string_expand(...) to use vector registers without specifying this effect. That can be solved by adding a KILL effect for all vector registers which are killed. Alternatively, we could revert to the old implementation before https://github.com/openjdk/jdk/commit/d5adf1df921e5ecb8ff4c7e4349a12660069ed28 which doesn't use vector registers. The benefit was not huge if I remember correctly.
I think we shouldn't allow
MacroAssembler::string_compress(...)andMacroAssembler::string_expand(...)to use vector registers without specifying this effect. That can be solved by adding a KILL effect for all vector registers which are killed. Alternatively, we could revert to the old implementation before d5adf1d which doesn't use vector registers. The benefit was not huge if I remember correctly.
Agreed. My proposed circumvention is a too dirty hack.
I would prefer to add KILL effects to the match rules. I believe the vector implementation had a substantial performance effect. Unfortunately, I can't find any records of performance results from back then.
Reverting the commit @TheRealMDoerr mentioned is not possible. It contains many additions that may have been used by unrelated code. The vector code is well encapsulated and could be removed by deleting the
if (VM_Version::has_VectorFacility()) {
}
block. I would not like that, though.
I think we shouldn't allow
MacroAssembler::string_compress(...)andMacroAssembler::string_expand(...)to use vector registers without specifying this effect. That can be solved by adding a KILL effect for all vector registers which are killed. Alternatively, we could revert to the old implementation before d5adf1d which doesn't use vector registers. The benefit was not huge if I remember correctly.Agreed. My proposed circumvention is a too dirty hack.
I would prefer to add KILL effects to the match rules. I believe the vector implementation had a substantial performance effect. Unfortunately, I can't find any records of performance results from back then.
Reverting the commit @TheRealMDoerr mentioned is not possible. It contains many additions that may have been used by unrelated code. The vector code is well encapsulated and could be removed by deleting the
if (VM_Version::has_VectorFacility()) { }block. I would not like that, though.
I didn't mean to back out the whole commit. Only the implementation of string_compress and string_expand. The benefit of the vector version certainly depends on what kind of strings are used. (Effect may also be negative in some cases.) I think that classical benchmarks didn't show a significant performance impact, but I don't remember exactly, either. I'll leave the s390 maintainers free to decide if they want to adapt the vector version or go for the short and simple implementation.
... I think that classical benchmarks didn't show a significant performance impact, but I don't remember exactly, either. ...
Yes, you need "long" strings (>= 32 characters) for the vector code to kick in. Once it kicks in, there is a performance improvement. Hard to say which applications might benefit. For too short strings I do not expect a visible performance penalty. It's just a shift and a branch.
Let the maintainers decide.
@TheRealMDoerr and @RealLucy Just for my understanding why GPR and FPR doesn't get affected in intrinsic code as they are also allocated outside of register allocator? why only vector registers usage in intrinsic code get affected or am I missing anything here?
@TheRealMDoerr and @RealLucy Just for my understanding why GPR and FPR doesn't get affected in intrinsic code as they are also allocated outside of register allocator? why only vector registers usage in intrinsic code get affected or am I missing anything here?
GPRs and FPRs already have an effect specified in the match rules. (If a GPR or FPR is used by a MachNode without proper specification, it is a critical bug.) Before this PR, it is legal to use VRs without effect because they are not used by register allocation. This is exploited in MacroAssembler::string_compress(...) and MacroAssembler::string_expand(...).
With your change, these 2 intrinsics may overwrite live values! Unfortunately, they use many VRs. So, specifying a KILL effect for all of them may cause the register allocation to insert much spill code. May impact performance and code size.
@sid8606 This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
Still working on addressing intrinsic vector register clobbering. Commenting to keep PR Open.
@sid8606 This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
Commenting to keep PR Open.
@sid8606 This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
Keeping it open.
Hi @RealLucy, @TheRealMDoerr
Can we disable the vector part in the string_{compress, inflate, const_inflate} and integrate this for now. I am working on static stub part and need change from s390.ad file. Moreover I need these changes for Late barrier extension changes in the save live register class implementation.
I agree to disable the vector part in the above mentioned emitters. Please do so by enclosing the code in a
#if 0
. . .
#endif
block. And add a comment why the code was disabled. Add the changed file to this PR. I will then review and approve.
There is another problem, Build is broken with these changes.
If we just pull the changes from this PR and do a build, this is the error I was getting:
/home/amit/jdk/src/jdk.compiler/share/classes/com/sun/tools/javac/comp/Lower.java warning: [dangling-doc-comments] documentation comment is not attached to any declaration
Then I thought maybe any dependency is causing issue and might have pick the latest changes; So I rebased it with head stream and was welcomed by this error:
* For target hotspot_variant-server_libjvm_objs_ad_s390.o:
/home/amit/jdk/src/hotspot/cpu/s390/s390.ad: In member function 'uint MachSpillCopyNode::implementation(C2_MacroAssembler*, PhaseRegAlloc*, bool, outputStream*) const':
/home/amit/jdk/src/hotspot/cpu/s390/s390.ad:1153:11: error: 'cbuf' was not declared in this scope
1153 | if (cbuf) {
| ^~~~
/home/amit/jdk/src/hotspot/cpu/s390/s390.ad:1160:11: error: 'cbuf' was not declared in this scope
1160 | if (cbuf) {
| ^~~~
/home/amit/jdk/src/hotspot/cpu/s390/s390.ad:1167:11: error: 'cbuf' was not declared in this scope
1167 | if (cbuf) {
| ^~~~
/home/amit/jdk/src/hotspot/cpu/s390/s390.ad:1175:11: error: 'cbuf' was not declared in this scope
1175 | if (cbuf) {
| ^~~~
Weird thing is that initially it was building fine with all of the VMs (fastdebug, slowdebug, release).
Hi Amit,
the function prototype now reads uint MachSpillCopyNode::implementation(C2_MacroAssembler *masm, PhaseRegAlloc *ra_, bool do_size, outputStream *os) const {
There is no cbuf argument anymore.
Side note: I would prefer to see If (masm != nullprt) Instead of If (masm)
Thanks, Lutz
From: Amit Kumar @.> Date: Tuesday, 20. August 2024 at 18:26 To: openjdk/jdk @.> Cc: Schmidt, Lutz @.>, Mention @.> Subject: Re: [openjdk/jdk] 8327652: S390x: Implements SLP support (PR #18162)
There is another problem, Build is broken with these changes.
If we just pull the changes from this PR and do a build, this is the error I was getting:
/home/amit/jdk/src/jdk.compiler/share/classes/com/sun/tools/javac/comp/Lower.java warning: [dangling-doc-comments] documentation comment is not attached to any declaration
Then I thought maybe any dependency is causing issue and might have pick the latest changes; So I rebased it with head stream and was welcomed by this error:
- For target hotspot_variant-server_libjvm_objs_ad_s390.o:
/home/amit/jdk/src/hotspot/cpu/s390/s390.ad: In member function 'uint MachSpillCopyNode::implementation(C2_MacroAssembler*, PhaseRegAlloc*, bool, outputStream*) const':
/home/amit/jdk/src/hotspot/cpu/s390/s390.ad:1153:11: error: 'cbuf' was not declared in this scope
1153 | if (cbuf) {
| ^~~~
/home/amit/jdk/src/hotspot/cpu/s390/s390.ad:1160:11: error: 'cbuf' was not declared in this scope
1160 | if (cbuf) {
| ^~~~
/home/amit/jdk/src/hotspot/cpu/s390/s390.ad:1167:11: error: 'cbuf' was not declared in this scope
1167 | if (cbuf) {
| ^~~~
/home/amit/jdk/src/hotspot/cpu/s390/s390.ad:1175:11: error: 'cbuf' was not declared in this scope
1175 | if (cbuf) {
| ^~~~
Weird thing is that initially it was building fine with all of the VMs (fastdebug, slowdebug, release).
— Reply to this email directly, view it on GitHubhttps://github.com/openjdk/jdk/pull/18162#issuecomment-2299265908, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AQ6PCMMURK7MZVQYPHXR4STZSNUZ7AVCNFSM6AAAAABEMGVF5CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEOJZGI3DKOJQHA. You are receiving this because you were mentioned.Message ID: @.***>
Hi @TheRealMDoerr,
I added changes on top of SLP PR: https://github.com/openjdk/jdk/commit/3fd8074a52f3cc8b9c3a8e588dd352f4e6c25a76 , and faced build break with this error:
* For target buildtools_create_symbols_javac__the.COMPILE_CREATE_SYMBOLS_batch:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (/home/amit/jdk/src/hotspot/share/opto/chaitin.cpp:997), pid=3968896, tid=3969099
# assert(lrgmask.is_aligned_sets(RegMask::SlotsPerVecX)) failed: vector should be aligned
#
# JRE version: OpenJDK Runtime Environment (24.0) (fastdebug build 24-internal-adhoc.amit.jdk)
# Java VM: OpenJDK 64-Bit Server VM (fastdebug 24-internal-adhoc.amit.jdk, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-s390x)
# Problematic frame:
# V [libjvm.so+0x5a2d8a] PhaseChaitin::gather_lrg_masks(bool) [clone .constprop.1]+0x175a
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /home/amit/jdk/make/core.3968896)
#
# An error report file with more information is saved as:
# /home/amit/jdk/make/hs_err_pid3968896.log
BT:
Stack: [0x000003ff2f900000,0x000003ff2fd00000], sp=0x000003ff2fcfb7d8, free space=4077k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x5a2d8a] PhaseChaitin::gather_lrg_masks(bool) [clone .constprop.1]+0x175a (chaitin.cpp:997)
V [libjvm.so+0x5aa122] PhaseChaitin::Register_Allocate()+0x1ba (chaitin.cpp:405)
V [libjvm.so+0x6f181e] Compile::Code_Gen()+0x2fe (compile.cpp:2966)
V [libjvm.so+0x6f4154] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1794 (compile.cpp:885)
V [libjvm.so+0x52d32a] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1b2 (c2compiler.cpp:142)
V [libjvm.so+0x7011ae] CompileBroker::invoke_compiler_on_method(CompileTask*)+0xce6 (compileBroker.cpp:2303)
V [libjvm.so+0x701cb2] CompileBroker::compiler_thread_loop()+0x512 (compileBroker.cpp:1961)
V [libjvm.so+0xb649c6] JavaThread::thread_main_inner()+0xfe (javaThread.cpp:758)
V [libjvm.so+0x137066c] Thread::call_run()+0xc4 (thread.cpp:225)
V [libjvm.so+0x109470a] thread_native_entry(Thread*)+0x132 (os_linux.cpp:858)
Do you think using Op_VecX is causing issue here and we should revert https://github.com/openjdk/jdk/pull/18162/commits/3caa470c0f89be306e5b43c5da4ca9e625abfe6b
Please don't revert it. The old code wasn't good. The problem should be understood and fixed. Otherwise, this PR should not get integrated. It's a high risk to integrate buggy vector code. We had tons of crashes on PPC64 with the initial version. Also see https://bugs.openjdk.org/browse/JDK-8188802. Does the CPU you were building on support the vector instructions (SuperwordUseVX and UseSFPV)?
Please don't revert it. The old code wasn't good. The problem should be understood and fixed. Otherwise, this PR should not get integrated. It's a high risk to integrate buggy vector code. We had tons of crashes on PPC64 with the initial version. Also see https://bugs.openjdk.org/browse/JDK-8188802.
Mmm, but C2 is very much designed around 32-bit slots, and the register allocator assumes that every vec is some multiple of 32-bit slots. Here:
// SlotsPerLong is 2, since slots are 32 bits and longs are 64 bits.
// Also, consider the maximum alignment size for a normally allocated
// value. Since we allocate register pairs but not register quads (at
// present), this alignment is SlotsPerLong (== 2). A normally
// aligned allocated register is either a single register, or a pair
// of adjacent registers, the lower-numbered being even.
// See also is_aligned_Pairs() below, and the padding added before
// Matcher::_new_SP to keep allocated pairs aligned properly.
// If we ever go to quad-word allocations, SlotsPerQuad will become
// the controlling alignment constraint. Note that this alignment
// requirement is internal to the allocator, and independent of any
// particular platform.
enum { SlotsPerLong = 2,
SlotsPerVecA = 4,
SlotsPerVecS = 1,
SlotsPerVecD = 2,
SlotsPerVecX = 4,
SlotsPerVecY = 8,
SlotsPerVecZ = 16,
SlotsPerRegVectMask = X86_ONLY(2) NOT_X86(1)
};
This change that removes the subwords from the reg_def is wrong: https://github.com/openjdk/jdk/pull/18162/commits/3caa470c0f89be306e5b43c5da4ca9e625abfe6b . It is, in essence, lying to the register allocator. Sure, you can get away with this as long as you never have to refer directly to individual registers, which is what Amit's patch is doing.
We must ensure that every vector register, as declared, is a multiple of the correct number of slots in order for all this to work correctly.
Thanks for the explanation, @theRealAph! I see aarch64 has a similar implementation. So, I'm ok with reverting. But please don't do a complete backout! At least the formatting changes etc. should be kept.
@theRealAph: I didn't get the "lying to the register allocator" part. SlotsPerVecX is defined to be 4 slots and a slot has 4 Bytes. 16 Bytes is the correct vector width. Note that VecX is also used on PPC64 for example.
Do you mean that we need to multiply the encoding() by 4? Can't that be fixed without reverting most of the new code?
(only max_vr = max_fpr + VectorRegister::number_of_registers * VectorRegister::max_slots_per_register and (encoding() * VectorRegister::max_slots_per_register) + ConcreteRegisterImpl::max_fpr)
@theRealAph: I didn't get the "lying to the register allocator" part.
SlotsPerVecXis defined to be 4 slots and a slot has 4 Bytes. 16 Bytes is the correct vector width. Note that VecX is also used on PPC64 for example. Do you mean that we need to multiply the encoding() by 4? Can't that be fixed without reverting most of the new code? (onlymax_vr = max_fpr + VectorRegister::number_of_registers * VectorRegister::max_slots_per_registerand(encoding() * VectorRegister::max_slots_per_register) + ConcreteRegisterImpl::max_fpr)
No, I mean that we need this:
reg_def Z_VR16 ( SOC, SOC, Op_RegF, 16, Z_V16->as_VMReg() );
reg_def Z_VR16_H ( SOC, SOC, Op_RegF, 16, Z_V16->as_VMReg()->next() );
reg_def Z_VR16_J ( SOC, SOC, Op_RegF, 16, Z_V16->as_VMReg()->next(2) );
reg_def Z_VR16_K ( SOC, SOC, Op_RegF, 16, Z_V16->as_VMReg()->next(3) );
and
reg_class z_v_reg(
// Attention: Only these ones are saved & restored at safepoint by RegisterSaver.
//1st 16 VRs overlaps with 1st 16 FPRs.
Z_VR16, Z_VR16_H, Z_VR16_J, Z_VR16_K,
Because without the extra dummy register declarations, Z_VR16 is only allocated a single VMReg, but it needs to be allocated 4. I have no intention of touching anything else in this commit, just restoring the vector register declarations, which were working just fine.
Ok, so on PPC64 we only have one VMReg per VectorRegister. Seems to work. I don't know why s390 needs the individual ones, but that may be better. Except in terms of resource usage and register allocation performance I guess.