Move nmethod relocInfo data to C heap
Benchmark Mode Cnt Score Error Units
JmhDotty.runOperation ss 13 799.173 ± 8.989 ms/op
Statistics for 366 native nmethods:
N. total size = 214832
N. relocation = 6696
N. main code = 99604
N. oops = 352
N. metadata = 16
Statistics for 21054 bytecoded nmethods for C1:
total size = 111065192 (100%)
in CodeCache = 72805904 (65.552406%)
header = 5221392 (7.171660%)
relocation = 6663072 (9.151829%)
constants = 576 (0.000791%)
main code = 58382064 (80.188637%)
stub code = 5310688 (7.294310%)
oops = 513712 (0.705591%)
metadata = 3377472 (4.639009%)
immutable data = 31596216 (28.448351%)
dependencies = 630760 (1.996315%)
nul chk table = 790424 (2.501641%)
handler table = 181552 (0.574600%)
scopes pcs = 16425728 (51.986378%)
scopes data = 13567752 (42.941067%)
mutable data = 6663072 (5.999244%)
Statistics for 8168 bytecoded nmethods for C2:
total size = 63557016 (100%)
in CodeCache = 27389224 (43.093945%)
header = 2025664 (7.395843%)
relocation = 3363144 (12.279078%)
constants = 704 (0.002570%)
main code = 19860704 (72.512840%)
stub code = 2490304 (9.092277%)
oops = 293760 (1.072539%)
metadata = 2718088 (9.923932%)
immutable data = 32804648 (51.614521%)
dependencies = 872000 (2.658160%)
nul chk table = 519232 (1.582800%)
handler table = 1593032 (4.856117%)
scopes pcs = 16690832 (50.879471%)
scopes data = 13129552 (40.023449%)
mutable data = 3363144 (5.291539%)
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [ ] Commit message must refer to an issue
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/21276/head:pull/21276
$ git checkout pull/21276
Update a local copy of the PR:
$ git checkout pull/21276
$ git pull https://git.openjdk.org/jdk.git pull/21276/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 21276
View PR using the GUI difftool:
$ git pr show -t 21276
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/21276.diff
👋 Welcome back bulasevich! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
@bulasevich This change now passes all automated pre-integration checks.
ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.
After integration, the commit message for the final commit will be:
8343789: Move mutable nmethod data out of CodeCache
Reviewed-by: kvn, dlong
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.
At the time when this comment was updated there had been 41 new commits pushed to the master branch:
- ffa63409884e9a2d41f5223ab5962980edbb008c: 8351567: Jar Manifest test ValueUtf8Coding produces misleading diagnostic output
- 8d8bd0c4b3bfdc7670bba03b01b0a00dac9f9825: 8349492: Update sun/security/pkcs12/KeytoolOpensslInteropTest.java to use a recent Openssl version
- 73465b9866da8e97e557d6ad80a95568ada3ee01: 8160327: Support for thumbnails present in APP1 marker for JPEG
- dbdbbd473061d7e8077ed07ccc6b03065a8c2ffc: 8348597: Update HarfBuzz to 10.4.0
- 7999091e3e976fe62d859d508bf649b6ec7bc94e: 8351555: Help section added in JDK-8350638 uses invalid HTML
- 8450ae902ee012b6447015b24369eee85c091ec5: 8351440: Link with -reproducible on macOS
- b40be22512a8d3b3350fef8d6668d80134a6f1a6: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges
- 6b84bdef3b203e62cebd77705ef5b3e081302c28: 8350007: Add usage message to the javadoc executable
- 32f2c2d80894552b8c5329cfa51c7e836314901f: 8351017: ChronoUnit.MONTHS.between() not giving correct result when date is in February
- d90b79a2bd2f8bb6e50aa32aafe47748ef6ebeff: 8351046: Rename ObjectMonitor functions
- ... and 31 more: https://git.openjdk.org/jdk/compare/cfab88b1a2351a187bc1be153be96ca983a7776c...master
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.
➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.
@bulasevich The following label will be automatically applied to this pull request:
hotspot-compiler
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
@vnkozlov Hi Vladimir, what do you think about the idea of moving relocInfo data out of the nmethod, in addition to the recent "Move immutable nmethod data from CodeCache" change? It would reduce the CodeHeap fill by 5%.
Note, with https://bugs.openjdk.org/browse/JDK-8334691 and other changes I am moving in the direction of making relocation info data immutable. It is already "immutable" in mainline JDK after https://bugs.openjdk.org/browse/JDK-8333819, but it is still mutable in Leyden because we have to patch indexes when publishing an nmethod.
My idea was to move relocation info data (which is large) into the immutable data section of the nmethod, and leave the mutable _oops and _metadata together with the code, since they are relatively small and we need to patch them together with the code.
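As a rough sketch of that split, assuming purely illustrative names (these are not the actual HotSpot nmethod fields):

// Illustrative only: the proposal described above. The large, effectively
// read-only relocation info joins the immutable data kept outside the
// CodeCache, while the small mutable _oops and _metadata sections stay next
// to the code because they are patched together with it.
struct ProposedNMethodSplit {           // hypothetical type
  unsigned char* code_cache_part;       // CodeCache: [header | code | stubs | oops | metadata]
  unsigned char* immutable_data;        // C heap:    [relocation | deps | tables | scopes]
  unsigned int   immutable_data_size;
};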
Mutable sizes % do not add up:
mutable data = 6071648 (9.396180%)
relocation = 3437176 (12.846409%)
oops = 239488 (0.895084%)
metadata = 2394984 (8.951227%)
The main question is: why did you do it only for nmethod?
Yes, I did it symmetrically to the separate immutable data storage for nmethods. Now I see I do not like an implementation where relocation info is local for some blobs but kept aside for others. I am going to rework that.
Any performance effects with this change?
On the AArch64 machine I see a slight improvement on the big benchmark, caused by the improved code sparsity, though I need to do more benchmarking to make sure I am not making things worse for others.
Benchmark Mode Cnt Score Error Units | Benchmark Mode Cnt Score Error Units
JmhDotty.runOperation ss 999 861.717 ± 1.543 ms/op | JmhDotty.runOperation ss 999 840.959 ± 1.473 ms/op
|
34555411781 cache-misses:u | 34343012187 cache-misses:u
2913869717708 cpu-cycles:u | 2863838151745 cpu-cycles:u
4185324759051 instructions:u | 4209616523046 instructions:u
1460914744576 L1-icache-loads:u | 1452066316397 L1-icache-loads:u
97806845375 L1-icache-load-misses:u | 93815390496 L1-icache-load-misses:u
1191854820746 iTLB-loads:u | 1169231847276 iTLB-loads:u
10591067761 iTLB-load-misses:u | 10134696419 iTLB-load-misses:u
838964735227 branch-loads:u | 838353168582 branch-loads:u
25829615231 branch-load-misses:u | 24361474411 branch-load-misses:u
836291984964 br_pred:u | 838153583659 br_pred:u
25733552818 br_mis_pred:u | 24353396612 br_mis_pred:u
562168308 group0-code_sparsity:u | 449848707 group0-code_sparsity:u
Mutable sizes % do not add up:
Thanks. The correct sizes:
Statistics for 21032 bytecoded nmethods for C1:
mutable data = 10488856 (8.688409%)
relocation = 6573064 (62.667118%)
oops = 515680 (4.916456%)
metadata = 3400112 (32.416424%)
Statistics for 8171 bytecoded nmethods for C2:
mutable data = 6572064 (10.118859%)
relocation = 3406216 (51.828709%)
oops = 305912 (4.654733%)
metadata = 2859936 (43.516556%)
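As a quick sanity check, the corrected sub-section sizes now sum exactly to the mutable data size, i.e. these percentages are relative to the mutable data blob rather than to the total nmethod size:
C1: 6573064 + 515680 + 3400112 = 10488856 (62.67% + 4.92% + 32.42% ≈ 100%)
C2: 3406216 + 305912 + 2859936 = 6572064 (51.83% + 4.65% + 43.52% ≈ 100%)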
My idea was to move relocation info data (which is large) into the immutable data section of the nmethod, and leave the mutable _oops and _metadata together with the code, since they are relatively small and we need to patch them together with the code.
Hmm. If relocation info goes to an immutable blob, oops+metadata hardly deserves a separate blob.
Performance update. On an AArch64 machine the CodeCacheStress benchmark shows a 1-2% performance improvement with this change.
Statistics on the CodeCacheStress benchmark with high numberOfClasses-instanceCount-rangeOfClasses parameter values:
- nmethod_count:134000, total_compilation_time: 510460ms
- total allocation time malloc_mutable/malloc_immutable/CodeCache_alloc: 62ms/114ms/6333ms,
- total allocation size (mutable/immutable/nmethod): 64MB/192MB/488MB
-XX:+PrintNMethodStatistics
Statistics for 21032 bytecoded nmethods for C1:
total size = 120722408 (100%)
in CodeCache = 79358808 (65.736603%)
header = 5215936 (6.572599%)
constants = 320 (0.000403%)
main code = 69017912 (86.969444%)
stub code = 5124640 (6.457557%)
mutable data = 10488856 (8.688409%)
relocation = 6573064 (62.667118%)
oops = 515680 (4.916456%)
metadata = 3400112 (32.416424%)
immutable data = 30874744 (25.574991%)
dependencies = 636240 (2.060714%)
nul chk table = 756920 (2.451583%)
handler table = 180456 (0.584478%)
scopes pcs = 16052608 (51.992683%)
scopes data = 13248520 (42.910542%)
Statistics for 8171 bytecoded nmethods for C2:
total size = 64948664 (100%)
in CodeCache = 25580504 (39.385727%)
header = 2026408 (7.921689%)
constants = 448 (0.001751%)
main code = 20925472 (81.802422%)
stub code = 2628176 (10.274137%)
mutable data = 6572064 (10.118859%)
relocation = 3406216 (51.828709%)
oops = 305912 (4.654733%)
metadata = 2859936 (43.516556%)
immutable data = 32796096 (50.495411%)
dependencies = 926992 (2.826532%)
nul chk table = 537024 (1.637463%)
handler table = 1695568 (5.170030%)
scopes pcs = 15451968 (47.115265%)
scopes data = 14184544 (43.250710%)
Webrevs
- 13: Full (bc8c590c)
- 12: Full (56c0cc78)
- 11: Full - Incremental (0dbd5029)
- 10: Full (6c3370be)
- 09: Full - Incremental (f2a9a7b2)
- 08: Full - Incremental (04c1aa06)
- 07: Full - Incremental (40fa9603)
- 06: Full - Incremental (2ee31cac)
- 05: Full - Incremental (6b97993a)
- 04: Full - Incremental (b4c7c24b)
- 03: Full - Incremental (ee697996)
- 02: Full - Incremental (27d27aa3)
- 01: Full - Incremental (f1a9d9a0)
- 00: Full (a358c6bc)
It would be nice to make relocations immutable, but one roadblock is the use of relocInfo::change_reloc_info_for_address() by C1 patching. We would need to separate mutable and immutable relocations, or replace C1 patching with deoptimization, like DEOPTIMIZE_WHEN_PATCHING on aarch64.
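One conceivable shape for such a separation, sketched very roughly below (hypothetical names and types, not HotSpot API): keep the bulk of the relocation stream immutable and record only the entries that C1 patching may rewrite in a small mutable side table.

#include <cstddef>
#include <cstdint>
#include <vector>

// Rough sketch under the above assumption; hypothetical names, not HotSpot API.
struct MutableRelocEntry {
  uint32_t code_offset;   // instruction that C1 patching may rewrite
  uint8_t  reloc_type;    // current relocation type, updated when patched
};

struct RelocationsSketch {
  const uint8_t*                 immutable_stream;  // unchanged after publication
  size_t                         immutable_size;
  std::vector<MutableRelocEntry> patchable;         // small, lives in mutable storage

  // Analogue of changing the reloc info for one address: only the side table
  // is written; the immutable stream is never touched.
  void change_type_at(uint32_t code_offset, uint8_t new_type) {
    for (MutableRelocEntry& e : patchable) {
      if (e.code_offset == code_offset) { e.reloc_type = new_type; return; }
    }
  }
};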
@bulasevich Please do not rebase or force-push to an active PR as it invalidates existing review comments. Note for future reference, the bots always squash all changes into a single commit automatically as part of the integration. See OpenJDK Developers’ Guide for more information.
/reviewers 2
@dean-long The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 1 Reviewer, 1 Author).
I just noticed the old code would have replaced the oop_Relocation with an internal_word_Relocation for the NearCpool==false case. How did that ever work correctly?
Thanks for review!
NearCpool==false was not functioning correctly. I had considered a separate fix for it, but with the current change, the incorrect code is eliminated.
$ ~/jdk-23/bin/java -XX:-NearCpool -XX:+UseShenandoahGC -version
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x0000ffff6f94c190, pid=95263, tid=95264
#
# JRE version: OpenJDK Runtime Environment (23.0+38) (build 23+38)
# Java VM: OpenJDK 64-Bit Server VM (23+38, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, shenandoah gc, linux-aarch64)
# Problematic frame:
# j java.lang.module.ModuleDescriptor.<init>(Ljava/lang/String;Ljava/lang/module/ModuleDescriptor$Version;Ljava/util/Set;Ljava/util/Set;Ljava/util/Set;Ljava/util/Set;Ljava/util/Set;Ljava/util/Set;Ljava/util/Set;Ljava/lang/String;IZ)V+29 java.base
#
# Core dump will be written.
I see this failure in our testing with an internal test:
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (/workspace/open/src/hotspot/share/code/nmethod.cpp:1513), pid=1863674, tid=1863692
# assert(static_cast<int>(_metadata_offset) == reloc_size + oop_size) failed: failed: 21744 != 87280
Current CompileTask:
C2:61370 3693 b 4 org.apache.lucene.analysis.en.KStemData1::<clinit> (27623 bytes)
Stack: [0x00007f3acf3f5000,0x00007f3acf4f5000], sp=0x00007f3acf4f0a50, free space=1006k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x14d9151] nmethod::nmethod(Method*, CompilerType, int, int, int, int, int, unsigned char*, CodeOffsets*, int, DebugInformationRecorder*, Dependencies*, CodeBuffer*, int, OopMapSet*, ExceptionHandlerTable*, ImplicitExceptionTable*, AbstractCompiler*, CompLevel, char*, int, JVMCINMethodData*)+0x981 (nmethod.cpp:1513)
V [libjvm.so+0x14d94df] nmethod::new_nmethod(methodHandle const&, int, int, CodeOffsets*, int, DebugInformationRecorder*, Dependencies*, CodeBuffer*, int, OopMapSet*, ExceptionHandlerTable*, ImplicitExceptionTable*, AbstractCompiler*, CompLevel, char*, int, JVMCINMethodData*)+0x22f (nmethod.cpp:1200)
V [libjvm.so+0x926cb4] ciEnv::register_method(ciMethod*, int, CodeOffsets*, int, CodeBuffer*, int, OopMapSet*, ExceptionHandlerTable*, ImplicitExceptionTable*, AbstractCompiler*, bool, bool, bool, bool, int)+0x4c4 (ciEnv.cpp:1063)
V [libjvm.so+0x157b0d5] PhaseOutput::install_code(ciMethod*, int, AbstractCompiler*, bool, bool)+0x125 (output.cpp:3443)
V [libjvm.so+0xa54d02] Compile::Code_Gen()+0x612 (compile.cpp:3033)
V [libjvm.so+0xa579cf] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1c6f (compile.cpp:885)
V [libjvm.so+0x89f495] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1d5 (c2compiler.cpp:142)
V [libjvm.so+0xa63a88] CompileBroker::invoke_compiler_on_method(CompileTask*)+0x928 (compileBroker.cpp:2319)
V [libjvm.so+0xa647c8] CompileBroker::compiler_thread_loop()+0x528 (compileBroker.cpp:1977)
V [libjvm.so+0xf3478e] JavaThread::thread_main_inner()+0xee (javaThread.cpp:777)
V [libjvm.so+0x1880536] Thread::call_run()+0xb6 (thread.cpp:232)
V [libjvm.so+0x155a188] thread_native_entry(Thread*)+0x128 (os_linux.cpp:860)
Looks like this also triggers with compiler/escapeAnalysis/TestFindInstMemRecursion.java on Linux x64 and -XX:-TieredCompilation -XX:+StressReflectiveCode -XX:-ReduceInitialCardMarks -XX:-ReduceBulkZeroing -XX:-ReduceFieldZeroing. This change needs more testing before integration.
@bulasevich Please do not rebase or force-push to an active PR as it invalidates existing review comments. Note for future reference, the bots always squash all changes into a single commit automatically as part of the integration. See OpenJDK Developers’ Guide for more information.
@TobiHartmann
I see this failure in our testing with an internal test
Thanks for catching that issue! I've addressed it.
This change needs more testing before integration.
Thank you for pointing that out. Here is my test plan for additional validation. Please let me know if you have any suggestions or if adjustments are needed:
Functional Testing:
- Platforms: x86 and AArch64
- Build Configurations: Release, Fastdebug, and Slowdebug
- Java Options: Default and -XX:+UseShenandoahGC
- Tests: jtreg tests covering Hotspot and JDK, tiers 1–3
Performance Testing:
- Benchmarks: DaCapo and Renaissance
- Modes: Default and -XX:+UseShenandoahGC
Sounds good. I'll run some testing as well and will report back once it has passed.
@bulasevich, please wait for my review before integration.
replace C1 patching with deoptimization, like DEOPTIMIZE_WHEN_PATCHING on aarch64.
It might be worth looking at that. My experiments at the time the AArch64 port was done indicated that only 10% of deoptimizations on AArch64 were caused by patching events.
It is possible to generate C1 code for AArch64 that is patchable, but the frequent additional indirections in generated code are worse for performance than the occasional deoptimization.
I see more failures in testing:
compiler/jsr292/CallSiteDepContextTest.java fails on Windows with -XX:+UseZGC:
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (workspace\open\src\hotspot\share\code\dependencies.cpp:2038), pid=109676, tid=68072
# assert(call_site->is_a(vmClasses::CallSite_klass())) failed: sanity
Current thread (0x000002b4343ac600): JavaThread "MainThread" [_thread_in_vm, id=68072, stack(0x000000f121300000,0x000000f121400000) (1024K)]
Stack: [0x000000f121300000,0x000000f121400000], sp=0x000000f1213fe9f0, free space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [jvm.dll+0x601c63] Dependencies::check_call_site_target_value+0x3e3 (dependencies.cpp:2038)
V [jvm.dll+0x60182f] Dependencies::DepStream::check_call_site_dependency+0x6f (dependencies.cpp:2145)
V [jvm.dll+0xd0898d] nmethod::check_dependency_on+0x7d (nmethod.cpp:2853)
V [jvm.dll+0x609a19] DependencyContext::mark_dependent_nmethods+0xb9 (dependencyContext.cpp:74)
V [jvm.dll+0xcd2ff2] MethodHandles::mark_dependent_nmethods+0xa2 (methodHandles.cpp:953)
V [jvm.dll+0xccbf2e] MHN_setCallSiteTargetNormal+0x1ee (methodHandles.cpp:1211)
C 0x000002b02377ebac (no source info available)
Same test also fails with:
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (/workspace/open/src/hotspot/share/oops/compressedKlass.inline.hpp:88), pid=95938, tid=35843
# assert(nk >= _lowest_valid_narrow_klass_id && nk <= _highest_valid_narrow_klass_id) failed: narrowKlass ID out of range (3131947710)
--------------- T H R E A D ---------------
Current thread (0x0000000142865e10): JavaThread "MainThread" [_thread_in_vm, id=35843, stack(0x0000000170c4c000,0x0000000170e4f000) (2060K)]
Stack: [0x0000000170c4c000,0x0000000170e4f000], sp=0x0000000170e4dd00, free space=2055k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.dylib+0x1161294] VMError::report(outputStream*, bool)+0x1aac (compressedKlass.inline.hpp:88)
V [libjvm.dylib+0x1164880] VMError::report_and_die(int, char const*, char const*, char*, Thread*, unsigned char*, void const*, void const*, char const*, int, unsigned long)+0x548
V [libjvm.dylib+0x576b0c] print_error_for_unit_test(char const*, char const*, char*)+0x0
V [libjvm.dylib+0x70a898] CompressedKlassPointers::check_encodable(void const*)+0x0
V [libjvm.dylib+0x585d24] oopDesc::klass() const+0x104
V [libjvm.dylib+0x5afbec] Dependencies::check_call_site_target_value(oop, oop, CallSiteDepChange*)+0xcc
V [libjvm.dylib+0x5b09e4] Dependencies::DepStream::check_call_site_dependency(CallSiteDepChange*)+0x84
V [libjvm.dylib+0xde663c] nmethod::check_dependency_on(DepChange&)+0x50
V [libjvm.dylib+0x5b291c] DependencyContext::mark_dependent_nmethods(DeoptimizationScope*, DepChange&)+0xa4
V [libjvm.dylib+0xda4dfc] MethodHandles::mark_dependent_nmethods(DeoptimizationScope*, Handle, Handle)+0xec
V [libjvm.dylib+0xda7ea4] MHN_setCallSiteTargetNormal+0x304
j java.lang.invoke.MethodHandleNatives.setCallSiteTargetNormal(Ljava/lang/invoke/CallSite;Ljava/lang/invoke/MethodHandle;)V+0 java.base@25-internal
j java.lang.invoke.CallSite.setTargetNormal(Ljava/lang/invoke/MethodHandle;)V+7 java.base@25-internal
j java.lang.invoke.MutableCallSite.setTarget(Ljava/lang/invoke/MethodHandle;)V+2 java.base@25-internal
j compiler.jsr292.CallSiteDepContextTest.testGC(ZZ)V+267
j compiler.jsr292.CallSiteDepContextTest.main([Ljava/lang/String;)V+11
j java.lang.invoke.LambdaForm$DMH+0x00000c0000001c00.invokeStatic(Ljava/lang/Object;Ljava/lang/Object;)V+10 java.base@25-internal
j java.lang.invoke.LambdaForm$MH+0x00000c0000003400.invoke(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+33 java.base@25-internal
j java.lang.invoke.Invokers$Holder.invokeExact_MT(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+20 java.base@25-internal
j jdk.internal.reflect.DirectMethodHandleAccessor.invokeImpl(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+55 java.base@25-internal
j jdk.internal.reflect.DirectMethodHandleAccessor.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+23 java.base@25-internal
j java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+102 java.base@25-internal
j com.sun.javatest.regtest.agent.MainWrapper$MainTask.run()V+134
j java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base@25-internal
j java.lang.Thread.run()V+19 java.base@25-internal
v ~StubRoutines::call_stub 0x00000001160d0180
V [libjvm.dylib+0x88beec] JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x448
V [libjvm.dylib+0x88ac08] JavaCalls::call_virtual(JavaValue*, Klass*, Symbol*, Symbol*, JavaCallArguments*, JavaThread*)+0x1c4
V [libjvm.dylib+0x88ada4] JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, JavaThread*)+0x6c
V [libjvm.dylib+0xa1289c] thread_entry(JavaThread*, JavaThread*)+0x12c
V [libjvm.dylib+0x8c2088] JavaThread::thread_main_inner()+0x1a8
V [libjvm.dylib+0x10ae844] Thread::call_run()+0xf4
V [libjvm.dylib+0xe467f0] thread_native_entry(Thread*)+0x138
C [libsystem_pthread.dylib+0x6f94] _pthread_start+0x88
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j java.lang.invoke.MethodHandleNatives.setCallSiteTargetNormal(Ljava/lang/invoke/CallSite;Ljava/lang/invoke/MethodHandle;)V+0 java.base@25-internal
j java.lang.invoke.CallSite.setTargetNormal(Ljava/lang/invoke/MethodHandle;)V+7 java.base@25-internal
j java.lang.invoke.MutableCallSite.setTarget(Ljava/lang/invoke/MethodHandle;)V+2 java.base@25-internal
j compiler.jsr292.CallSiteDepContextTest.testGC(ZZ)V+267
j compiler.jsr292.CallSiteDepContextTest.main([Ljava/lang/String;)V+11
j java.lang.invoke.LambdaForm$DMH+0x00000c0000001c00.invokeStatic(Ljava/lang/Object;Ljava/lang/Object;)V+10 java.base@25-internal
j java.lang.invoke.LambdaForm$MH+0x00000c0000003400.invoke(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+33 java.base@25-internal
j java.lang.invoke.Invokers$Holder.invokeExact_MT(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+20 java.base@25-internal
j jdk.internal.reflect.DirectMethodHandleAccessor.invokeImpl(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+55 java.base@25-internal
j jdk.internal.reflect.DirectMethodHandleAccessor.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+23 java.base@25-internal
j java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+102 java.base@25-internal
j com.sun.javatest.regtest.agent.MainWrapper$MainTask.run()V+134
j java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base@25-internal
j java.lang.Thread.run()V+19 java.base@25-internal
v ~StubRoutines::call_stub 0x00000001160d0180
Lock stack of current Java thread (top to bottom):
@stefank just pointed out that this could be a regression from JDK-8347564 that just went in.
@stefank just pointed out that this could be a regression from JDK-8347564 that just went in.
I thought that it could have been a regression but after looking closer I don't think it is. However, I do see a problem that the code removes the GC barriers when loading oops. I'll add an inline comment.
When using Shenandoah I am still seeing embedded compressed OOPs in the instruction stream. Is that correct? These form two instructions, like so:
movz(dst, 0xDEAD, 16);
movk(dst, 0xBEEF);
Do you want compressed OOPs to be moved out of CodeCache as well as uncompressed OOPs? If so, you should change loadConNNode in C2.
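For context, a standalone illustration of the difference being discussed, assuming an aarch64 host with GCC or Clang inline asm (this is not HotSpot code): a narrow-oop value embedded in the instruction stream must be patched by rewriting the code, while a value loaded from a data slot can be updated without touching the code.

#include <cstdint>
#include <cstdio>

// Illustration only, not HotSpot code; assumes aarch64 with GCC/Clang inline asm.
static uint32_t g_oop_slot = 0xDEADBEEF;  // hypothetical out-of-code oop slot

int main() {
  uint64_t embedded = 0, loaded = 0;
  // (a) embedded in the instruction stream (movz+movk), as in the generated
  //     code quoted above; updating the value means rewriting the instructions.
  asm volatile("movz %0, #0xdead, lsl #16\n\t"
               "movk %0, #0xbeef"
               : "=r"(embedded));
  // (b) loaded from a data slot; the slot can be patched while the code stays
  //     unchanged.
  asm volatile("ldr %w0, %1" : "=r"(loaded) : "m"(g_oop_slot));
  printf("embedded=%llx loaded=%llx\n",
         (unsigned long long)embedded, (unsigned long long)loaded);
  return 0;
}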
Mailing list message from Andrew Haley on hotspot-compiler-dev:
On 2/7/25 18:33, Boris Ulasevich wrote:
I think ADRP+MOVK is better both in terms of performance and code density.
Good work, you may be right.
Neoverse N1 has a fairly narrow (4 wide) decoder, so I guess it's more likely to be limited by instruction count.
That benchmark isn't valid for GCC on my machine, because its outputs aren't used so GCC doesn't generate code for the asm.
However, if we change the benchmark to actually *do something* with the data (simply add the results together) we get this for movz+movk on Apple M1:
2,332,615,983 cycles:u # 3.135 GHz (95.26%)
18,660,205,348 instructions:u # 8.00 insn per cycle (95.26%)
and this for adrp+movk:
2,563,872,489 cycles:u # 3.057 GHz (96.03%)
14,357,197,644 instructions:u # 5.60 insn per cycle (96.03%)
Here we can see that the M1 is totally front-end limited: 8 ipc is the speed of light on an M1. Nonetheless, the timings are similar, with the win going to movz+movk.
On Neoverse V2, I also see an advantage for adrp+movk:
4162362189 cycles:u # 2.796 GHz
25002398111 instructions:u # 6.01 insn per cycle
3243420864 cycles:u # 2.796 GHz
21002398115 instructions:u # 6.48 insn per cycle
So, looks like adrp+movk has an overall advantage. I'm still somewhat skeptical that this usage really deserves a reloc handler of its own, though, given the usage.
If we do decide to do this, please give the forced movk version of adrp() a new name, and have adrp() call it.
Having said all of that, I'm not sure why we're seeing such different instruction counts for Apple M1 and Neoverse V2; I guess it must be the compiler, but I don't know why, so take all of this with a big pinch of salt. For this really to be valid I guess we'd have to use the exact same binaries.
-- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
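For reference, a minimal sketch of the kind of throughput micro-benchmark being discussed, assuming an aarch64 host with GCC or Clang (this is not the benchmark from the thread). The materialized value is accumulated so the compiler cannot drop the asm, which is the problem described above; the adrp+movk variant would substitute that instruction pair in the asm block.

#include <cstdint>
#include <cstdio>

// Sketch only; assumes aarch64 with GCC/Clang inline asm.
int main() {
  uint64_t acc = 0;
  for (uint64_t i = 0; i < 1000000000ULL; i++) {
    uint64_t x;
    // movz+movk pair materializing a 32-bit constant, as in the generated code
    asm volatile("movz %0, #0xdead, lsl #16\n\t"
                 "movk %0, #0xbeef"
                 : "=r"(x));
    acc += x;  // use the result so the loop body is not optimized away
  }
  printf("%llu\n", (unsigned long long)acc);
  return 0;
}

Running it under something like perf stat -e cycles:u,instructions:u gives counts comparable to those quoted.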
Mailing list message from Andrew Haley on hotspot-compiler-dev:
On 2/8/25 10:38, Andrew Haley wrote:
On 2/7/25 18:33, Boris Ulasevich wrote:
I think ADRP+MOVK is better both in terms of performance and code density.
Good work, you may be right.
Neoverse N1 has a fairly narrow (4 wide) decoder, so I guess it's more likely to be limited by instruction count.
That benchmark isn't valid for GCC on my machine, because its outputs aren't used so GCC doesn't generate code for the asm.
However, if we change the benchmark to actually *do something* with the data (simply add the results together) we get this for movz+movk on Apple M1:
Sorry, I messed up:
2,916,015,951 cycles:u # 3.151 GHz (95.33%)
22,773,874,629 instructions:u # 7.81 insn per cycle (95.33%)
adrp:
3,132,136,323 cycles:u # 3.114 GHz (96.60%)
18,420,464,246 instructions:u # 5.88 insn per cycle (96.60%)
-- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671