jdk icon indicating copy to clipboard operation
jdk copied to clipboard

move nmethods relocInfo data to C heap

Open bulasevich opened this issue 1 year ago • 4 comments

Benchmark              Mode  Cnt    Score   Error  Units
JmhDotty.runOperation    ss   13  799.173 ± 8.989  ms/op

Statistics for 366 native nmethods:
 N. total size  = 214832
 N. relocation  = 6696
 N. main code   = 99604
 N. oops        = 352
 N. metadata    = 16
Statistics for 21054 bytecoded nmethods for C1:
 total size      = 111065192 (100%)
 in CodeCache    = 72805904 (65.552406%)
   header        = 5221392 (7.171660%)
   relocation    = 6663072 (9.151829%)
   constants     = 576 (0.000791%)
   main code     = 58382064 (80.188637%)
   stub code     = 5310688 (7.294310%)
   oops          = 513712 (0.705591%)
   metadata      = 3377472 (4.639009%)
 immutable data  = 31596216 (28.448351%)
   dependencies  = 630760 (1.996315%)
   nul chk table = 790424 (2.501641%)
   handler table = 181552 (0.574600%)
   scopes pcs    = 16425728 (51.986378%)
   scopes data   = 13567752 (42.941067%)
 mutable data    = 6663072 (5.999244%)
Statistics for 8168 bytecoded nmethods for C2:
 total size      = 63557016 (100%)
 in CodeCache    = 27389224 (43.093945%)
   header        = 2025664 (7.395843%)
   relocation    = 3363144 (12.279078%)
   constants     = 704 (0.002570%)
   main code     = 19860704 (72.512840%)
   stub code     = 2490304 (9.092277%)
   oops          = 293760 (1.072539%)
   metadata      = 2718088 (9.923932%)
 immutable data  = 32804648 (51.614521%)
   dependencies  = 872000 (2.658160%)
   nul chk table = 519232 (1.582800%)
   handler table = 1593032 (4.856117%)
   scopes pcs    = 16690832 (50.879471%)
   scopes data   = 13129552 (40.023449%)
 mutable data    = 3363144 (5.291539%)

Progress

  • [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • [x] Change must not contain extraneous whitespace
  • [ ] Commit message must refer to an issue

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/21276/head:pull/21276
$ git checkout pull/21276

Update a local copy of the PR:
$ git checkout pull/21276
$ git pull https://git.openjdk.org/jdk.git pull/21276/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 21276

View PR using the GUI difftool:
$ git pr show -t 21276

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/21276.diff

bulasevich avatar Oct 01 '24 02:10 bulasevich

:wave: Welcome back bulasevich! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Oct 01 '24 02:10 bridgekeeper[bot]

@bulasevich This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8343789: Move mutable nmethod data out of CodeCache

Reviewed-by: kvn, dlong

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 41 new commits pushed to the master branch:

  • ffa63409884e9a2d41f5223ab5962980edbb008c: 8351567: Jar Manifest test ValueUtf8Coding produces misleading diagnostic output
  • 8d8bd0c4b3bfdc7670bba03b01b0a00dac9f9825: 8349492: Update sun/security/pkcs12/KeytoolOpensslInteropTest.java to use a recent Openssl version
  • 73465b9866da8e97e557d6ad80a95568ada3ee01: 8160327: Support for thumbnails present in APP1 marker for JPEG
  • dbdbbd473061d7e8077ed07ccc6b03065a8c2ffc: 8348597: Update HarfBuzz to 10.4.0
  • 7999091e3e976fe62d859d508bf649b6ec7bc94e: 8351555: Help section added in JDK-8350638 uses invalid HTML
  • 8450ae902ee012b6447015b24369eee85c091ec5: 8351440: Link with -reproducible on macOS
  • b40be22512a8d3b3350fef8d6668d80134a6f1a6: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges
  • 6b84bdef3b203e62cebd77705ef5b3e081302c28: 8350007: Add usage message to the javadoc executable
  • 32f2c2d80894552b8c5329cfa51c7e836314901f: 8351017: ChronoUnit.MONTHS.between() not giving correct result when date is in February
  • d90b79a2bd2f8bb6e50aa32aafe47748ef6ebeff: 8351046: Rename ObjectMonitor functions
  • ... and 31 more: https://git.openjdk.org/jdk/compare/cfab88b1a2351a187bc1be153be96ca983a7776c...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

openjdk[bot] avatar Oct 01 '24 02:10 openjdk[bot]

@bulasevich The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Oct 01 '24 02:10 openjdk[bot]

@vnkozlov Hi Vladimir, What do you think about the idea of ​​moving relocInfo data out of nmethod additionally to recent Move immutable nmethod data from CodeCache? It would reduce the CodeHeap fill by 5%.

bulasevich avatar Oct 03 '24 15:10 bulasevich

Note, with https://bugs.openjdk.org/browse/JDK-8334691 and other changes I moving into direction to make relocation info data immutable. It is already "immutable" in mainline JDK after https://bugs.openjdk.org/browse/JDK-8333819. But it is still mutable in Leyden because we have to patch indexes during publishing nmethod.

My idea was to move relocation info data (which has big size) into immutable data section of nmethod. And leave mutable _oops and _metadata together with code since they are relatively small and we need to patch them together with code.

vnkozlov avatar Nov 07 '24 23:11 vnkozlov

Mutable sizes % do not add up:

mutable data    = 6071648 (9.396180%)
   relocation    = 3437176 (12.846409%)
   oops          = 239488 (0.895084%)
   metadata      = 2394984 (8.951227%)

vnkozlov avatar Nov 07 '24 23:11 vnkozlov

Main question is: why you did it only for nmethod?

Yes. I did it symmetrically to a separate immutable data storage for nmethod. Now I see I do not like implementation where for some blobs relocation info is local, but for other it stays aside. I am going to rework that.

any performance effects with this change?

On the aarch machine I see a slight improvement on the big benchmark caused by the code sparsity improvement. Though I need to do more benchmarking to make sure I am not making things worse for others.

Benchmark              Mode  Cnt    Score   Error  Units  |  Benchmark              Mode  Cnt    Score   Error  Units
JmhDotty.runOperation    ss  999  861.717 ± 1.543  ms/op  |  JmhDotty.runOperation    ss  999  840.959 ± 1.473  ms/op
                                                          |
       34555411781      cache-misses:u                    |         34343012187      cache-misses:u
     2913869717708      cpu-cycles:u                      |       2863838151745      cpu-cycles:u
     4185324759051      instructions:u                    |       4209616523046      instructions:u            
     1460914744576      L1-icache-loads:u                 |       1452066316397      L1-icache-loads:u
       97806845375      L1-icache-load-misses:u           |         93815390496      L1-icache-load-misses:u   
     1191854820746      iTLB-loads:u                      |       1169231847276      iTLB-loads:u
       10591067761      iTLB-load-misses:u                |         10134696419      iTLB-load-misses:u        
      838964735227      branch-loads:u                    |        838353168582      branch-loads:u
       25829615231      branch-load-misses:u              |         24361474411      branch-load-misses:u
      836291984964      br_pred:u                         |        838153583659      br_pred:u
       25733552818      br_mis_pred:u                     |         24353396612      br_mis_pred:u
         562168308      group0-code_sparsity:u            |           449848707      group0-code_sparsity:u

bulasevich avatar Nov 09 '24 11:11 bulasevich

Mutable sizes % do not add up:

Thanks. The correct sizes:

Statistics for 21032 bytecoded nmethods for C1:
 mutable data    = 10488856 (8.688409%)
   relocation    = 6573064 (62.667118%)
   oops          = 515680 (4.916456%)
   metadata      = 3400112 (32.416424%)
Statistics for 8171 bytecoded nmethods for C2:
 mutable data    = 6572064 (10.118859%)
   relocation    = 3406216 (51.828709%)
   oops          = 305912 (4.654733%)
   metadata      = 2859936 (43.516556%)

bulasevich avatar Nov 09 '24 11:11 bulasevich

My idea was to move relocation info data (which has big size) into immutable data section of nmethod. And leave mutable _oops and _metadata together with code since they are relatively small and we need to patch them together with code.

Hmm. If relocation info goes to an immutable blob, oops+metadata hardly deserves a separate blob.

bulasevich avatar Nov 09 '24 11:11 bulasevich

Performance update. On an aarch machine the CodeCacheStress benchmark shows a 1-2% performance improvement with this change,

Statistics on the CodeCacheStress benchmark with high numberOfClasses-instanceCount-rangeOfClasses parameter values:

  • nmethod_count:134000, total_compilation_time: 510460ms
  • total allocation time malloc_mutable/malloc_immutable/CodeCache_alloc: 62ms/114ms/6333ms,
  • total allocation size (mutable/immutable/nmentod): 64MB/192MB/488MB

bulasevich avatar Nov 14 '24 13:11 bulasevich

-XX:+PrintNMethodStatistics

Statistics for 21032 bytecoded nmethods for C1:
 total size      = 120722408 (100%)
 in CodeCache    = 79358808 (65.736603%)
   header        = 5215936 (6.572599%)
   constants     = 320 (0.000403%)
   main code     = 69017912 (86.969444%)
   stub code     = 5124640 (6.457557%)
 mutable data    = 10488856 (8.688409%)
   relocation    = 6573064 (62.667118%)
   oops          = 515680 (4.916456%)
   metadata      = 3400112 (32.416424%)
 immutable data  = 30874744 (25.574991%)
   dependencies  = 636240 (2.060714%)
   nul chk table = 756920 (2.451583%)
   handler table = 180456 (0.584478%)
   scopes pcs    = 16052608 (51.992683%)
   scopes data   = 13248520 (42.910542%)
Statistics for 8171 bytecoded nmethods for C2:
 total size      = 64948664 (100%)
 in CodeCache    = 25580504 (39.385727%)
   header        = 2026408 (7.921689%)
   constants     = 448 (0.001751%)
   main code     = 20925472 (81.802422%)
   stub code     = 2628176 (10.274137%)
 mutable data    = 6572064 (10.118859%)
   relocation    = 3406216 (51.828709%)
   oops          = 305912 (4.654733%)
   metadata      = 2859936 (43.516556%)
 immutable data  = 32796096 (50.495411%)
   dependencies  = 926992 (2.826532%)
   nul chk table = 537024 (1.637463%)
   handler table = 1695568 (5.170030%)
   scopes pcs    = 15451968 (47.115265%)
   scopes data   = 14184544 (43.250710%)

bulasevich avatar Nov 21 '24 13:11 bulasevich

It would be nice to make relocations immutable, but one roadblock is the use of relocInfo::change_reloc_info_for_address() by C1 patching. We would need to separate mutable and immutable relocations, or replace C1 patching with deoptimization, like on DEOPTIMIZE_WHEN_PATCHING aarch64.

dean-long avatar Nov 22 '24 00:11 dean-long

@bulasevich Please do not rebase or force-push to an active PR as it invalidates existing review comments. Note for future reference, the bots always squash all changes into a single commit automatically as part of the integration. See OpenJDK Developers’ Guide for more information.

openjdk[bot] avatar Dec 11 '24 13:12 openjdk[bot]

/reviewers 2

dean-long avatar Jan 09 '25 01:01 dean-long

@dean-long The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 1 Reviewer, 1 Author).

openjdk[bot] avatar Jan 09 '25 01:01 openjdk[bot]

I just noticed the old code would have replaced the oop_Relocation with a internal_word_Relocation for the NearCpool==false case. How did that ever work correctly?

Thanks for review!

NearCpool==false was not functioning correctly. I had considered a separate fix for it, but with the current change, the incorrect code is eliminated.

$ ~/jdk-23/bin/java -XX:-NearCpool -XX:+UseShenandoahGC -version
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000ffff6f94c190, pid=95263, tid=95264
#
# JRE version: OpenJDK Runtime Environment (23.0+38) (build 23+38)
# Java VM: OpenJDK 64-Bit Server VM (23+38, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, shenandoah gc, linux-aarch64)
# Problematic frame:
# j  java.lang.module.ModuleDescriptor.<init>(Ljava/lang/String;Ljava/lang/module/ModuleDescriptor$Version;Ljava/util/Set;Ljava/util/Set;Ljava/util/Set;Ljava/util/Set;Ljava/util/Set;Ljava/util/Set;Ljava/util/Set;Ljava/lang/String;IZ)V+29 java.base
#
# Core dump will be written.

bulasevich avatar Jan 10 '25 11:01 bulasevich

I see this failure in our testing with an internal test:

# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/workspace/open/src/hotspot/share/code/nmethod.cpp:1513), pid=1863674, tid=1863692
#  assert(static_cast<int>(_metadata_offset) == reloc_size + oop_size) failed: failed: 21744 != 87280

Current CompileTask:
C2:61370 3693    b  4       org.apache.lucene.analysis.en.KStemData1::<clinit> (27623 bytes)

Stack: [0x00007f3acf3f5000,0x00007f3acf4f5000],  sp=0x00007f3acf4f0a50,  free space=1006k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x14d9151]  nmethod::nmethod(Method*, CompilerType, int, int, int, int, int, unsigned char*, CodeOffsets*, int, DebugInformationRecorder*, Dependencies*, CodeBuffer*, int, OopMapSet*, ExceptionHandlerTable*, ImplicitExceptionTable*, AbstractCompiler*, CompLevel, char*, int, JVMCINMethodData*)+0x981  (nmethod.cpp:1513)
V  [libjvm.so+0x14d94df]  nmethod::new_nmethod(methodHandle const&, int, int, CodeOffsets*, int, DebugInformationRecorder*, Dependencies*, CodeBuffer*, int, OopMapSet*, ExceptionHandlerTable*, ImplicitExceptionTable*, AbstractCompiler*, CompLevel, char*, int, JVMCINMethodData*)+0x22f  (nmethod.cpp:1200)
V  [libjvm.so+0x926cb4]  ciEnv::register_method(ciMethod*, int, CodeOffsets*, int, CodeBuffer*, int, OopMapSet*, ExceptionHandlerTable*, ImplicitExceptionTable*, AbstractCompiler*, bool, bool, bool, bool, int)+0x4c4  (ciEnv.cpp:1063)
V  [libjvm.so+0x157b0d5]  PhaseOutput::install_code(ciMethod*, int, AbstractCompiler*, bool, bool)+0x125  (output.cpp:3443)
V  [libjvm.so+0xa54d02]  Compile::Code_Gen()+0x612  (compile.cpp:3033)
V  [libjvm.so+0xa579cf]  Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1c6f  (compile.cpp:885)
V  [libjvm.so+0x89f495]  C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1d5  (c2compiler.cpp:142)
V  [libjvm.so+0xa63a88]  CompileBroker::invoke_compiler_on_method(CompileTask*)+0x928  (compileBroker.cpp:2319)
V  [libjvm.so+0xa647c8]  CompileBroker::compiler_thread_loop()+0x528  (compileBroker.cpp:1977)
V  [libjvm.so+0xf3478e]  JavaThread::thread_main_inner()+0xee  (javaThread.cpp:777)
V  [libjvm.so+0x1880536]  Thread::call_run()+0xb6  (thread.cpp:232)
V  [libjvm.so+0x155a188]  thread_native_entry(Thread*)+0x128  (os_linux.cpp:860)

TobiHartmann avatar Jan 14 '25 09:01 TobiHartmann

Looks like this also triggers with compiler/escapeAnalysis/TestFindInstMemRecursion.java on Linux x64 and -XX:-TieredCompilation -XX:+StressReflectiveCode -XX:-ReduceInitialCardMarks -XX:-ReduceBulkZeroing -XX:-ReduceFieldZeroing. This change needs more testing before integration.

TobiHartmann avatar Jan 14 '25 09:01 TobiHartmann

@bulasevich Please do not rebase or force-push to an active PR as it invalidates existing review comments. Note for future reference, the bots always squash all changes into a single commit automatically as part of the integration. See OpenJDK Developers’ Guide for more information.

openjdk[bot] avatar Jan 15 '25 10:01 openjdk[bot]

@TobiHartmann

I see this failure in our testing with an internal test

Thanks for catching that issue! I've addressed it.

This change needs more testing before integration.

Thank you for pointing that out. Here is my test plan for additional validation. Please let me know if you have any suggestions or if adjustments are needed:

Functional Testing:

  • Platforms: x86 and AArch64
  • Build Configurations: Release, Fastdebug, and Slowdebug
  • Java Options: Default and -XX:+UseShenandoahGC
  • Tests: jtreg tests covering Hotspot and JDK, tiers 1–3

Performance Testing:

  • Benchmarks: DaCapo and Renaissance
  • Modes: Default and -XX:+UseShenandoahGC

bulasevich avatar Jan 28 '25 12:01 bulasevich

Sounds good. I'll run some testing as well and will report back once it passed.

TobiHartmann avatar Jan 31 '25 09:01 TobiHartmann

@bulasevich, please wait my review before integration.

vnkozlov avatar Jan 31 '25 19:01 vnkozlov

replace C1 patching with deoptimization, like on DEOPTIMIZE_WHEN_PATCHING aarch64.

It might be worth looking at that. My experiments at the time the AArch64 port was done indicated that only 10% of deoptimization on AArch64 was caused by patching events.

It is possible to generate C1 code for AArch64 that is patchable, but the frequent additional indirections in generated code are worse for performance than the occasional deoptimization.

theRealAph avatar Feb 02 '25 17:02 theRealAph

I see more failures in testing:

compiler/jsr292/CallSiteDepContextTest.java fails on Windows with -XX:+UseZGC:

# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (workspace\open\src\hotspot\share\code\dependencies.cpp:2038), pid=109676, tid=68072
#  assert(call_site->is_a(vmClasses::CallSite_klass())) failed: sanity

Current thread (0x000002b4343ac600):  JavaThread "MainThread"        [_thread_in_vm, id=68072, stack(0x000000f121300000,0x000000f121400000) (1024K)]

Stack: [0x000000f121300000,0x000000f121400000],  sp=0x000000f1213fe9f0,  free space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [jvm.dll+0x601c63]  Dependencies::check_call_site_target_value+0x3e3  (dependencies.cpp:2038)
V  [jvm.dll+0x60182f]  Dependencies::DepStream::check_call_site_dependency+0x6f  (dependencies.cpp:2145)
V  [jvm.dll+0xd0898d]  nmethod::check_dependency_on+0x7d  (nmethod.cpp:2853)
V  [jvm.dll+0x609a19]  DependencyContext::mark_dependent_nmethods+0xb9  (dependencyContext.cpp:74)
V  [jvm.dll+0xcd2ff2]  MethodHandles::mark_dependent_nmethods+0xa2  (methodHandles.cpp:953)
V  [jvm.dll+0xccbf2e]  MHN_setCallSiteTargetNormal+0x1ee  (methodHandles.cpp:1211)
C  0x000002b02377ebac  (no source info available)

Same test also fails with:

# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/workspace/open/src/hotspot/share/oops/compressedKlass.inline.hpp:88), pid=95938, tid=35843
#  assert(nk >= _lowest_valid_narrow_klass_id && nk <= _highest_valid_narrow_klass_id) failed: narrowKlass ID out of range (3131947710)

---------------  T H R E A D  ---------------

Current thread (0x0000000142865e10):  JavaThread "MainThread"        [_thread_in_vm, id=35843, stack(0x0000000170c4c000,0x0000000170e4f000) (2060K)]

Stack: [0x0000000170c4c000,0x0000000170e4f000],  sp=0x0000000170e4dd00,  free space=2055k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.dylib+0x1161294]  VMError::report(outputStream*, bool)+0x1aac  (compressedKlass.inline.hpp:88)
V  [libjvm.dylib+0x1164880]  VMError::report_and_die(int, char const*, char const*, char*, Thread*, unsigned char*, void const*, void const*, char const*, int, unsigned long)+0x548
V  [libjvm.dylib+0x576b0c]  print_error_for_unit_test(char const*, char const*, char*)+0x0
V  [libjvm.dylib+0x70a898]  CompressedKlassPointers::check_encodable(void const*)+0x0
V  [libjvm.dylib+0x585d24]  oopDesc::klass() const+0x104
V  [libjvm.dylib+0x5afbec]  Dependencies::check_call_site_target_value(oop, oop, CallSiteDepChange*)+0xcc
V  [libjvm.dylib+0x5b09e4]  Dependencies::DepStream::check_call_site_dependency(CallSiteDepChange*)+0x84
V  [libjvm.dylib+0xde663c]  nmethod::check_dependency_on(DepChange&)+0x50
V  [libjvm.dylib+0x5b291c]  DependencyContext::mark_dependent_nmethods(DeoptimizationScope*, DepChange&)+0xa4
V  [libjvm.dylib+0xda4dfc]  MethodHandles::mark_dependent_nmethods(DeoptimizationScope*, Handle, Handle)+0xec
V  [libjvm.dylib+0xda7ea4]  MHN_setCallSiteTargetNormal+0x304
j  java.lang.invoke.MethodHandleNatives.setCallSiteTargetNormal(Ljava/lang/invoke/CallSite;Ljava/lang/invoke/MethodHandle;)V+0 java.base@25-internal
j  java.lang.invoke.CallSite.setTargetNormal(Ljava/lang/invoke/MethodHandle;)V+7 java.base@25-internal
j  java.lang.invoke.MutableCallSite.setTarget(Ljava/lang/invoke/MethodHandle;)V+2 java.base@25-internal
j  compiler.jsr292.CallSiteDepContextTest.testGC(ZZ)V+267
j  compiler.jsr292.CallSiteDepContextTest.main([Ljava/lang/String;)V+11
j  java.lang.invoke.LambdaForm$DMH+0x00000c0000001c00.invokeStatic(Ljava/lang/Object;Ljava/lang/Object;)V+10 java.base@25-internal
j  java.lang.invoke.LambdaForm$MH+0x00000c0000003400.invoke(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+33 java.base@25-internal
j  java.lang.invoke.Invokers$Holder.invokeExact_MT(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+20 java.base@25-internal
j  jdk.internal.reflect.DirectMethodHandleAccessor.invokeImpl(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+55 java.base@25-internal
j  jdk.internal.reflect.DirectMethodHandleAccessor.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+23 java.base@25-internal
j  java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+102 java.base@25-internal
j  com.sun.javatest.regtest.agent.MainWrapper$MainTask.run()V+134
j  java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base@25-internal
j  java.lang.Thread.run()V+19 java.base@25-internal
v  ~StubRoutines::call_stub 0x00000001160d0180
V  [libjvm.dylib+0x88beec]  JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x448
V  [libjvm.dylib+0x88ac08]  JavaCalls::call_virtual(JavaValue*, Klass*, Symbol*, Symbol*, JavaCallArguments*, JavaThread*)+0x1c4
V  [libjvm.dylib+0x88ada4]  JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, JavaThread*)+0x6c
V  [libjvm.dylib+0xa1289c]  thread_entry(JavaThread*, JavaThread*)+0x12c
V  [libjvm.dylib+0x8c2088]  JavaThread::thread_main_inner()+0x1a8
V  [libjvm.dylib+0x10ae844]  Thread::call_run()+0xf4
V  [libjvm.dylib+0xe467f0]  thread_native_entry(Thread*)+0x138
C  [libsystem_pthread.dylib+0x6f94]  _pthread_start+0x88
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  java.lang.invoke.MethodHandleNatives.setCallSiteTargetNormal(Ljava/lang/invoke/CallSite;Ljava/lang/invoke/MethodHandle;)V+0 java.base@25-internal
j  java.lang.invoke.CallSite.setTargetNormal(Ljava/lang/invoke/MethodHandle;)V+7 java.base@25-internal
j  java.lang.invoke.MutableCallSite.setTarget(Ljava/lang/invoke/MethodHandle;)V+2 java.base@25-internal
j  compiler.jsr292.CallSiteDepContextTest.testGC(ZZ)V+267
j  compiler.jsr292.CallSiteDepContextTest.main([Ljava/lang/String;)V+11
j  java.lang.invoke.LambdaForm$DMH+0x00000c0000001c00.invokeStatic(Ljava/lang/Object;Ljava/lang/Object;)V+10 java.base@25-internal
j  java.lang.invoke.LambdaForm$MH+0x00000c0000003400.invoke(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+33 java.base@25-internal
j  java.lang.invoke.Invokers$Holder.invokeExact_MT(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+20 java.base@25-internal
j  jdk.internal.reflect.DirectMethodHandleAccessor.invokeImpl(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+55 java.base@25-internal
j  jdk.internal.reflect.DirectMethodHandleAccessor.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+23 java.base@25-internal
j  java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+102 java.base@25-internal
j  com.sun.javatest.regtest.agent.MainWrapper$MainTask.run()V+134
j  java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base@25-internal
j  java.lang.Thread.run()V+19 java.base@25-internal
v  ~StubRoutines::call_stub 0x00000001160d0180
Lock stack of current Java thread (top to bottom):

TobiHartmann avatar Feb 03 '25 09:02 TobiHartmann

@stefank just pointed out that this could be a regression from JDK-8347564 that just went in.

TobiHartmann avatar Feb 03 '25 09:02 TobiHartmann

@stefank just pointed out that this could be a regression from JDK-8347564 that just went in.

I thought that it could have been a regression but after looking closer I don't think it is. However, I do see a problem that the code removes the GC barriers when loading oops. I'll add an inline comment.

stefank avatar Feb 03 '25 12:02 stefank

When using Shenadoah I am still seeing embedded compressed OOPs in the instruction stream. Is that correct? These form two instructions, like so:

  movz(dst, 0xDEAD, 16);
  movk(dst, 0xBEEF);

Do you want compressed OOPs to be moved out of CodeCache as well as uncompressed OOPs? If so, you should change loadConNNodein C2.

theRealAph avatar Feb 03 '25 15:02 theRealAph

Mailing list message from Andrew Haley on hotspot-compiler-dev:

On 2/7/25 18:33, Boris Ulasevich wrote:

I think ADP+MOVK is better both in terms of performance and code density.

Good work, you may be right.

Neoverse N1 has a fairly narrow (4 wide) decoder, so I guess it's more likely to be limited by instruction count.

That benchmark isn't valid for GCC on my machine, because its outputs aren't used so GCC doesn't generate code for the asm.

However, if we change the benchmark to actually *do something* with the data (simply add the results together) we get this for movz+movk on Apple M1:

2,332,615,983 cycles:u # 3.135 GHz (95.26%) 18,660,205,348 instructions:u # 8.00 insn per cycle (95.26%)

and this for adrp+movk:

2,563,872,489 cycles:u # 3.057 GHz (96.03%) 14,357,197,644 instructions:u # 5.60 insn per cycle (96.03%)

Here we can see that the M1 is totally front-end limited: 8 ipc is the speed of light on an M1. Nonetheless, the timings are similar, with the win going to movz+movk.

On Neoverse V2, I also see an advantage for adrp+movk:

4162362189 cycles:u # 2.796 GHz 25002398111 instructions:u # 6.01 insn per cycle

3243420864 cycles:u # 2.796 GHz 21002398115 instructions:u # 6.48 insn per cycle

So, looks like adrp+movk has an overall advantage. I'm still somewhat skeptical that this usage really deserves a reloc handler of its own, though, given the usage.

If we do decide to do this, please give the forced movk version of adrp() a new name, and have adrp() call it.

Having said all of that, I'm not sure why we're seeing such different instruction counts for Apple M1 and Neoverse V2, I guess it must be the compiler but I don't know why, so take all of this with a big pinch of salt. For this really to be valid I guess we'd have to use the exact same binaries.

-- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

mlbridge[bot] avatar Feb 08 '25 10:02 mlbridge[bot]

Mailing list message from Andrew Haley on hotspot-compiler-dev:

On 2/8/25 10:38, Andrew Haley wrote:

On 2/7/25 18:33, Boris Ulasevich wrote:

I think ADP+MOVK is better both in terms of performance and code density.

Good work, you may be right.

Neoverse N1 has a fairly narrow (4 wide) decoder, so I guess it's more likely to be limited by instruction count.

That benchmark isn't valid for GCC on my machine, because its outputs aren't used so GCC doesn't generate code for the asm.

However, if we change the benchmark to actually *do something* with the data (simply add the results together) we get this for movz+movk on Apple M1:

Sorry, I messed up:

2,916,015,951 cycles:u # 3.151 GHz (95.33%) 22,773,874,629 instructions:u # 7.81 insn per cycle (95.33%)

adrp:

3,132,136,323 cycles:u # 3.114 GHz (96.60%) 18,420,464,246 instructions:u # 5.88 insn per cycle (96.60%)

-- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

mlbridge[bot] avatar Feb 08 '25 11:02 mlbridge[bot]