
Poor performance of cache invalidation on Graviton2 machines

Open charlesconnell opened this issue 4 months ago • 12 comments

Describe the bug

Invalidating instruction cache lines is unusually slow on Graviton2-based EC2 instances. This is detrimental to the JVM's performance. While this is ultimately a hardware problem, I think it's fair to expect Corretto to work around quirks of AWS's own hardware. I've observed the problem on Linux, but it may also exist on Windows.

In my use case, this issue manifests after a major ZGC collection. A ZGC major collection marks all nmethods as armed, and they must be disarmed by the next thread to encounter the nmethod. Disarming involves invalidating an instruction cache line. Most disarms are accomplished by user threads wanting to run the armed method.
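For readers unfamiliar with the mechanism, here is a rough C sketch of the arm/disarm idea (hypothetical names, not actual HotSpot code):

#include <stdatomic.h>

// The GC "arms" every nmethod by bumping a global epoch; the first
// thread to enter an armed nmethod takes the slow path and disarms it.
static _Atomic int gc_epoch;            // bumped at each major collection

struct nmethod {
  _Atomic int guard;                    // equals gc_epoch when disarmed
  unsigned char *code;                  // start of the compiled code
  unsigned long code_size;
};

void on_nmethod_entry(struct nmethod *nm) {
  int epoch = atomic_load(&gc_epoch);
  if (atomic_load(&nm->guard) == epoch)
    return;                             // already disarmed: fast path
  // Slow path: fix the nmethod up for the new GC phase, record the new
  // guard, then invalidate the icache line so all cores see the update.
  // This icache invalidation is the expensive step on Graviton 2.
  atomic_store(&nm->guard, epoch);
  __builtin___clear_cache((char *)nm->code, (char *)nm->code + nm->code_size);
}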

On Linux + aarch64, the JVM uses __builtin___clear_cache() to accomplish instruction cache invalidation. GCC and LLVM implement this using, in part, the ic ivau instruction, which purports to invalidate the instruction cache for a specific address. I have observed effects suggesting that on Graviton 2, ic ivau invalidates the entire instruction cache. Because the invalidation is broadcast to all cores in the inner-shareable domain (and completed by the dsb ish barrier visible in the disassembly below), __builtin___clear_cache() effectively invalidates the instruction caches of all cores each time it runs. This then incurs a performance penalty as the caches are refilled.
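For context, the libgcc helper behind the builtin (__aarch64_sync_cache_range, disassembled below) boils down to roughly the following C. This is a simplified sketch based on the GCC source and the disassembly, not the actual implementation:

#include <stdint.h>

void sync_cache_range_sketch(char *begin, char *end) {
  uint64_t ctr;
  __asm__ volatile("mrs %0, ctr_el0" : "=r"(ctr));  // cache type register
  uint64_t dline = 4UL << ((ctr >> 16) & 0xF);      // dcache line bytes
  uint64_t iline = 4UL << (ctr & 0xF);              // icache line bytes

  // Clean each data cache line to the point of unification.
  for (char *p = (char *)((uintptr_t)begin & ~(dline - 1)); p < end; p += dline)
    __asm__ volatile("dc cvau, %0" :: "r"(p) : "memory");
  __asm__ volatile("dsb ish" ::: "memory");

  // Invalidate icache lines unless CTR_EL0.DIC says it's unnecessary.
  if (!((ctr >> 29) & 1)) {
    for (char *p = (char *)((uintptr_t)begin & ~(iline - 1)); p < end; p += iline)
      __asm__ volatile("ic ivau, %0" :: "r"(p) : "memory");
    __asm__ volatile("dsb ish" ::: "memory");
  }
  __asm__ volatile("isb" ::: "memory");
}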

To Reproduce

The cache invalidation performance issue is best observed in a tiny example outside of the JVM:

int main() {
  int foo;
  // Repeatedly invalidate a small address range; on Graviton 2 each
  // iteration is surprisingly expensive.
  for (int i = 0; i < 10000000; i++) {
    __builtin___clear_cache(&foo, &foo + 4);
  }
}

When the above C program is compiled and run on an i4g.large (Graviton2) machine, it takes 5-10 seconds to complete. Cycle counting via perf reveals 70%-80% of cycles attributed to the instruction immediately following the ic instruction (a simple add):

Samples: 25K of event 'cycles', 4000 Hz, Event count (approx.): 976565821
__aarch64_sync_cache_range  /home/cconnell/clear [Percent: local period]
Percent│      mov  w3, #0x4                        // #4
       │      lsl  w3, w3, w2
       │      sub  w2, w3, #0x1
       │      bic  x2, x0, x2
       │      cmp  x2, x1
       │    ↓ b.cs 50
       │      sxtw x3, w3
       │      nop
       │40:   dc   cvau, x2
       │      add  x2, x2, x3
       │      cmp  x1, x2
       │    ↑ b.hi 40
  0.92 │50:   dsb  ish
 13.03 │      ldr  w2, [x5, #32]
       │    ↓ tbnz w2, #29, 94
       │      and  w4, w4, #0xf
       │      mov  w2, #0x4                        // #4
       │      lsl  w3, w2, w4
       │      sub  w2, w3, #0x1
       │      bic  x0, x0, x2
       │      sxtw x2, w3
  1.15 │      cmp  x1, x0
       │    ↓ b.ls 90
       │      nop
       │80:   ic   ivau, x0
 83.46 │      add  x0, x0, x2
       │      cmp  x1, x0
       │    ↑ b.hi 80
  0.00 │90:   dsb  ish
  0.01 │94:   isb
  0.02 │    ← ret

This suggests that ic ivau invalidates the entire instruction cache and that execution of the next instruction stalls while the cache refills. On an i8g.large (Graviton4) machine, this program takes 0.2 seconds to run, and the cycle counts are distributed very differently.

Expected behavior

I hope Corretto can work around the ic ivau quirk. Knowing now that it's effectively only possible to clear the entire instruction cache at once, the Corretto JVM should use different logic about when to perform cache invalidations. It can do far fewer of them without sacrificing correctness.
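To illustrate the kind of reduction I mean, here is a hypothetical sketch (not Corretto code, and ignoring the memory-ordering details a real implementation would need): if one ic ivau flushes the whole icache anyway, many pending invalidations can be coalesced into a single __builtin___clear_cache call.

#include <stdatomic.h>
#include <stddef.h>

// Hypothetical coalescing sketch: the flag is set while the icache is
// known clean; any code modification clears it, and only the first
// flusher afterwards pays for the (whole-icache) invalidation.
static atomic_flag icache_clean = ATOMIC_FLAG_INIT;

void note_code_modified(void) {
  atomic_flag_clear(&icache_clean);     // icache is now stale
}

void flush_icache_if_stale(char *code, size_t len) {
  if (!atomic_flag_test_and_set(&icache_clean))
    __builtin___clear_cache(code, code + len);  // one flush covers everything
}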

Screenshots

not applicable

Platform information

Linux <redacted> 6.1.127-<redacted>.aarch64 #1 SMP Wed Feb 19 00:18:56 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux

Problem discovered on:

openjdk version "21.0.2" 2024-01-16 LTS
OpenJDK Runtime Environment Temurin-21.0.2+13 (build 21.0.2+13-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.2+13 (build 21.0.2+13-LTS, mixed mode, sharing)

However, code reading shows that Corretto has the same issue in at least corretto-21 and corretto-jdk.

Additional context

I will attach CPU profile flamegraphs showing considerable CPU time spent in __aarch64_sync_cache_range (effectively the same as __builtin___clear_cache).

charlesconnell avatar Aug 30 '25 22:08 charlesconnell

Two CPU profile flamegraphs catching the problem in action in an HBase server:

async-prof-flamegraph-cpu-1755850777556.html

async-prof-cpu-1756131323018.html

charlesconnell avatar Aug 30 '25 22:08 charlesconnell

Thank you very much for taking the time to submit a detailed report. We'll take a look.

synecdoche avatar Aug 31 '25 00:08 synecdoche

GCC implementation: https://github.com/gcc-mirror/gcc/blob/master/libgcc/config/aarch64/sync-cache.c#L31
LLVM implementation: https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/builtins/clear_cache.c#L123

eastig avatar Sep 02 '25 20:09 eastig

Graviton 2 is Neoverse N1. On Graviton 2 we need the current behaviour because of Neoverse N1 erratum #1542419: https://developer.arm.com/documentation/SDEN885747/latest/

The core might fetch a stale instruction from memory which violates the ordering of instruction fetches

Description: When the core executes an instruction that has been recently modified, the core might fetch a stale instruction, which violates the ordering of instruction fetches. This is due to the architecture changes involving prefetch-speculation-protection.

On Graviton 2 we have the loop of ic ivau because instruction cache hardware coherency is disabled (CTR_EL0.DIC is not set) due to erratum #1542419. Graviton 3, which is Neoverse V1, does not have the erratum; CTR_EL0.DIC is set, so there is no ic ivau loop.
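For anyone who wants to check this on their own instance, here is a minimal user-space probe of CTR_EL0 (note that a kernel carrying the erratum workaround may report a value different from the raw hardware one):

#include <stdint.h>
#include <stdio.h>

// Read the cache type register and print the coherency bits:
// DIC (bit 29): icache invalidation not required for coherency.
// IDC (bit 28): dcache clean not required for coherency.
int main(void) {
  uint64_t ctr;
  __asm__("mrs %0, ctr_el0" : "=r"(ctr));
  printf("CTR_EL0=0x%016llx DIC=%llu IDC=%llu\n",
         (unsigned long long)ctr,
         (unsigned long long)((ctr >> 29) & 1),
         (unsigned long long)((ctr >> 28) & 1));
  return 0;
}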

eastig avatar Sep 02 '25 21:09 eastig

Okay, understood. I still had the question of whether ic is always so slow. Looking at how Linux handles the erratum, it (well, really the firmware) traps ic instructions. I compiled my own kernel so that I could compare the performance of ic when it's trapped versus when it's not, and the untrapped case was far faster. This suggests to me that the trap handler is why ic is slow for me.

charlesconnell avatar Sep 03 '25 15:09 charlesconnell

Presumably, we need to invalidate cache lines when we unload classes (most likely correlation with major ZGC collection), when we deoptimize certain translations due to invalidation of assumed optimizations, and when we choose to reprioritize our notions of which methods are hot and which are not.

@eastig: Given that the Graviton 2 implementation of __builtin___clear_cache() has much broader impact than the Graviton 4 implementation, is there any possibility of improved performance by reducing the number of calls to this (or let's rename it) service?

My question may be off target. There's a lot going on here that I am not familiar with.

Something like the following:

#ifdef Graviton2
  __builtin_clear_icache_everywhere();
#else
  foreach relevant_cache_line do {
    __builtin___clear_cache(relevant_cache_line);
  }
#endif

Separate from the question of whether this is a possible approach, we would need to address the question of whether this sort of specialization for Graviton 2 is a "manageable" approach.

kdnilsen avatar Oct 01 '25 22:10 kdnilsen

From some recent discussions, it seems that my list of reasons to invalidate cache lines enumerated above is incomplete. GenZGC apparently self-modifies certain code at each GC phase change. This is a very frequent event. We have not completed analysis of this behavior. If our understanding is correct, this would suggest that an alternative approach such as GenShen might behave better than GenZGC on Graviton 2, even without any specialized code generation for Graviton 2.

kdnilsen avatar Oct 02 '25 16:10 kdnilsen

(There is also a nightly build of Corretto 21 that has GenShen support; see the nightly builds of Corretto-21.)

kdnilsen avatar Oct 02 '25 17:10 kdnilsen

Hi guys, following up on our conversation earlier today. I looked at the GC logs from a couple of representative HBase servers in our fleet. Both of them had 6 days of uptime. One was i4g.4xlarge, and one was i8g.4xlarge. I see these nmethod counts:

  • NMethods: 10066 registered, 1753 unregistered (i4g)
  • NMethods: 11652 registered, 9728 unregistered (i8g)

I also wanted to respond to something I missed during our discussion. @eastig, you made a comment that the ZGC authors may not have considered the impact of clearing icache lines on ARM, where instruction cache coherence is not always available. I want to clarify that I don't believe that instruction cache coherence is required for good ZGC performance, and that it's generally okay to clear icache lines. However, on Graviton2, the ic instructions used to clear icache lines are trapped, and I believe it's the trap handler specifically that is slow, not ic in general.

charlesconnell avatar Oct 10 '25 19:10 charlesconnell

We've tried to create a reproducer for the problem identified here. This reproducer uses a configuration of the Extremem workload. Here is the IntelliJ timeline for a Graviton 2 run that exhibited very bad performance with GenZGC.

[Image: IntelliJ timeline of the Graviton 2 GenZGC Extremem run]

I see some indications that this matches the behavior described in the blog post: https://product.hubspot.com/blog/its-never-a-hardware-bug-until-it-is

During the concurrent GC "bursts" that begin around times 0, 4.5m, 12.7m, 20m, and 24.5m, I see clear increases in the work performed by mutator thread 0 or thread 166.

I guess I can also see some correlations with GC activity when Thread-166 has bursts of activity at times 3m, 7m, and 9m. In these cases, the mutator activity seems to precede the GC activity. Maybe that's to be expected. Does the GC invalidate the icache lines before it starts its "heavy lifting"? Actually, these bursts of mutator activity may be triggered by C2 compiler activity, which also needs to invalidate the icache.

It looks to me like there are also bursts of mutator (thread-166) activity at times 11m, 15m, 19m, and 23m for which I cannot identify any associated GC activity.

Bottom line: do you believe this evidence is sufficiently similar to your production workload analysis to represent a fair reproducer?

Knowing that we have a representative reproducer will help us test the improvements we are making to GenZGC, and will also let us confirm that Generational Shenandoah does not share the problem found with Generational ZGC (we have already run Generational Shenandoah on this workload and observed no performance issues).

kdnilsen avatar Oct 14 '25 21:10 kdnilsen

do you believe this evidence is sufficiently similar to your production workload analysis to represent a fair reproducer?

To answer that confidently, can I ask you to check into some of the CPU samples in your profile? If this is a good reproducer of my issue, then I would expect that the samples in Thread-0 and Thread-166 that are contemporaneous with GC activity would mostly be calling __aarch64_sync_cache_range.

charlesconnell avatar Oct 14 '25 21:10 charlesconnell

Thanks for that additional hint. I am not seeing large numbers of samples of __aarch64_sync_cache_range: only 2 samples during 25 minutes of profiling. The correlation of GC work to thread activity is apparently driven by cyclic behavior of this "extreme" workload, which "rebuilds" its persistent database every couple of minutes. I'll keep looking.

kdnilsen avatar Oct 15 '25 15:10 kdnilsen

Hi @charlesconnell, I created JDK-8370947 with a simple reproducer. In it, you can also find some possible overhead reductions.

eastig avatar Nov 10 '25 22:11 eastig