Poor performance of cache invalidation on Graviton2 machines
Describe the bug
Invalidating instruction cache lines is unusually slow on Graviton2-based EC2 instances. This is detrimental to the JVM's performance. While this is ultimately a hardware problem, I think it's fair to expect Corretto to work around quirks of AWS's own hardware. I've observed the problem on Linux, but it may also exist on Windows.
In my use case, this issue manifests after a major ZGC collection. A ZGC major collection marks all nmethods as armed, and each must be disarmed by the next thread to encounter it. Disarming involves invalidating an instruction cache line. Most disarms are performed by user threads that want to run the armed method.
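For context, the arm/disarm pattern looks roughly like the sketch below. The names (global_epoch, on_nmethod_entry) are invented for illustration, and HotSpot's real nmethod entry barriers are more involved, but the shape is the same: the GC bumps one global value to arm every nmethod at once, and the first thread to enter an armed nmethod runs a slow path that ends with an icache invalidation.

#include <stdatomic.h>

static atomic_int global_epoch;          /* bumped by the GC: arms all nmethods at once */

typedef struct {
    int   epoch;                         /* per-nmethod disarm value */
    char *code_begin, *code_end;         /* bounds of the compiled code */
} nmethod;

/* Barrier conceptually executed on entry to compiled code. */
static void on_nmethod_entry(nmethod *nm) {
    int current = atomic_load(&global_epoch);
    if (nm->epoch != current) {          /* armed: first entry since the GC cycle */
        /* slow path: fix up the code, then invalidate its icache lines */
        __builtin___clear_cache(nm->code_begin, nm->code_end);
        nm->epoch = current;             /* disarm */
    }
}

int main(void) {
    static char code[64];
    nmethod nm = { 0, code, code + sizeof code };
    atomic_fetch_add(&global_epoch, 1);  /* "major collection" arms everything */
    on_nmethod_entry(&nm);               /* first caller disarms, clearing icache */
    return 0;
}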
On Linux + aarch64, the JVM uses __builtin___clear_cache() to accomplish instruction cache invalidation. GCC and LLVM implement this using, in part, the ic ivau instruction. This instruction purports to invalidate the instruction cache for a specific address. I have observed effects that suggest that on Graviton 2, ic ivau invalidates the entire instruction cache. Because the effects of ic ivau are broadcast to all cores in the Inner Shareable domain (the subsequent dsb ish waits for this to complete), __builtin___clear_cache() effectively invalidates the instruction caches of all cores each time it runs. This then incurs a performance penalty as the caches are refilled.
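For reference, the maintenance sequence the builtin expands to looks roughly like this sketch (compare the perf annotation below and the GCC/LLVM sources linked later in this thread). The 64-byte line size is an assumption for brevity; the real implementations read the line sizes from CTR_EL0.

#include <stdint.h>

/* Clean D-cache to the point of unification, then invalidate the I-cache,
   line by line, for [begin, end). Assumes 64-byte cache lines. */
static void clear_cache_sketch(char *begin, char *end) {
    const uintptr_t line = 64;
    uintptr_t p;
    for (p = (uintptr_t)begin & ~(line - 1); p < (uintptr_t)end; p += line)
        __asm__ volatile("dc cvau, %0" :: "r"(p) : "memory");  /* clean data cache */
    __asm__ volatile("dsb ish" ::: "memory");                  /* wait for cleans */
    for (p = (uintptr_t)begin & ~(line - 1); p < (uintptr_t)end; p += line)
        __asm__ volatile("ic ivau, %0" :: "r"(p) : "memory");  /* invalidate icache */
    __asm__ volatile("dsb ish" ::: "memory");                  /* wait for invalidates */
    __asm__ volatile("isb" ::: "memory");                      /* resynchronize fetch */
}

int main(void) {
    char buf[64];
    clear_cache_sketch(buf, buf + sizeof buf);
    return 0;
}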
To Reproduce
The cache invalidation performance issue is best observed in a tiny example outside of the JVM:
int main() {
    int foo;
    /* Repeatedly flush one small range to expose the per-call cost.
       Build and profile with, e.g.:
         gcc -O2 clear.c -o clear
         perf record ./clear && perf annotate __aarch64_sync_cache_range */
    for (int i = 0; i < 10000000; i++) {
        /* Cast to char * so the 4-byte range covers foo itself. */
        __builtin___clear_cache((char *)&foo, (char *)&foo + 4);
    }
}
When the above C program is compiled and run on an i4g.large (Graviton2) machine, it takes 5-10 seconds to run. Cycle counting via perf reveals 70%-80% of cycles are spent in the instruction immediately following the ic instruction (a simple add):
Samples: 25K of event 'cycles', 4000 Hz, Event count (approx.): 976565821
__aarch64_sync_cache_range /home/cconnell/clear [Percent: local period]
Percent│ mov w3, #0x4 // #4
│ lsl w3, w3, w2
│ sub w2, w3, #0x1
│ bic x2, x0, x2
│ cmp x2, x1
│ ↓ b.cs 50
│ sxtw x3, w3
│ nop
│40: dc cvau, x2
│ add x2, x2, x3
│ cmp x1, x2
│ ↑ b.hi 40
0.92 │50: dsb ish
13.03 │ ldr w2, [x5, #32]
│ ↓ tbnz w2, #29, 94
│ and w4, w4, #0xf
│ mov w2, #0x4 // #4
│ lsl w3, w2, w4
│ sub w2, w3, #0x1
│ bic x0, x0, x2
│ sxtw x2, w3
1.15 │ cmp x1, x0
│ ↓ b.ls 90
│ nop
│80: ic ivau, x0
83.46 │ add x0, x0, x2
│ cmp x1, x0
│ ↑ b.hi 80
0.00 │90: dsb ish
0.01 │94: isb
0.02 │ ← ret
This suggests that ic ivau invalidates the entire instruction cache, and that execution of the next instruction stalls while it refills. On an i8g.large (Graviton4) machine, this program takes 0.2 seconds to run, and the cycle counts are distributed very differently.
Expected behavior
I hope Corretto can work around the ic ivau quirk. Knowing now that it is apparently only possible to clear the entire instruction cache at once, the Corretto JVM should use different logic for deciding when to perform cache invalidations: it can do far fewer of them without sacrificing correctness.
Screenshots
not applicable
Platform information
Linux <redacted> 6.1.127-<redacted>.aarch64 #1 SMP Wed Feb 19 00:18:56 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux
Problem discovered on:
openjdk version "21.0.2" 2024-01-16 LTS
OpenJDK Runtime Environment Temurin-21.0.2+13 (build 21.0.2+13-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.2+13 (build 21.0.2+13-LTS, mixed mode, sharing)
However, code reading shows that Corretto has the same issue in at least corretto-21 and corretto-jdk.
Additional context
I will attach CPU profile flamegraphs showing considerable CPU time spent in __aarch64_sync_cache_range (effectively the same as __builtin___clear_cache).
Two CPU profile flamegraphs catching the problem in action in an HBase server:
Thank you very much for taking the time to submit a detailed report. We'll take a look.
GCC implementation: https://github.com/gcc-mirror/gcc/blob/master/libgcc/config/aarch64/sync-cache.c#L31
LLVM implementation: https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/builtins/clear_cache.c#L123
Graviton 2 is Neoverse N1. On Graviton 2 we need the current behaviour because of Neoverse N1 errata #1542419: https://developer.arm.com/documentation/SDEN885747/latest/
The core might fetch a stale instruction from memory which violates the ordering of instruction fetches
Description: When the core executes an instruction that has been recently modified, the core might fetch a stale instruction, which violates the ordering of instruction fetches. This is due to the architecture changes involving prefetch-speculation-protection.
On Graviton 2 we have the loop of ic ivau instructions because instruction cache hardware coherency is disabled (CTR_EL0.DIC is not set) due to errata #1542419.
Graviton 3, which is Neoverse V1, does not have the issue: CTR_EL0.DIC is set, so there is no loop.
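The DIC check is in fact visible in the perf annotation above: the tbnz w2, #29, 94 branch tests bit 29 of CTR_EL0 (DIC) and skips the ic ivau loop entirely when it is set. A small sketch for inspecting these bits on a given machine (on Linux, EL0 reads of CTR_EL0 are either permitted directly or trapped and emulated by the kernel):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t ctr;
    /* Read the cache type register; the kernel emulates this if trapped. */
    __asm__("mrs %0, ctr_el0" : "=r"(ctr));
    printf("CTR_EL0 = 0x%016llx\n", (unsigned long long)ctr);
    printf("DIC = %d (1: no ic ivau needed for coherency)\n", (int)((ctr >> 29) & 1));
    printf("IDC = %d (1: no dc cvau needed to the PoU)\n", (int)((ctr >> 28) & 1));
    return 0;
}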
Okay, understood. I still had the question of whether ic is always so slow. Looking at how Linux handles the errata, it (well, really the firmware) traps ic instructions. I compiled my own kernel so that I could compare the performance of ic when it is trapped versus when it is not; the difference suggests that the trap handler is why ic is slow for me.
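For anyone who wants to repeat the comparison across kernels, a minimal harness like the following works (the iteration count and the single-int range are arbitrary choices):

#include <stdio.h>
#include <time.h>

int main(void) {
    int foo;
    struct timespec t0, t1;
    const int iters = 1000000;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        __builtin___clear_cache((char *)&foo, (char *)&foo + sizeof foo);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    /* Elapsed nanoseconds, averaged per call. */
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per __builtin___clear_cache call\n", ns / iters);
    return 0;
}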
Presumably, we need to invalidate cache lines when we unload classes (most likely correlated with major ZGC collections), when we deoptimize certain translations due to invalidation of assumed optimizations, and when we choose to reprioritize our notions of which methods are hot and which are not.
@eastig: Given that the Graviton2 implementation of __builtin___clear_cache() has much broader impact than the Graviton4 implementation, is there any possibility of improved performance by reducing the number of calls to this (or let's rename it) service?
My question may be off target. There's a lot going on here that I am not familiar with.
Something like the following:

#ifdef Graviton2
    __builtin_clear_icache_everywhere();
#else
    foreach relevant_cache_line {
        __builtin___clear_cache(relevant_cache_line);
    }
#endif
Separate from the question of whether this is a possible approach, we would need to address the question of whether this sort of specialization for Graviton 2 is a "manageable" approach.
From some recent discussions, it seems that my list of reasons to invalidate cache lines enumerated above is incomplete. GenZGC apparently self-modifies certain code at each GC phase change. This is a very frequent event. We have not completed analysis of this behavior. If our understanding is correct, this would suggest that an alternative approach such as GenShen might behave better than GenZGC on Graviton 2, even without any specialized code generation for Graviton 2.
(There is also a nightly build of Corretto 21 that has GenShen support.)
Hi guys, following up on our conversation earlier today. I looked at the GC logs from a couple of representative HBase servers in our fleet. Both of them had 6 days of uptime. One was i4g.4xlarge, and one was i8g.4xlarge. I see these nmethod counts:
NMethods: 10066 registered, 1753 unregistered (i4g)
NMethods: 11652 registered, 9728 unregistered (i8g)
I also wanted to respond to something I missed during our discussion. @eastig, you made a comment that the ZGC authors may not have considered the impact of clearing icache lines on ARM, where instruction cache coherence is not always available. I want to clarify that I don't believe that instruction cache coherence is required for good ZGC performance, and that it's generally okay to clear icache lines. However, on Graviton2, the ic instructions used to clear icache lines are trapped, and I believe it's the trap handler specifically that is slow, not ic in general.
We've tried to create a reproducer for the problem identified here. This reproducer uses a configuration of the Extremem workload. Here is the IntelliJ timeline for a Graviton 2 run that exhibited very bad performance with GenZGC.
I see some indications that this matches the behavior described in the blog post: https://product.hubspot.com/blog/its-never-a-hardware-bug-until-it-is
During the concurrent GC "bursts" that begin around time 0, 4.5m, 12.7m, 20m, 24.5m, I see clear increases in the work performed by mutator threads 0 or 166.
I guess I can also see some correlations with GC activity when Thread-166 has bursts of activity at time 3m, 7m, 9m. In these cases, the mutator activity seems to precede the GC activity. Maybe that's to be expected. Does the GC invalidate the icache lines before it starts its "heavy lifting"? Actually, these bursts of mutator activity may be triggered by C2 compiler activity, which also needs to invalidate icache.
It looks to me like there are also bursts of mutator (thread-166) activity at times 11m, 15m, 19m, and 23m for which I cannot identify any associated GC activity.
Bottom line: do you believe this evidence is sufficiently similar to your production workload analysis to represent a fair reproducer?
Knowing that we have a representative reproducer will help us test the improvements we are making to GenZGC. It will also let us confirm that Generational Shenandoah does not share the problem found with Generational ZGC; we have already run Generational Shenandoah on this workload and observed no performance issues.
do you believe this evidence is sufficiently similar to your production workload analysis to represent a fair reproducer?
To answer that confidently, can I ask you to check into some of the CPU samples in your profile? If this is a good reproducer of my issue, then I would expect that the samples in Thread-0 and Thread-166 that are contemporaneous with GC activity would mostly be calling __aarch64_sync_cache_range.
Thanks for that additional hint. I am not seeing large numbers of samples of __aarch64_sync_cache_range: only 2 samples during 25 minutes of profiling. The correlation of GC work to thread activity is apparently driven by cyclic behavior of this "extreme" workload, which "rebuilds" its persistent database every couple of minutes. I'll keep looking.
Hi @charlesconnell, I created JDK-8370947 with a simple reproducer. In it you can also find some possible overhead reductions.