8339573: Update CodeCacheSegmentSize and CodeEntryAlignment for ARM
With this change, I have adjusted the default settings for CodeCacheSegmentSize and CodeEntryAlignment for AArch64 and ARM32. The main goal is to improve code density by reducing the number of wasted bytes (currently approximately 4% waste). Improving code density may also have the side effect of boosting performance in large applications.
Each nmethod occupies a number of code cache segments (minimum allocation blocks). Since the size of an nmethod is not aligned to 128 bytes, the last occupied segment is, on average, half empty. Reducing the code cache segment size correspondingly reduces this waste. However, we should be careful about reducing CodeCacheSegmentSize too much, as smaller segment sizes increase the overhead of the CodeHeap::_segmap bitmap. A CodeCacheSegmentSize of 64 seems to be the optimal balance.
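As a back-of-envelope illustration (my own model; the ~1.6 KB average nmethod size and the one-map-byte-per-segment cost are assumptions, not measurements from this change), the trade-off between half-segment waste and segment-map overhead can be sketched as:

```python
# Toy model of per-nmethod overhead as a function of CodeCacheSegmentSize.
# Assumptions (hypothetical, for illustration): average nmethod ~1.6 KB,
# segment map costs ~1 byte per segment.
def overhead_percent(segment_size, avg_nmethod_bytes=1600):
    internal_waste = segment_size / 2.0              # last segment half empty on average
    segmap_bytes = avg_nmethod_bytes / segment_size  # ~1 map byte per segment
    return 100.0 * (internal_waste + segmap_bytes) / avg_nmethod_bytes

for seg in (32, 64, 128):
    print(f"segment size {seg:3d} B -> ~{overhead_percent(seg):.1f}% overhead")
```

Under these assumptions the combined overhead is smallest around a 64-byte segment, which matches the balance described above.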
The current large default value for CodeCacheSegmentSize (64+64) was historically introduced with the comment "Tiered compilation has large code-entry alignment", which doesn't make much sense to me. The history of this comment and value is as follows:
- The PPC port was introduced with CodeEntryAlignment=128 (recently reduced to 64: https://github.com/openjdk/jdk/commit/09a78b5d) and CodeCacheSegmentSize was adjusted accordingly for that platform.
- Soon after, the 128-byte alignment was applied to all platforms to hide a debug mode warning (https://github.com/openjdk/jdk/commit/e8bc971d). Despite the change (and Segmented Code Cache introduced later), the warning can still be reproduced today using the -XX:+VerifyCodeCache fastdebug option in large applications (10K nmethods ~ 10K free blocks in between them).
I believe it is time to remove the comment and update the default value.
I also suggest updating the default CodeEntryAlignment value for AArch64. The current setting is much larger than on x86 and was likely based on the typical cache line size of 64 bytes. The Cortex-A57 and Cortex-A72 software optimisation guides recommend a 32-byte alignment for subroutine entry points. The Neoverse software optimisation guides do not mention a recommended entry-point alignment.
For reference, the default function_align setting in GCC is typically 16 or 32 bytes, depending on the target architecture.
Hotspot performance tests with -XX:CodeCacheSegmentSize=64 and -XX:CodeEntryAlignment=16 options showed the following results:
- No performance impact on the HotSpot microbenchmark suite or other microbenchmarks.
- On the Renaissance Dotty benchmark:
  - 0.2-0.4% performance improvement on Neoverse N1/V1/V2 architectures
  - 0.7% performance improvement on Raspberry Pi Model 3 (ARM32, ARM1176JZF-S)
  - slight performance degradation on Cortex-A72, reproducible only with the CodeEntryAlignment update.
I suggest changing CodeCacheSegmentSize for AArch64 and ARM32 and updating CodeEntryAlignment for AArch64 Neoverse platforms.
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Issue
- JDK-8339573: Update CodeCacheSegmentSize and CodeEntryAlignment for ARM (Enhancement - P4)
Reviewers
- Vladimir Kozlov (@vnkozlov - Reviewer) 🔄 Re-review required (review applies to 10ecb900)
- Evgeny Astigeevich (@eastig - Committer)
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/20864/head:pull/20864
$ git checkout pull/20864
Update a local copy of the PR:
$ git checkout pull/20864
$ git pull https://git.openjdk.org/jdk.git pull/20864/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 20864
View PR using the GUI difftool:
$ git pr show -t 20864
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20864.diff
Webrev
:wave: Welcome back bulasevich! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
@bulasevich This change now passes all automated pre-integration checks.
ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.
After integration, the commit message for the final commit will be:
8339573: Update CodeCacheSegmentSize and CodeEntryAlignment for ARM
Reviewed-by: kvn, eastigeevich
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.
At the time when this comment was updated there had been 762 new commits pushed to the master branch:
- a95374f588149d80068275a496ba4aa04b3bb4fd: 8343101: Rework BasicTest.testTemp test cases
- 00fe9f7bdfd245791bca6b5b1b2d0a98d41af221: 8343100: Consolidate EmptyFolderTest and EmptyFolderPackageTest jpackage tests into single java file
- 9f6d5b46ce2cfcdb39f94b8ac8621ee21f4e8740: 8343020: (fs) Add support for SecureDirectoryStream on macOS
- 1341b81321fe77005ba68fba19c7d83e3fcb5fde: 8341666: FileInputStream doesn't support readAllBytes() or readNBytes(int) on pseudo devices
- 52382e285fdf853c01605f8e0d7f3f5d34965802: 8338021: Support new unsigned and saturating vector operators in VectorAPI
- e659d9da5d6198ad9c85efd6472e138a6a3961c2: 8342975: C2: Micro-optimize PhaseIdealLoop::Dominators()
- 9f6211bcf1b46e4bfba2d128d9eb8457bc0cde51: 8341371: CDS cannot load archived heap objects with -XX:+UseSerialGC -XX:-UseCompressedOops
- 120a9357b3cf63427a6c8539128b69b11b9beca3: 8342561: Metaspace for generated reflection classes is no longer needed
- d5fb6b4a3cf4926acb333e7ee55f96fc76225631: 8339939: [JVMCI] Don't compress abstract and interface Klasses
- a5ad974bec932c63ddc647c9986a513ae32ef663: 8343056: C2: Micro-optimize Node lists grow
- ... and 752 more: https://git.openjdk.org/jdk/compare/a7120e2b251e1337df5bd4a2808638d28b7d3bd3...master
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.
➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.
@bulasevich The following label will be automatically applied to this pull request:
hotspot
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
Looks like good low-hanging fruit to me. Could we ask @veresov why he made this change? Thanks!
I don't quite remember making this change... And I don't remember any reasons as to why it might have been needed.
As @bulasevich pointed out, the change was made by Albert to fix a VM warning: https://bugs.openjdk.org/browse/JDK-8029799
Quote: "The freelist of the code cache exceeds 10'000 items, which results in a VM warning. The problem behind the warning is that the freelist is populated by a large number of small free blocks. For example, in failing test case (see header), the freelist grows up to more than 3500 items where the largest item on the list is 9 segments (one segment is 64 bytes). That experiment was done on my laptop. Such a large freelist can indeed be a performance problem, since we use a linear search to traverse the freelist."
The warning is about a huge freelist that is scanned linearly to find free space in the CodeCache for the next allocation. It becomes big with tiered compilation because we generate a lot of C1-compiled code that is later replaced by C2-compiled code.
The fix for 8029799 optimized the freelist search by selecting the first block that has enough space. This reduces search time but, on the other hand, may increase fragmentation of the CodeCache space.
There were several optimizations done for this code by @RealLucy: JDK-8223444 and JDK-8231460. But it still uses a linked list for free segments. Should we consider something more complex? Or is it not an issue?
Hotspot performance tests with -XX:CodeCacheSegmentSize=64 and -XX:CodeEntryAlignment=16 options showed the following results:
Which of these two flags setting improved performance most?
There is high noise in the benchmark. With 1K iterations on a Neoverse V2 machine the results are:
| vm options | ms/op | % | cache-misses (M) | group0-code_sparsity (K) | L1-icache-load-misses:u (M) |
|---|---|---|---|---|---|
| default | 635.574 | | 7193 | 18152 | 10111 |
| CodeEntryAlignment | 633.961 | -0.25 | 7177 | 18156 | 10334 |
| CodeCacheSegmentSize | 633.041 | -0.15 | 7187 | 17919 | 10081 |
| both | 631.687 | -0.21 | 7167 | 17796 | 10287 |
The warning is about a huge freelist that is scanned linearly to find free space in the CodeCache for the next allocation. It becomes big with tiered compilation because we generate a lot of C1-compiled code that is later replaced by C2-compiled code.
With Segmented Code Cache we have separate freelists for profiled and non-profiled heap. So I think there is no need to correct segment size for tiered compilation.
But it still uses a linked list for free segments. Should we consider something more complex? Or is it not an issue?
This can indeed be a problem. CodeHeap allocation takes about 1% of the total compilation time. Here are my statistics for 50K compiled methods:
- before: total_compilation_time(s): 370, total_allocation_time(s): 1.6
- after: total_compilation_time(s): 378, total_allocation_time(s): 3.4
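For scale, the allocation share implied by those totals can be computed directly:

```python
# CodeHeap allocation time as a share of total compilation time,
# using the totals quoted above for 50K compiled methods.
def alloc_share_percent(alloc_s, total_s):
    return 100.0 * alloc_s / total_s

print(f"before: {alloc_share_percent(1.6, 370):.2f}%")  # ~0.43%
print(f"after:  {alloc_share_percent(3.4, 378):.2f}%")  # ~0.90%
```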
With Segmented Code Cache we have separate freelists for profiled and non-profiled heap. So I think there is no need to correct segment size for tiered compilation.
It indeed helped somewhat, but as you said:
the warning can still be reproduced today using the -XX:+VerifyCodeCache fastdebug option in large applications (10K nmethods ~ 10K free blocks in between them).
Anyway, I agree to trade off 1% in compilation speed for 0.2-0.7% application performance improvement.
I will ask someone to do our performance testing to confirm your results.
There were several optimizations done for this code by @RealLucy: JDK-8223444 and JDK-8231460. But it still uses a linked list for free segments. Should we consider something more complex? Or is it not an issue?
During my testing of the mentioned fixes, I never saw such long freelists. They can only appear with really severe fragmentation in the code cache. An artificial way to create one would be to allocate many (small) code blobs and then free every other one.
Adjacent free slots are fused (combined together) during free processing. During code blob allocation, the free list is searched first for a match. That usually helps well against a growing free list. To gain more insight, you may want to try -XX:+PrintCodeHeapAnalytics. The same information (with more options) is also available via jcmd.
CodeCacheSegmentSize should not be chosen too small to keep memory and processing overhead in check. 64 bytes appears to be a feasible choice.
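To make the discussion concrete, here is a toy sketch (illustration only, not HotSpot's actual CodeHeap implementation) of a first-fit free list where adjacent free blocks are fused on free, as described above:

```python
# Minimal first-fit free list with coalescing of adjacent free blocks.
class FreeList:
    def __init__(self, capacity):
        self.blocks = [[0, capacity]]  # sorted [offset, size] free blocks

    def alloc(self, size):
        # Linear first-fit scan: take the first block that is large enough.
        for i, (off, sz) in enumerate(self.blocks):
            if sz >= size:
                if sz == size:
                    del self.blocks[i]
                else:
                    self.blocks[i] = [off + size, sz - size]
                return off
        return None  # no block large enough

    def free(self, off, size):
        # Reinsert the block, then fuse neighbors that touch.
        self.blocks.append([off, size])
        self.blocks.sort()
        merged = [self.blocks[0]]
        for b_off, b_sz in self.blocks[1:]:
            if merged[-1][0] + merged[-1][1] == b_off:
                merged[-1][1] += b_sz  # adjacent: combine into one block
            else:
                merged.append([b_off, b_sz])
        self.blocks = merged
```

Freeing every other blob, as in the fragmentation scenario above, leaves free blocks that cannot be fused, which is exactly how such a list grows long.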
I will ask someone to do our performance testing to confirm your results.
Good. Thank you very much. I am focusing mainly on memory optimization; the performance improvement is a side effect. Anyway, more testing would be good. Thanks!
During my testing of the mentioned fixes, I never saw such long freelists.
I see the warning every time I run Renaissance benchmarks with the VerifyCodeCache option. Is this a problem, or just a reflection of the fact that the code heap contains 20K methods with some holes in between?
$ ./build/linux-aarch64-server-fastdebug/jdk/bin/java -XX:+VerifyCodeCache -XX:+PrintCodeCache -jar ~/renaissance-jmh-0.14.2-95-gae3b5ce.jar Dotty
# Run progress: 0.00% complete, ETA 00:00:00
# Fork: 1 of 5
# Warmup Iteration 1: 86079.453 ms/op
# Warmup Iteration 2: 8212.947 ms/op
# Warmup Iteration 3: 5416.496 ms/op
# Warmup Iteration 4: 6841.073 ms/op
# Warmup Iteration 5: 4248.337 ms/op
# Warmup Iteration 6: OpenJDK 64-Bit Server VM warning: CodeHeap: # of free blocks > 10000
3530.697 ms/op
# Warmup Iteration 7: 3186.153 ms/op
CodeCacheSegmentSize should not be chosen too small to keep memory and processing overhead in check. 64 bytes appears to be a feasible choice.
OK. Thanks!
The warning can be understood as a hint to a potential efficiency issue. There is no hard limit for #free blocks.
I do not have the Renaissance suite readily runnable at hand. Would you mind sending me the -XX:+PrintCodeHeapAnalytics output from such a run? You could use [email protected] to send it (not many people will be interested in it).
I don't quite remember making this change... And I don't remember any reasons as to why it might have been needed.
OK, thanks.
@bulasevich Thanks for providing the Code Heap Stats.
At VM shutdown time, when the stats were taken, there is no pathological state in the code heap:
- the non-profiled nmethods heap has 70 free blocks, occupying 1033k in total;
- the profiled nmethods heap has 525 free blocks, occupying 5550k in total;
- the non-nmethods heap has 5 free blocks, occupying 263k in total.
The warning must be triggered by a transient situation, when many methods are removed from the code heap in a short period of time without significant new method compilations. Eventually, this large number of free blocks will either be reallocated for new compilations or collapse into fewer, but larger, blocks as even more methods are removed from the code heap.
In short: no need to worry.
@RealLucy Good! Thanks for checking!
With this change I am not making things worse. Code Heap Stats numbers fluctuate wildly, but still look good with the tuned CodeCacheSegmentSize and CodeEntryAlignment:
- the non-profiled nmethods heap has 53 free blocks, occupying 23k in total;
- the profiled nmethods heap has 199 free blocks, occupying 935k in total;
- the non-nmethods heap has 13 free blocks, occupying 296k in total.
Were performance runs made with CodeEntryAlignment set to other than 64 or 16? It seems like the other choices (32, 128, are there others that make sense?) should be tried.
Here are rough Neoverse V2 numbers:
- JmhDotty (-XX:CodeEntryAlignment=16): 701.93 ± 5.00 ms/op
- JmhDotty (-XX:CodeEntryAlignment=32): 703.56 ± 5.15 ms/op
- JmhDotty (-XX:CodeEntryAlignment=64): 704.46 ± 5.18 ms/op
- JmhDotty (-XX:CodeEntryAlignment=128): 703.71 ± 5.17 ms/op
@bulasevich I just want to let you know that we continue our performance testing...
Hi @bulasevich
We got a significant (a few percent) regression when testing -XX:CodeCacheSegmentSize=64 -XX:CodeEntryAlignment=16 on an Ampere system with an N1 CPU.
Is it possible not to set CodeEntryAlignment=16 for N1?
Maybe even limit it to V1 and V2 only?
@vnkozlov Many thanks! Do you reproduce the regression on a public benchmark that I can also try? Now I restrict CodeEntryAlignment=16 for V1 and V2 only. And I restart my performance tests.
Do you reproduce the regression on a public benchmark that I can also try?
It was our internal benchmark.
I think that's the wrong way around. The default should be 64, maybe with smaller settings where we know it helps performance. It makes little sense to set the default CodeEntryAlignment to less than the icache line size, except in severely constrained environments.
This may have as much to do with the smallish icache on the Ampere parts as with the specific Arm microarchitecture, so nothing to do with N2.
This may have as much to do with the smallish icache
Sorry, I meant last level cache
@theRealAph
It makes little sense to set the default CodeEntryAlignment to less than the icache line size. except in severely constrained environments.
Why do we need CodeEntryAlignment at all? The instruction prefetcher has more time to load the next cache line if execution starts at the beginning of the current cache line. But this consideration makes more sense for OptoLoopAlignment: ideally, the entire loop body fits into a limited number of instruction cache lines, which is unlikely to happen for an entire nmethod body.
I have experimented with code entry alignment in a native application (repeatedly calling a large number of aligned/unaligned short methods) and found that on a Neoverse N2 CPU 64-byte alignment is preferable, while no difference was observed on Neoverse V2. I am not sure whether this is a feature of the processor implementation or of the Neoverse architecture. The Neoverse N2/V2 technical reference manuals say pretty much the same about L1 instruction memory system features.
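A simplified fetch model (my own illustration, assuming a 64-byte instruction cache line / fetch granule) shows what entry alignment buys in the worst case:

```python
# How many useful bytes the first fetched icache line contributes when a
# method entry starts at a given offset. Assumes a 64-byte line; purely an
# illustrative model, not taken from any Arm optimization guide.
LINE = 64

def useful_bytes_in_first_line(entry_offset):
    return LINE - (entry_offset % LINE)

# Worst case per alignment: the entry lands just before a line boundary.
for align in (16, 32, 64):
    worst = min(useful_bytes_in_first_line(off) for off in range(0, 4 * LINE, align))
    print(f"align={align:3d} B -> worst-case useful bytes in first line: {worst}")
```

With 16-byte alignment an entry can land just 16 bytes before a line boundary, so the first fetch delivers only 16 useful bytes; 32-byte alignment guarantees at least half a line after a taken branch.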
@bulasevich, @theRealAph, @vnkozlov I see that GCC uses "32:16" for all Neoverses. This is aligned with what I found in the Neoverse N1 Software Optimization Guide:
Consider aligning subroutine entry points and branch targets to 32B boundaries, within the bounds of the code-density requirements of the program. This will ensure that the subsequent fetch can maximize bandwidth following the taken branch by bringing in all useful instructions
Interestingly, I don't see this in the N2, V1, and V2 optimization guides.
IMO we should consider setting CodeEntryAlignment for Neoverses to 32 and keeping 64 for others. Any objections?