
8339573: Update CodeCacheSegmentSize and CodeEntryAlignment for ARM

Open bulasevich opened this issue 1 year ago • 30 comments

With this change, I have adjusted the default settings for CodeCacheSegmentSize and CodeEntryAlignment for AArch64 and ARM32. The main goal is to improve code density by reducing the number of wasted bytes (approximately 4% of code cache space is wasted). Improving code density may also have the side effect of boosting performance in large applications.

Each nmethod occupies a number of code cache segments (minimum allocation blocks). Since nmethod sizes are not multiples of the 128-byte segment size, the last occupied segment is, on average, half empty. Reducing the code cache segment size correspondingly reduces this waste. However, we should be careful not to reduce CodeCacheSegmentSize too much, as smaller segment sizes increase the overhead of the CodeHeap::_segmap bitmap. A CodeCacheSegmentSize of 64 seems to be an optimal balance.
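As a rough back-of-the-envelope illustration of this trade-off (a hypothetical model, not HotSpot code: it assumes about one _segmap byte per segment, half a segment wasted per nmethod on average, and made-up heap and nmethod sizes):

```cpp
#include <cstddef>
#include <cstdio>
#include <initializer_list>

// Hypothetical model: tail waste shrinks with the segment size, while the
// CodeHeap::_segmap overhead (~1 byte per segment) grows as segments shrink.
static void model(size_t heap_bytes, size_t avg_nmethod_bytes, size_t seg) {
  size_t nmethods   = heap_bytes / avg_nmethod_bytes;
  size_t tail_waste = nmethods * seg / 2;   // ~half a segment per nmethod
  size_t segmap     = heap_bytes / seg;     // ~1 byte per segment
  printf("segment %3zu: tail waste ~%5zuK, segmap ~%5zuK\n",
         seg, tail_waste / 1024, segmap / 1024);
}

int main() {
  // Assumed numbers: a 48M code heap with an average nmethod of ~1K.
  for (size_t seg : {32, 64, 128, 256}) {
    model(48u << 20, 1024, seg);
  }
  return 0;
}
```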

The current large default value for CodeCacheSegmentSize (64+64) was historically introduced with the comment "Tiered compilation has large code-entry alignment", which doesn't make much sense to me. The history of this comment and value is as follows:

  • The PPC port was introduced with CodeEntryAlignment=128 (recently reduced to 64: https://github.com/openjdk/jdk/commit/09a78b5d) and CodeCacheSegmentSize was adjusted accordingly for that platform.
  • Soon after, the 128-byte alignment was applied to all platforms to hide a debug mode warning (https://github.com/openjdk/jdk/commit/e8bc971d). Despite the change (and the Segmented Code Cache introduced later), the warning can still be reproduced today using the -XX:+VerifyCodeCache fastdebug option in large applications (10K nmethods ~ 10K free blocks in between them).

I believe it is time to remove the comment and update the default value.

I also suggest updating the default CodeEntryAlignment value for AArch64. The current setting is much larger than on x86 and was likely based on the typical cache line size of 64 bytes. The Cortex-A57 and Cortex-A72 software optimization guides recommend 32-byte alignment for subroutine entry points; the Neoverse software optimization guides do not mention a recommended entry-point alignment.

For reference, the default function_align setting in GCC is typically 16 or 32 bytes, depending on the target architecture.

Hotspot performance tests with -XX:CodeCacheSegmentSize=64 and -XX:CodeEntryAlignment=16 options showed the following results:

  • No performance impact on Hotspot microbenchmarks and other microbenchmarks.
  • On the Renaissance Dotty benchmark:
    • 0.2-0.4% performance improvement on Neoverse N1/V1/V2 architectures
    • 0.7% performance improvement on Raspberry Pi Model 3 (ARM32, ARM1176JZF-S)
    • slight performance degradation on Cortex-A72, reproducible only with the CodeEntryAlignment update.

I suggest changing the CodeCacheSegmentSize for AArch64 and ARM32 and updating the CodeEntryAlignment for AArch64 Neoverse platforms.


Progress

  • [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue

Issue

  • JDK-8339573: Update CodeCacheSegmentSize and CodeEntryAlignment for ARM (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/20864/head:pull/20864
$ git checkout pull/20864

Update a local copy of the PR:
$ git checkout pull/20864
$ git pull https://git.openjdk.org/jdk.git pull/20864/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 20864

View PR using the GUI difftool:
$ git pr show -t 20864

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20864.diff

Webrev

Link to Webrev Comment

bulasevich avatar Sep 05 '24 00:09 bulasevich

:wave: Welcome back bulasevich! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Sep 05 '24 00:09 bridgekeeper[bot]

@bulasevich This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8339573: Update CodeCacheSegmentSize and CodeEntryAlignment for ARM

Reviewed-by: kvn, eastigeevich

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 762 new commits pushed to the master branch:

  • a95374f588149d80068275a496ba4aa04b3bb4fd: 8343101: Rework BasicTest.testTemp test cases
  • 00fe9f7bdfd245791bca6b5b1b2d0a98d41af221: 8343100: Consolidate EmptyFolderTest and EmptyFolderPackageTest jpackage tests into single java file
  • 9f6d5b46ce2cfcdb39f94b8ac8621ee21f4e8740: 8343020: (fs) Add support for SecureDirectoryStream on macOS
  • 1341b81321fe77005ba68fba19c7d83e3fcb5fde: 8341666: FileInputStream doesn't support readAllBytes() or readNBytes(int) on pseudo devices
  • 52382e285fdf853c01605f8e0d7f3f5d34965802: 8338021: Support new unsigned and saturating vector operators in VectorAPI
  • e659d9da5d6198ad9c85efd6472e138a6a3961c2: 8342975: C2: Micro-optimize PhaseIdealLoop::Dominators()
  • 9f6211bcf1b46e4bfba2d128d9eb8457bc0cde51: 8341371: CDS cannot load archived heap objects with -XX:+UseSerialGC -XX:-UseCompressedOops
  • 120a9357b3cf63427a6c8539128b69b11b9beca3: 8342561: Metaspace for generated reflection classes is no longer needed
  • d5fb6b4a3cf4926acb333e7ee55f96fc76225631: 8339939: [JVMCI] Don't compress abstract and interface Klasses
  • a5ad974bec932c63ddc647c9986a513ae32ef663: 8343056: C2: Micro-optimize Node lists grow
  • ... and 752 more: https://git.openjdk.org/jdk/compare/a7120e2b251e1337df5bd4a2808638d28b7d3bd3...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

openjdk[bot] avatar Sep 05 '24 01:09 openjdk[bot]

@bulasevich The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Sep 05 '24 01:09 openjdk[bot]

Webrevs

mlbridge[bot] avatar Sep 09 '24 16:09 mlbridge[bot]

Looks like good low-hanging fruit to me. Could we ask @veresov why he made this change? Thanks!

theRealAph avatar Sep 09 '24 18:09 theRealAph

I don't quite remember making this change... And I don't remember any reasons as to why it might have been needed.

veresov avatar Sep 09 '24 18:09 veresov

As @bulasevich pointed out, the change was made by Albert to fix a VM warning: https://bugs.openjdk.org/browse/JDK-8029799

Quote: "The freelist of the code cache exceeds 10'000 items, which results in a VM warning. The problem behind the warning is that the freelist is populated by a large number of small free blocks. For example, in failing test case (see header), the freelist grows up to more than 3500 items where the largest item on the list is 9 segments (one segment is 64 bytes). That experiment was done on my laptop. Such a large freelist can indeed be a performance problem, since we use a linear search to traverse the freelist."

The warning is about a huge freelist which is scanned linearly to find corresponding free space in the CodeCache for the next allocation. It becomes big with tiered compilation because we generate a lot of C1-compiled code which is later replaced with C2-compiled code.

The fix for 8029799 optimized the freelist search by selecting the first block that has enough space. This reduces search time but, on the other hand, may increase fragmentation of the CodeCache space.

There were several optimizations done for this code by @RealLucy: JDK-8223444 and JDK-8231460. But it still uses a linked list for free segments. Should we consider something more complex? Or is it not an issue?
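For reference, the strategy amounts to a linear first-fit scan over a singly linked freelist, roughly like this simplified sketch (not the actual heap.cpp code):

```cpp
#include <cstddef>

// Simplified first-fit freelist scan (not the actual CodeHeap code).
struct FreeBlock {
  size_t     length_segments;  // block size, in code cache segments
  FreeBlock* next;
};

// Walk the list and unlink the first block that is large enough.
// With thousands of small free blocks this O(n) scan is exactly
// what the "# of free blocks > 10000" warning is hinting at.
FreeBlock* first_fit(FreeBlock** head, size_t needed_segments) {
  for (FreeBlock** link = head; *link != nullptr; link = &(*link)->next) {
    FreeBlock* block = *link;
    if (block->length_segments >= needed_segments) {
      *link = block->next;  // unlink; caller splits off any excess segments
      return block;
    }
  }
  return nullptr;  // no fit: allocate from the unused end of the heap instead
}
```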

vnkozlov avatar Sep 09 '24 19:09 vnkozlov

Hotspot performance tests with -XX:CodeCacheSegmentSize=64 and -XX:CodeEntryAlignment=16 options showed the following results:

Which of these two flag settings improved performance most?

vnkozlov avatar Sep 09 '24 19:09 vnkozlov

Hotspot performance tests with -XX:CodeCacheSegmentSize=64 and -XX:CodeEntryAlignment=16 options showed the following results:

Which of these two flag settings improved performance most?

There is high noise in the benchmark. With 1K iterations on a Neoverse V2 machine the results are:

vm options            ms/op      %     cache-misses (M)  group0-code_sparsity (K)  L1-icache-load-misses:u (M)
default               635.574          7193              18152                     10111
CodeEntryAlignment    633.961  -0.25   7177              18156                     10334
CodeCacheSegmentSize  633.041  -0.15   7187              17919                     10081
both                  631.687  -0.21   7167              17796                     10287

bulasevich avatar Sep 09 '24 20:09 bulasevich

The warning is about a huge freelist which is scanned linearly to find corresponding free space in the CodeCache for the next allocation. It becomes big with tiered compilation because we generate a lot of C1-compiled code which is later replaced with C2-compiled code.

With the Segmented Code Cache we have separate freelists for the profiled and non-profiled heaps, so I think there is no need to adjust the segment size for tiered compilation.

But it still uses a linked list for free segments. Should we consider something more complex? Or is it not an issue?

This can indeed be a problem. CodeHeap allocation takes about 1% of the total compilation time. Here are my statistics for 50K compiled methods:

before: total_compilation_time(s): 370, total_allocation_time(s): 1.6
after:  total_compilation_time(s): 378, total_allocation_time(s): 3.4

bulasevich avatar Sep 09 '24 22:09 bulasevich

With the Segmented Code Cache we have separate freelists for the profiled and non-profiled heaps, so I think there is no need to adjust the segment size for tiered compilation.

It indeed helped somewhat, but as you said:

the warning can still be reproduced today using the -XX:+VerifyCodeCache fastdebug option in large applications (10K nmethods ~ 10K free blocks in between them).

Anyway, I agree to trade off 1% in compilation speed for 0.2-0.7% application performance improvement.

vnkozlov avatar Sep 10 '24 01:09 vnkozlov

I will ask someone to do our performance testing to confirm your results.

vnkozlov avatar Sep 10 '24 01:09 vnkozlov

There were several optimizations done for this code by @RealLucy: JDK-8223444 and JDK-8231460. But it still uses a linked list for free segments. Should we consider something more complex? Or is it not an issue?

During my testing of the mentioned fixes, I never saw such long freelists. They can only appear with really severe fragmentation in the code cache. One could create such a state artificially by allocating many (small) code blobs and then freeing every other one.

Adjacent free slots are fused (combined together) during free processing. During code blob allocation, the free list is searched first for a match. That usually helps well against a growing free list. To gain more insight, you may want to try -XX:+PrintCodeHeapAnalytics. The same information (with more options) is also available via jcmd.
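To illustrate the fusing, here is a toy sketch; it uses a sorted map keyed by start segment for brevity, which is not how the real CodeHeap tracks free blocks:

```cpp
#include <cstddef>
#include <iterator>
#include <map>

// Toy sketch of fusing adjacent free blocks on free.
using FreeMap = std::map<size_t, size_t>;  // start segment -> length in segments

void free_block(FreeMap& free_blocks, size_t start, size_t len) {
  auto next = free_blocks.lower_bound(start);
  // Fuse with the preceding free block if it ends exactly where we start.
  if (next != free_blocks.begin()) {
    auto prev = std::prev(next);
    if (prev->first + prev->second == start) {
      start = prev->first;
      len  += prev->second;
      free_blocks.erase(prev);
    }
  }
  // Fuse with the following free block if we end exactly where it starts.
  if (next != free_blocks.end() && start + len == next->first) {
    len += next->second;
    free_blocks.erase(next);
  }
  free_blocks[start] = len;  // insert the (possibly fused) block
}
```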

CodeCacheSegmentSize should not be chosen too small to keep memory and processing overhead in check. 64 bytes appears to be a feasible choice.

RealLucy avatar Sep 10 '24 10:09 RealLucy

I will ask someone to do our performance testing to confirm your results.

Good. Thank you very much. I am focusing mainly on memory optimization; the performance improvement is a side effect. Anyway, more testing would be good. Thanks!

bulasevich avatar Sep 10 '24 11:09 bulasevich

During my testing of the mentioned fixes, I never saw such long freelists.

I see the warning every time I run the Renaissance benchmarks with the VerifyCodeCache option. Is this a real problem, or just a reflection of the fact that the code heap contains 20K methods with some holes in between?

$ ./build/linux-aarch64-server-fastdebug/jdk/bin/java -XX:+VerifyCodeCache -XX:+PrintCodeCache -jar ~/renaissance-jmh-0.14.2-95-gae3b5ce.jar Dotty
# Run progress: 0.00% complete, ETA 00:00:00
# Fork: 1 of 5
# Warmup Iteration   1: 86079.453 ms/op
# Warmup Iteration   2: 8212.947 ms/op
# Warmup Iteration   3: 5416.496 ms/op
# Warmup Iteration   4: 6841.073 ms/op
# Warmup Iteration   5: 4248.337 ms/op
# Warmup Iteration   6: OpenJDK 64-Bit Server VM warning: CodeHeap: # of free blocks > 10000
3530.697 ms/op
# Warmup Iteration   7: 3186.153 ms/op

CodeCacheSegmentSize should not be chosen too small to keep memory and processing overhead in check. 64 bytes appears to be a feasible choice.

OK. Thanks!

bulasevich avatar Sep 10 '24 11:09 bulasevich

The warning can be understood as a hint at a potential efficiency issue. There is no hard limit on the number of free blocks.

I do not have the Renaissance suite readily runnable at hand. Would you mind sending me the -XX:+PrintCodeHeapAnalytics output from such a run? You could use [email protected] to send it (not many people will be interested in it).

RealLucy avatar Sep 10 '24 12:09 RealLucy

I don't quite remember making this change... And I don't remember any reasons as to why it might have been needed.

OK, thanks.

theRealAph avatar Sep 10 '24 12:09 theRealAph

@bulasevich Thanks for providing the Code Heap Stats.

At vm shutdown time, when the stats were taken, there is no pathological state in the code heap.

  • non-profiled nmethods has 70 free blocks, occupying 1033k in total.
  • profiled nmethods has 525 free blocks, occupying 5550k in total.
  • non-nmethods has 5 free blocks, occupying 263k in total.

The warning must be triggered by a transient situation, when many methods are removed from the code heap in a short period of time without significant new method compilations. Eventually, this large number of free blocks will either be reallocated for new compilations or collapse into fewer, but larger, blocks when even more methods are removed from the code heap.

In short: no need to worry.

RealLucy avatar Sep 10 '24 15:09 RealLucy

@RealLucy Good! Thanks for checking!

With this change I do not make things worse. The Code Heap Stats numbers fluctuate wildly, but still look good with the tuned CodeCacheSegmentSize and CodeEntryAlignment:

  • non-profiled nmethods has 53 free blocks, occupying 23k in total.
  • profiled nmethods has 199 free blocks, occupying 935k in total.
  • non-nmethods has 13 free blocks, occupying 296k in total.

bulasevich avatar Sep 10 '24 16:09 bulasevich

Were performance runs made with CodeEntryAlignment set to values other than 64 or 16? It seems like the other choices (32, 128, are there others that make sense?) should be tried.

Peter-B-Kessler avatar Sep 10 '24 19:09 Peter-B-Kessler

Were performance runs made with CodeEntryAlignment set to other than 64 or 16? It seems like the other choices (32, 128, are there others that make sense?) should be tried.

Here are rough Neoverse V2 numbers:

JmhDotty (-XX:CodeEntryAlignment=16):  701.93 ± 5.00 ms/op
JmhDotty (-XX:CodeEntryAlignment=32):  703.56 ± 5.15 ms/op
JmhDotty (-XX:CodeEntryAlignment=64):  704.46 ± 5.18 ms/op
JmhDotty (-XX:CodeEntryAlignment=128): 703.71 ± 5.17 ms/op

bulasevich avatar Sep 11 '24 14:09 bulasevich

@bulasevich I just want to let you know that we continue our performance testing...

vnkozlov avatar Sep 12 '24 19:09 vnkozlov

Hi @bulasevich. We got a significant (a few percent) regression when testing -XX:CodeCacheSegmentSize=64 -XX:CodeEntryAlignment=16 on an Ampere system which has N1 CPUs.
Is it possible to not set CodeEntryAlignment=16 for N1? Maybe even limit it to V1 and V2 only?

vnkozlov avatar Sep 12 '24 23:09 vnkozlov

@vnkozlov Many thanks! Do you reproduce the regression on a public benchmark that I can also try? Now I restrict CodeEntryAlignment=16 to V1 and V2 only, and I restart my performance tests.

bulasevich avatar Sep 13 '24 12:09 bulasevich

Do you reproduce the regression on a public benchmark that I can also try?

It was our internal benchmark.

vnkozlov avatar Sep 13 '24 16:09 vnkozlov

@vnkozlov Many thanks! Do you reproduce the regression on a public benchmark that I can also try? Now I restrict CodeEntryAlignment=16 to V1 and V2 only, and I restart my performance tests.

I think that's the wrong way around. The default should be 64, maybe with smaller settings where we know it helps performance. It makes little sense to set the default CodeEntryAlignment to less than the icache line size, except in severely constrained environments.

theRealAph avatar Sep 14 '24 08:09 theRealAph

@vnkozlov Many thanks! Do you reproduce the regression on a public benchmark that I can also try? Now I restrict CodeEntryAlignment=16 to V1 and V2 only, and I restart my performance tests.

This may have as much to do with the smallish icache on the Ampere parts as with the specific Arm microarchitecture, so nothing to do with N2.

theRealAph avatar Sep 14 '24 10:09 theRealAph

@vnkozlov Many thanks! Do you reproduce the regression on a public benchmark that I can also try? Now I restrict CodeEntryAlignment=16 to V1 and V2 only, and I restart my performance tests.

This may have as much to do with the smallish icache

Sorry, I meant last level cache

theRealAph avatar Sep 14 '24 16:09 theRealAph

@theRealAph

It makes little sense to set the default CodeEntryAlignment to less than the icache line size, except in severely constrained environments.

Why do we need CodeEntryAlignment? The instruction prefetcher has more time to load the next cache line if execution starts at the beginning of the current cache line. But this consideration makes more sense for OptoLoopAlignment: ideally, the entire loop body fits into a limited number of instruction cache lines, which is unlikely to happen for an entire nmethod body.
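As a toy illustration of the code-density side of this trade-off (hypothetical entry addresses, not HotSpot code), the cost of alignment is just the padding emitted before each entry:

```cpp
#include <cstdint>
#include <cstdio>

// Padding bytes needed before a method entry at entry_pc for a given
// power-of-two alignment; on average this costs alignment/2 bytes per method.
static uint64_t entry_padding(uint64_t entry_pc, uint64_t alignment) {
  return (alignment - (entry_pc & (alignment - 1))) & (alignment - 1);
}

int main() {
  // Hypothetical entry address 0x10008 (8 bytes past a 64-byte boundary):
  printf("%llu\n", (unsigned long long)entry_padding(0x10008, 64));  // 56
  printf("%llu\n", (unsigned long long)entry_padding(0x10008, 16));  // 8
  return 0;
}
```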

I have experimented with code entry alignment in a native application (repeatedly calling a large number of aligned/unaligned short methods) and found that for the Neoverse N2 CPU 64-byte alignment is preferable, while no difference was observed on Neoverse V2. I am not sure whether this is a feature of the processor implementation or of the Neoverse architecture; the Neoverse N2/V2 technical reference manuals say pretty much the same about L1 instruction memory system features.

bulasevich avatar Sep 16 '24 18:09 bulasevich

@bulasevich, @theRealAph, @vnkozlov I see that gcc uses "32:16" (align to a 32-byte boundary when that can be done by skipping fewer than 16 bytes) for all Neoverses. This is in line with what I found in the Neoverse-N1 Optimization Guide:

Consider aligning subroutine entry points and branch targets to 32B boundaries, within the bounds of the code-density requirements of the program. This will ensure that the subsequent fetch can maximize bandwidth following the taken branch by bringing in all useful instructions

It is interesting that I don't see this in the N2, V1, and V2 Optimization Guides.

IMO we should consider setting CodeEntryAlignment to 32 for Neoverses and keeping 64 for others. Any objections?

eastig avatar Oct 03 '24 22:10 eastig