8372701: Randomized profile counters
Profile counters scale very badly.
The overhead for profiled code isn't too bad with one thread, but as the thread count increases, things go wrong very quickly.
For example, here's a benchmark from the OpenJDK test suite, run at tiered-compilation level 3 with one thread, then with three threads:
Benchmark                        (randomized)  Mode  Cnt    Score   Error  Units
InterfaceCalls.test2ndInt5Types         false  avgt    4   27.468 ± 2.631  ns/op   (1 thread)
InterfaceCalls.test2ndInt5Types         false  avgt    4  240.010 ± 6.329  ns/op   (3 threads)
This slowdown is caused by high memory contention on the profile counters. Not only is this slow, but it can also lose profile counts.
This patch is for C1 only. It'd be easy to randomize C1 counters as well in another PR, if anyone thinks it's worth doing.
One other thing to note is that randomized profile counters degrade very badly with small decimation ratios. For example, a ratio of 2 (-XX:ProfileCaptureRatio=2) with a single thread results in
Benchmark                        (randomized)  Mode  Cnt   Score   Error  Units
InterfaceCalls.test2ndInt5Types         false  avgt    4  80.147 ± 9.991  ns/op
The problem is that the branch-prediction rate drops away sharply, leading to many mispredictions. It only really makes sense to use higher decimation ratios, e.g. 64.
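For readers skimming the description, the scheme can be sketched in a few lines of C++. Everything below is illustrative: the struct, the function names, and the choice of xorshift32 are assumptions made for the sketch, not code from the patch.

```cpp
#include <cstdint>
#include <cassert>

// Illustrative sketch only: a per-thread xorshift32 state gates counter
// updates so that, on average, only 1 in `ratio` events touches the
// shared counter. `ratio` plays the role of -XX:ProfileCaptureRatio.
struct ProfileRng {
  uint32_t state;
  uint32_t next() {            // xorshift32 (Marsaglia); any cheap RNG works
    uint32_t x = state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return state = x;
  }
};

// Count `events` occurrences with decimation; returns the scaled estimate.
uint64_t randomized_count(ProfileRng& rng, uint64_t events, uint32_t ratio) {
  uint64_t counter = 0;                  // the (normally shared) profile counter
  for (uint64_t i = 0; i < events; i++) {
    if (rng.next() % ratio == 0) {       // taken with probability ~1/ratio
      counter++;                         // only now do we write to memory
    }
  }
  return counter * ratio;                // rescale to estimate the true count
}
```

With a ratio of 64, the shared counter is written roughly 64 times less often, which is what removes the cross-thread contention; the price is that the recorded count becomes a statistical estimate rather than an exact tally.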
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Issue
- JDK-8372701: Randomized profile counters (Enhancement - P4)
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28541/head:pull/28541
$ git checkout pull/28541
Update a local copy of the PR:
$ git checkout pull/28541
$ git pull https://git.openjdk.org/jdk.git pull/28541/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 28541
View PR using the GUI difftool:
$ git pr show -t 28541
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28541.diff
👋 Welcome back aph! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
❗ This change is not yet ready to be integrated. See the Progress checklist in the description for automated requirements.
@theRealAph this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:
git checkout JDK-8134940
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
⚠️ @theRealAph This pull request contains merges that bring in commits not present in the target repository. Since this is not a "merge style" pull request, these changes will be squashed when this pull request is integrated. If this is your intention, then please ignore this message. If you want to preserve the commit structure, you must change the title of this pull request to Merge <project>:<branch>, where <project> is the name of another project in the OpenJDK organization (for example Merge jdk:master).
@theRealAph The following label will be automatically applied to this pull request:
- hotspot
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
Impressive work.
Clashes a bit with https://github.com/openjdk/jdk/pull/25305/, which commons the type profile check and makes it more robust. It would be trivial to resolve, as that PR has only one place where the counter is updated. It also gives you some additional budget to spare for more instructions in profiled code. So it would be nice if that PR (and probably its AArch64 version) landed first.
Thanks.
Sure, it can wait for that PR.
The inlined profile update code is moved to a stub, then in its place we put:
    ubfx        x8, rng, #26, #6  // extract the top 6 bits of the random-number generator
    cbz         x8, update        // if they are zero, jump to the stub that updates the profile counter
    next_random rng               // generate the next random number
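In C++ terms, and assuming for illustration that the generator is xorshift32 (the snippet above doesn't specify what next_random expands to), the fast path behaves like this sketch:

```cpp
#include <cstdint>
#include <cassert>

// Sketch of the inlined fast path above. Testing the top 6 bits of a
// 32-bit RNG value for zero succeeds with probability 1/64, which matches
// a decimation ratio of 64. The RNG step shown is an assumption
// (xorshift32); the real generator may differ.
inline uint32_t next_random(uint32_t x) {   // stand-in for `next_random rng`
  x ^= x << 13;
  x ^= x >> 17;
  x ^= x << 5;
  return x;
}

// Returns true when the (out-of-line) profile-update stub would be taken.
inline bool take_update_stub(uint32_t& rng) {
  uint32_t top6 = rng >> 26;   // ubfx x8, rng, #26, #6
  if (top6 == 0) {             // cbz x8, update
    return true;               // jump to the stub (assumed to advance the RNG)
  }
  rng = next_random(rng);      // next_random rng
  return false;
}
```

Testing the top 6 bits for zero succeeds on 2^26 of the 2^32 possible values, i.e. with probability 1/64, so on average 63 of every 64 events skip the update entirely.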
At the moment, several C2 IR tests fail with randomized profile counters because they are acutely sensitive to small changes in profile counts. I think this can probably be fixed.
Also, I believe there are some kinds of event that should never be missed, even when subsampling profile counters in this way. I'd like people to advise me which events these are.
I have only made the back-end changes to AArch64 and x86. The back-end changes are simple to make for other architectures, and will need to be done if this PR is to be merged into mainline.
Also, I believe there are some kinds of event that should never be missed, even when subsampling profile counters in this way. I'd like people to advise me which events these are
One other thing that comes to mind: the initial swing from 0 -> 1 for a type counter is important, since 0 means "never seen the type at all", and >0 means "maybe the type is present, however rare". I would suspect subsampling a small count to 0 would cause performance anomalies. Especially if, say, this anomaly causes a deopt - reprofile - compile cycle. It would doubly hurt if reprofiling missed the type again. Probably hard to do with RNG, but maybe we should be doing the initial counter seed on installation without consulting RNG. I don't think the current patch does it, but maybe I am looking at the wrong place. Would be fairly trivial to do after https://github.com/openjdk/jdk/pull/25305.
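The suggestion can be pictured with a small C++ sketch. The table layout and names below are hypothetical (from neither this patch nor #25305): the point is only that the 0 -> 1 installation bypasses the RNG gate, so a type can never be recorded as entirely unseen once it has occurred.

```cpp
#include <cstdint>
#include <cstddef>
#include <cassert>

// Hypothetical receiver-type table, loosely like MethodData receiver rows.
struct TypeRow { uintptr_t klass; uint64_t count; };

// Record one receiver. Installation (0 -> 1) is never subsampled, so a
// type is never left looking "unseen" once it has occurred; only the
// increments for already-known receivers consult the RNG gate.
void profile_receiver(TypeRow rows[], size_t n, uintptr_t klass,
                      bool rng_says_count) {
  for (size_t i = 0; i < n; i++) {
    if (rows[i].klass == klass) {        // known receiver: RNG-gated
      if (rng_says_count) rows[i].count++;
      return;
    }
    if (rows[i].klass == 0) {            // free row: install unconditionally
      rows[i].klass = klass;
      rows[i].count = 1;                 // the important 0 -> 1 swing
      return;
    }
  }
  // table full: a real implementation would bump a polymorphic counter here
}
```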
OK, all useful thoughts. I'll have a look.
Happy to see a serious contender for a resolution to this long-standing issue. While it's a bit unclear how problematic it is in practice, we see issues related to this in thread-heavy benchmarks (such as SPECjvm2008) regularly.
It'd be easy to randomize C1 counters as well in another PR, if anyone thinks it's worth doing.
I assume you mean interpreter counters?
Oops. yes, of course, thanks!
I can run our internal performance testing with this but it currently fails to build on AArch64:
[2025-12-03T11:49:29,644Z] * For target hotspot_variant-server_libjvm_objs_c1_LIRAssembler_aarch64.o:
[2025-12-03T11:49:29,644Z] /System/Volumes/Data/mesos/work_dir/slaves/da1065b5-7b94-4f0d-85e9-a3a252b9a32e-S5842/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/a3ab2e9e-0898-4ceb-94ab-4f606db9de4d/runs/44169997-4fbe-4f98-98b9-d11781843c5e/workspace/open/src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp:2739:18: error: lambda capture 'op' is not used [-Werror,-Wunused-lambda-capture]
[2025-12-03T11:49:29,644Z] auto lambda = [op, stub] (LIR_Assembler* ce, LIR_Op* base_op) {
[2025-12-03T11:49:29,644Z] ^~~
[2025-12-03T11:49:29,644Z] 1 error generated.
[2025-12-03T11:49:29,644Z] * For target hotspot_variant-server_libjvm_objs_static_c1_LIRAssembler_aarch64.o:
[2025-12-03T11:49:29,644Z] /System/Volumes/Data/mesos/work_dir/slaves/da1065b5-7b94-4f0d-85e9-a3a252b9a32e-S5842/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/a3ab2e9e-0898-4ceb-94ab-4f606db9de4d/runs/44169997-4fbe-4f98-98b9-d11781843c5e/workspace/open/src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp:2739:18: error: lambda capture 'op' is not used [-Werror,-Wunused-lambda-capture]
[2025-12-03T11:49:29,644Z] auto lambda = [op, stub] (LIR_Assembler* ce, LIR_Op* base_op) {
[2025-12-03T11:49:29,644Z] ^~~
[2025-12-03T11:49:29,644Z] 1 error generated.
Done, thank you.
Thanks, still fails though:
[2025-12-04T13:35:07,965Z] * For target hotspot_variant-server_libjvm_objs_c1_MacroAssembler_aarch64.o:
[2025-12-04T13:35:07,965Z] /opt/mach5/mesos/work_dir/slaves/da1065b5-7b94-4f0d-85e9-a3a252b9a32e-S11677/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/aaf0c312-b425-4910-a568-48e356b81714/runs/ffe8e70a-35bd-4ae4-bed8-85bd71130e51/workspace/open/src/hotspot/cpu/aarch64/c1_MacroAssembler_aarch64.cpp: In member function 'void C1_MacroAssembler::save_profile_rng()':
[2025-12-04T13:35:07,965Z] /opt/mach5/mesos/work_dir/slaves/da1065b5-7b94-4f0d-85e9-a3a252b9a32e-S11677/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/aaf0c312-b425-4910-a568-48e356b81714/runs/ffe8e70a-35bd-4ae4-bed8-85bd71130e51/workspace/open/src/hotspot/cpu/aarch64/c1_MacroAssembler_aarch64.cpp:302:54: error: 'profile_rng_offset' is not a member of 'JavaThread'
[2025-12-04T13:35:07,965Z] 302 | strw(r_profile_rng, Address(rthread, JavaThread::profile_rng_offset()));
[2025-12-04T13:35:07,965Z] | ^~~~~~~~~~~~~~~~~~
[2025-12-04T13:35:07,965Z] /opt/mach5/mesos/work_dir/slaves/da1065b5-7b94-4f0d-85e9-a3a252b9a32e-S11677/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/aaf0c312-b425-4910-a568-48e356b81714/runs/ffe8e70a-35bd-4ae4-bed8-85bd71130e51/workspace/open/src/hotspot/cpu/aarch64/c1_MacroAssembler_aarch64.cpp: In member function 'void C1_MacroAssembler::restore_profile_rng()':
[2025-12-04T13:35:07,965Z] /opt/mach5/mesos/work_dir/slaves/da1065b5-7b94-4f0d-85e9-a3a252b9a32e-S11677/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/aaf0c312-b425-4910-a568-48e356b81714/runs/ffe8e70a-35bd-4ae4-bed8-85bd71130e51/workspace/open/src/hotspot/cpu/aarch64/c1_MacroAssembler_aarch64.cpp:308:54: error: 'profile_rng_offset' is not a member of 'JavaThread'
[2025-12-04T13:35:07,965Z] 308 | ldrw(r_profile_rng, Address(rthread, JavaThread::profile_rng_offset()));
[2025-12-04T13:35:07,965Z] | ^~~~~~~~~~~~~~~~~~
[2025-12-04T13:35:07,965Z] * For target hotspot_variant-server_libjvm_objs_static_c1_MacroAssembler_aarch64.o:
[2025-12-04T13:35:07,965Z] /opt/mach5/mesos/work_dir/slaves/da1065b5-7b94-4f0d-85e9-a3a252b9a32e-S11677/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/aaf0c312-b425-4910-a568-48e356b81714/runs/ffe8e70a-35bd-4ae4-bed8-85bd71130e51/workspace/open/src/hotspot/cpu/aarch64/c1_MacroAssembler_aarch64.cpp: In member function 'void C1_MacroAssembler::save_profile_rng()':
[2025-12-04T13:35:07,966Z] /opt/mach5/mesos/work_dir/slaves/da1065b5-7b94-4f0d-85e9-a3a252b9a32e-S11677/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/aaf0c312-b425-4910-a568-48e356b81714/runs/ffe8e70a-35bd-4ae4-bed8-85bd71130e51/workspace/open/src/hotspot/cpu/aarch64/c1_MacroAssembler_aarch64.cpp:302:54: error: 'profile_rng_offset' is not a member of 'JavaThread'
[2025-12-04T13:35:07,966Z] 302 | strw(r_profile_rng, Address(rthread, JavaThread::profile_rng_offset()));
[2025-12-04T13:35:07,966Z] | ^~~~~~~~~~~~~~~~~~
[2025-12-04T13:35:07,966Z] /opt/mach5/mesos/work_dir/slaves/da1065b5-7b94-4f0d-85e9-a3a252b9a32e-S11677/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/aaf0c312-b425-4910-a568-48e356b81714/runs/ffe8e70a-35bd-4ae4-bed8-85bd71130e51/workspace/open/src/hotspot/cpu/aarch64/c1_MacroAssembler_aarch64.cpp: In member function 'void C1_MacroAssembler::restore_profile_rng()':
[2025-12-04T13:35:07,966Z] /opt/mach5/mesos/work_dir/slaves/da1065b5-7b94-4f0d-85e9-a3a252b9a32e-S11677/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/aaf0c312-b425-4910-a568-48e356b81714/runs/ffe8e70a-35bd-4ae4-bed8-85bd71130e51/workspace/open/src/hotspot/cpu/aarch64/c1_MacroAssembler_aarch64.cpp:308:54: error: 'profile_rng_offset' is not a member of 'JavaThread'
[2025-12-04T13:35:07,966Z] 308 | ldrw(r_profile_rng, Address(rthread, JavaThread::profile_rng_offset()));
[2025-12-04T13:35:07,966Z] | ^~~~~~~~~~~~~~~~~~
Thanks, still fails though:
Sure, but which configuration is that? I don't see such a failure in the Github tests.
I just did macOS x86 and AArch64 minimal builds, and they're fine too. I have no idea what you are seeing.
That was on Linux AArch64. Let me re-run.
Ah, I think the problem is that there are merge conflicts with master. Could you please resolve them?
Done.
Thanks. Correctness testing is all clean on our side, I submitted benchmarks and will report back once it finished.
One other thing that comes to mind: the initial swing from 0 -> 1 for a type counter is important, since 0 means "never seen the type at all", and >0 means "maybe the type is present, however rare". I would suspect subsampling a small count to 0 would cause performance anomalies. Especially if, say, this anomaly causes a deopt - reprofile - compile cycle. It would doubly hurt if reprofiling missed the type again. Probably hard to do with RNG, but maybe we should be doing the initial counter seed on installation without consulting RNG.
I've been thinking about this some more, and I wonder how important it really is. Let's say we don't compile a method until it's been interpreted a couple of hundred times (it's at least 128). We're speculating that the call site is polymorphic, but so far it has been called only monomorphically, so we need to check every invocation, just in case. I guess this does happen occasionally during some application warmup scenarios, but does it really matter? I guess we could try a big performance test suite to learn whether, with (say) 64-times decimation, we see more recompilation.
I've been thinking about this some more, and I wonder how important it really is. Let's say we don't compile a method until it's been interpreted a couple of hundred times (it's at least 128). We're speculating that the call site is polymorphic, but so far it has been called only monomorphically, so we need to check every invocation, just in case. I guess this does happen occasionally during some application warmup scenarios, but does it really matter?
For steady state it does not matter. But for warmup, Leyden teaches us (we were whack-a-mole-ing problems like these for the better part of the year there) that a misguided trap-recompilation trip through C2 costs a lot. The compiler dynamics get so funky that you start looking out for things that look "probably fine" on paper, but may conspire against you every so often. This looks to me like one of those things.
To make matters worse, for the applications that have clearly defined warmup/steady states, there is code that would execute only during warmup. Think initialization code that takes a particular path once and only once. For warmup in AOT mode, you really want that code to be generated ahead of time. Because it defeats the purpose of AOT to spend lots of JIT compilation time recompiling for a one-off initialization case. Which forces AOT code to be compiled more pessimistically. I can see how missing a rare receiver sets up AOT compilation for overly optimistic compilation that would trap at runtime, and at the worst time -- at warmup -- when compilers are already burning up.
In other words, that "does happen occasionally during some application warmup scenarios" is one of the things that Leyden tries to summarily avoid.
To your example: indeed, there is no recourse in the case where a really-polymorphic site accidentally looks monomorphic due to code-behavior artifacts: e.g. no one came with the rare type just yet. But IMO that does not mean we should be opening more performance trap-doors for when some code does come with the rare type, especially if it is easy to handle.
The fact that some of your horses might have bolted, does not give you a good reason to open the barn door a bit wider :)
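For what it's worth, the chance of missing a type entirely is easy to quantify: if each occurrence is captured independently with probability 1/ratio, a receiver seen k times during profiling is never recorded with probability (1 - 1/ratio)^k. With a ratio of 64, even 128 occurrences are missed entirely about 13% of the time. A small check:

```cpp
#include <cmath>
#include <cassert>

// Probability that a receiver seen k times is never captured when each
// occurrence is recorded independently with probability 1/ratio.
double miss_probability(int k, int ratio) {
  return std::pow(1.0 - 1.0 / ratio, k);
}
```

So the 0 -> 1 installation path really does need to bypass the RNG if "seen at least once" is to stay reliable at modest profile lengths.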
Thanks for this input, it helps a lot.
How cheap it is to always update type profile counters depends on how many threads are racily updating them. But it's easy to move the type-profiling code from behind the random step to in front of it, so I'll make that change.
The fact that some of your horses might have bolted, does not give you a good reason to open the barn door a bit wider
Agree totally in principle, but that analogy only works if the cost of closing the door is near-zero. It may well be so, we'll see.
Can you please run benchmarks with -XX:+UnlockExperimentalVMOptions -XX:ProfileCaptureRatio=64? Thanks.
How cheap it is to always update type profile counters depends on how many threads are racily updating them.
Sorry, my brain fart. We only need to read the classes before the random step, so there is no scaling problem.
Yes. And I hope after https://github.com/openjdk/jdk/pull/25305 you can really just specialize the installation code a little: that code already knows whether it is about to install a new receiver type in the table (so it can just write 1), or it is an increment of a known receiver (which can go the RNG route). The poly counter would need some thinking about.
Sure, I'm excited to do that. Please merge as soon as you can. Do you intend to do the same for AArch64? I volunteer, if you like.