8291555: Replace stack-locking with fast-locking
This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantage of stack-locking (namely fast locking in uncontended code paths) while avoiding the overloading of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check for and deal with this situation. And because of the very racy nature, this turns out to be very complex and involves a variant of the inflation protocol to ensure that the object header is stable.
What the original stack-locking does is basically push a stack-lock onto the stack, consisting only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicates 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock.
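For illustration, here is a minimal sketch of that legacy fast path; the types, names, and tag values (01 = neutral, 00 = stack-locked) are simplified stand-ins, not HotSpot's actual markWord API:

```c++
#include <atomic>
#include <cstdint>

// Simplified stand-ins for the object header and the on-stack lock record.
struct ObjectHeader { std::atomic<uintptr_t> mark; };
struct BasicLock    { uintptr_t displaced_header; };  // lives in the caller's frame

bool try_stack_lock(ObjectHeader* obj, BasicLock* lock) {
  uintptr_t mark = obj->mark.load(std::memory_order_relaxed);
  if ((mark & 0b11) == 0b01) {                  // 01 = neutral (unlocked)
    lock->displaced_header = mark;              // save the displaced header
    // A stack slot is word-aligned, so the low two bits of its address are
    // 00, which doubles as the 'stack-locked' tag in the object header.
    return obj->mark.compare_exchange_strong(mark, (uintptr_t)lock);
  }
  return false;                                 // recursion/contention: slow path
}
```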
This change basically reverses stack-locking: it still CASes the lowest two header bits to 00 to indicate 'fast-locked', but does not overload the upper bits with a stack pointer. Instead, it pushes the object reference onto a thread-local lock-stack. This is a new structure that is basically a small array of oops associated with each thread. Experience shows that this array typically remains very small (3-5 elements). Using this lock-stack, it is possible to query which threads own which locks. Most importantly, the most common question, 'does the current thread own me?', is answered very quickly by a scan of the array. More complex queries like 'which thread owns X?' are not performed on very performance-critical paths (usually in code like JVMTI or deadlock detection), where it is OK to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols.
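A corresponding sketch of the fast-locking path, reusing the simplified ObjectHeader from above; the capacity bound and helper names are assumptions for illustration, not the actual implementation:

```c++
// Thread-local lock-stack: a small array of oops recording what this thread owns.
struct LockStack {
  static const int CAPACITY = 8;     // assumed bound; 3-5 entries in practice
  ObjectHeader* elems[CAPACITY];
  int top = 0;

  bool contains(ObjectHeader* o) const {  // "does the current thread own me?"
    for (int i = 0; i < top; i++) if (elems[i] == o) return true;
    return false;
  }
};

bool try_fast_lock(ObjectHeader* obj, LockStack* ls) {
  uintptr_t mark = obj->mark.load(std::memory_order_relaxed);
  uintptr_t locked = mark & ~(uintptr_t)0b11;   // same upper bits, tag 00
  if ((mark & 0b11) == 0b01 && ls->top < LockStack::CAPACITY &&
      obj->mark.compare_exchange_strong(mark, locked)) {
    ls->elems[ls->top++] = obj;                 // record ownership thread-locally
    return true;
  }
  return false;                                 // contended or recursive: inflate
}
```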
In contrast to stack-locking, fast-locking does not support recursive locking (yet). When recursion happens, the fast-lock gets inflated to a full monitor. It is not clear whether it is worth adding support for recursive fast-locking.
One complication is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread and record it in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is record a special marker, ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit and observes ANONYMOUS_OWNER, it knows the owner must be itself, fixes the owner to be itself, and then properly exits the monitor, thus handing it over to the contending thread.
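A sketch of that handover; the ANONYMOUS_OWNER sentinel value and the field names are illustrative only:

```c++
#include <cstdint>

struct Thread;
struct ObjectMonitor {
  Thread*   owner;
  uintptr_t displaced_header;  // plus entry/wait queues in the real thing
};

static Thread* const ANONYMOUS_OWNER = (Thread*)1;  // assumed sentinel value

// Contending thread: inflates the fast-lock without knowing who owns it.
void inflate_anonymously(ObjectMonitor* mon) {
  mon->owner = ANONYMOUS_OWNER;  // "somebody owns this, we don't know who"
}

// Owning thread, at monitorexit: an anonymous owner can only be ourselves.
void monitor_exit(ObjectMonitor* mon, Thread* self) {
  if (mon->owner == ANONYMOUS_OWNER) {
    mon->owner = self;           // fix up ownership first...
  }
  // ...then run the regular exit protocol, handing over to the contender.
}
```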
As an alternative, I considered removing stack-locking altogether and only using heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer, or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. the workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer), and 2. the workload churns such locks. In other words, uncontended use of Vector, StringBuffer, etc. as such is OK, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can have a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it.
This change enables us to simplify (and speed up!) a lot of code:
- The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header.
- Accessing the hashcode can now always be done in the fast path, if the hashcode has been installed. Fast-locked headers can be used directly; for monitor-locked objects we can easily reach through to the displaced header. This is safe because Java threads participate in the monitor deflation protocol. This will be implemented in a separate PR (see the sketch below).
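A sketch of what that hashcode fast path might look like, reusing the simplified ObjectHeader and ObjectMonitor types from the sketches above; the 10 tag for 'monitor-locked' and the HASH_SHIFT/HASH_MASK layout are assumptions, and the actual change is deferred to that separate PR:

```c++
static const int       HASH_SHIFT = 8;          // assumed header layout
static const uintptr_t HASH_MASK  = 0x7FFFFFFF;

uintptr_t load_hash(ObjectHeader* obj) {
  uintptr_t mark = obj->mark.load(std::memory_order_acquire);
  if ((mark & 0b11) == 0b10) {                  // assumed tag: monitor-locked
    ObjectMonitor* mon = (ObjectMonitor*)(mark & ~(uintptr_t)0b11);
    mark = mon->displaced_header;               // reach through to the header;
  }                                             // stable, since Java threads take
                                                // part in the deflation protocol
  return (mark >> HASH_SHIFT) & HASH_MASK;      // 0 means: not installed yet
}
```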
Benchmarks
All benchmarks are run on server-class metal machines. The JVM settings are always: -Xmx20g -Xms20g -XX:+UseParallelGC. All benchmarks are ms/ops; less is better.
DaCapo/AArch64
Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance), using the DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude the cassandra, h2o & kafka benchmarks because of incompatibility with JDK 20. Benchmarks that showed results far off the baseline or that showed high variance have been repeated, and I am reporting the results with the most bias against fast-locking. The sunflow benchmark is really far off the mark: the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean, what is it actually doing such that a change in locking shows a >30% perf difference?
benchmark | baseline | fast-locking | % | size |
---|---|---|---|---|
avrora | 27859 | 27563 | 1.07% | large |
batik | 20786 | 20847 | -0.29% | large |
biojava | 27421 | 27334 | 0.32% | default |
eclipse | 59918 | 60522 | -1.00% | large |
fop | 3670 | 3678 | -0.22% | default |
graphchi | 2088 | 2060 | 1.36% | default |
h2 | 297391 | 291292 | 2.09% | huge |
jme | 8762 | 8877 | -1.30% | default |
jython | 18938 | 18878 | 0.32% | default |
luindex | 1339 | 1325 | 1.06% | default |
lusearch | 918 | 936 | -1.92% | default |
pmd | 58291 | 58423 | -0.23% | large |
sunflow | 32617 | 24961 | 30.67% | large |
tomcat | 25481 | 25992 | -1.97% | large |
tradebeans | 314640 | 311706 | 0.94% | huge |
tradesoap | 107473 | 110246 | -2.52% | huge |
xalan | 6047 | 5882 | 2.81% | default |
zxing | 970 | 926 | 4.75% | default |
DaCapo/x86_64
The following measurements have been taken on an Intel Xeon Scalable Processor (Cascade Lake 8252C) (an AWS m5zn.metal instance), with all the same settings and considerations as in the measurements above.
benchmark | baseline | fast-locking | % | size |
---|---|---|---|---|
avrora | 127690 | 126749 | 0.74% | large |
batik | 12736 | 12641 | 0.75% | large |
biojava | 15423 | 15404 | 0.12% | default |
eclipse | 41174 | 41498 | -0.78% | large |
fop | 2184 | 2172 | 0.55% | default |
graphchi | 1579 | 1560 | 1.22% | default |
h2 | 227614 | 230040 | -1.05% | huge |
jme | 8591 | 8398 | 2.30% | default |
jython | 13473 | 13356 | 0.88% | default |
luindex | 824 | 813 | 1.35% | default |
lusearch | 962 | 968 | -0.62% | default |
pmd | 40827 | 39654 | 2.96% | large |
sunflow | 53362 | 43475 | 22.74% | large |
tomcat | 27549 | 28029 | -1.71% | large |
tradebeans | 190757 | 190994 | -0.12% | huge |
tradesoap | 68099 | 67934 | 0.24% | huge |
xalan | 7969 | 8178 | -2.56% | default |
zxing | 1176 | 1148 | 2.44% | default |
Renaissance/AArch64
This tests Renaissance/JMH version 0.14.1 on the same machines as DaCapo above, with the same JVM settings.
benchmark | baseline | fast-locking | % |
---|---|---|---|
AkkaUct | 2558.832 | 2513.594 | 1.80% |
Reactors | 14715.626 | 14311.246 | 2.83% |
Als | 1851.485 | 1869.622 | -0.97% |
ChiSquare | 1007.788 | 1003.165 | 0.46% |
GaussMix | 1157.491 | 1149.969 | 0.65% |
LogRegression | 717.772 | 733.576 | -2.15% |
MovieLens | 7916.181 | 8002.226 | -1.08% |
NaiveBayes | 395.296 | 386.611 | 2.25% |
PageRank | 4294.939 | 4346.333 | -1.18% |
FjKmeans | 496.076 | 493.873 | 0.45% |
FutureGenetic | 2578.504 | 2589.255 | -0.42% |
Mnemonics | 4898.886 | 4903.689 | -0.10% |
ParMnemonics | 4260.507 | 4210.121 | 1.20% |
Scrabble | 139.37 | 138.312 | 0.76% |
RxScrabble | 320.114 | 322.651 | -0.79% |
Dotty | 1056.543 | 1068.492 | -1.12% |
ScalaDoku | 3443.117 | 3449.477 | -0.18% |
ScalaKmeans | 259.384 | 258.648 | 0.28% |
Philosophers | 24333.311 | 23438.22 | 3.82% |
ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% |
FinagleChirper | 6814.192 | 6853.38 | -0.57% |
FinagleHttp | 4762.902 | 4807.564 | -0.93% |
Renaissance/x86_64
benchmark | baseline | fast-locking | % |
---|---|---|---|
AkkaUct | 1117.185 | 1116.425 | 0.07% |
Reactors | 11561.354 | 11812.499 | -2.13% |
Als | 1580.838 | 1575.318 | 0.35% |
ChiSquare | 459.601 | 467.109 | -1.61% |
GaussMix | 705.944 | 685.595 | 2.97% |
LogRegression | 659.944 | 656.428 | 0.54% |
MovieLens | 7434.303 | 7592.271 | -2.08% |
NaiveBayes | 413.482 | 417.369 | -0.93% |
PageRank | 3259.233 | 3276.589 | -0.53% |
FjKmeans | 946.429 | 938.991 | 0.79% |
FutureGenetic | 1760.672 | 1815.272 | -3.01% |
ParMnemonics | 2016.917 | 2033.101 | -0.80% |
Scrabble | 147.996 | 150.084 | -1.39% |
RxScrabble | 177.755 | 177.956 | -0.11% |
Dotty | 673.754 | 683.919 | -1.49% |
ScalaDoku | 2193.562 | 1958.419 | 12.01% |
ScalaKmeans | 165.376 | 168.925 | -2.10% |
ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% |
Philosophers | 14268.449 | 13308.87 | 7.21% |
FinagleChirper | 4722.13 | 4688.3 | 0.72% |
FinagleHttp | 3497.241 | 3605.118 | -2.99% |
Some Renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK 20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and will probably address by running them much more often).
I have also run another popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious).
Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here.
Testing
- [x] tier1 (x86_64, aarch64, x86_32)
- [x] tier2 (x86_64, aarch64)
- [x] tier3 (x86_64, aarch64)
- [x] tier4 (x86_64, aarch64)
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Issue
- JDK-8291555: Replace stack-locking with fast-locking
Contributors
- Aleksey Shipilev <[email protected]>
Reviewing
Using git
Check out this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/10590/head:pull/10590
$ git checkout pull/10590
Update a local copy of the PR:
$ git checkout pull/10590
$ git pull https://git.openjdk.org/jdk pull/10590/head
Using Skara CLI tools
Check out this PR locally:
$ git pr checkout 10590
View PR using the GUI difftool:
$ git pr show -t 10590
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/10590.diff
:wave: Welcome back rkennke! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
@rkennke The following labels will be automatically applied to this pull request:
- hotspot
- serviceability
- shenandoah
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.
Webrevs
- 07: Full (3f0acba4)
- 06: Full (a67eb95e)
- 05: Full - Incremental (57403ad1)
- 04: Full - Incremental (8d146b99)
- 03: Full - Incremental (d9153be5)
- 02: Full - Incremental (4ccdab8f)
- 01: Full - Incremental (34bed54f)
- 00: Full (3ed51053)
⚠️ @rkennke This pull request contains merges that bring in commits not present in the target repository. Since this is not a "merge style" pull request, these changes will be squashed when this pull request is integrated. If this is your intention, then please ignore this message. If you want to preserve the commit structure, you must change the title of this pull request to Merge <project>:<branch>, where <project> is the name of another project in the OpenJDK organization (for example Merge jdk:master).
Regarding benchmarks, is it possible to get some indication of what the fast-locking+Lilliput result will be? FinagleHttp seems to suffer a bit; will Lilliput give some/all of that back, or more?
Here is the basic support for RISC-V: https://cr.openjdk.java.net/~shade/8291555/riscv-patch-1.patch -- I adapted this from the AArch64 changes, and tested it very lightly. @RealFYang, can I leave the testing and follow-up fixes to you?
/contributor add shipilev
@rkennke shipilev was not found in the census.
Syntax: /contributor (add|remove) [@user | openjdk-user | Full Name <email@address>]. For example:
- /contributor add @openjdk-bot
- /contributor add duke
- /contributor add J. Duke <[email protected]>
User names can only be used for users in the census associated with this repository. For other contributors you need to supply the full name and email address.
/contributor add shade
@rkennke Contributor Aleksey Shipilev <[email protected]> successfully added.
On aarch64 (linux and mac) I see these variations of crashes in random tests (asserts in debug builds; crashes in release builds, it looks like):
# Internal Error .... src/hotspot/share/c1/c1_Runtime1.cpp:768), pid=2884803, tid=2884996
# assert(oopDesc::is_oop(oop(obj))) failed: must be NULL or an object: 0x000000000000dead
# V [libjvm.so+0x7851d4] Runtime1::monitorexit(JavaThread*, oopDesc*)+0x110
# SIGSEGV (0xb) at pc=0x0000fffc9d4e3de8, pid=1842880, tid=1842994
# V [libjvm.so+0xbf3de8] SharedRuntime::monitor_exit_helper(oopDesc*, JavaThread*)+0x24
# SIGSEGV (0xb) at pc=0x0000fffca9f00394, pid=959883, tid=959927
# V [libjvm.so+0xc90394] ObjectSynchronizer::exit(oopDesc*, JavaThread*)+0x54
Ugh. That is most likely caused by the recent change: https://github.com/rkennke/jdk/commit/ebbcb615a788998596f403b47b72cf133cb9de46
It used to be very stable before that. I have backed out that change, can you try again?
Thanks, Roman
Regarding benchmarks, is it possible to get some indication of what the fast-locking+Lilliput result will be? FinagleHttp seems to suffer a bit; will Lilliput give some/all of that back, or more?
That particular benchmark, like some others, exhibits relatively high run-to-run variance. I have run it again many more times to average out the variance, and I'm now getting the following results: baseline: 3503.844 ms/ops, fast-locking: 3546.344 ms/ops, percent: -1.20%.
That is still a slight regression, but with more confidence.
Regarding Lilliput, I cannot really say at the moment. Some workloads actually regress with Lilliput, presumably because they are sensitive to the performance of loading the Klass* out of objects, which is currently more complex in Lilliput (because it needs to coordinate with monitor locking). FinagleHttp seems to be one of those workloads. I am working to get rid of this limitation, and then I can be more specific.
Here is the basic support for RISC-V: https://cr.openjdk.java.net/~shade/8291555/riscv-patch-1.patch -- I adapted this from the AArch64 changes, and tested it very lightly. @RealFYang, can I leave the testing and follow-up fixes to you?
@shipilev : Sure, I am happy to do that! Thanks for porting this to RISC-V :-)
It used to be very stable before that. I have backed out that change, can you try again?
Seems fine now, thanks.
Update: I had some minor unrelated issues, so I re-ran; now I see one instance of (linux-aarch64):
# SIGSEGV (0xb) at pc=0x0000fffe21073ce0, pid=4006248, tid=4019428
#
# JRE version: Java(TM) SE Runtime Environment (20.0) (build 20-internal-2022-10-14-0606155.robbin.ehn.dev-jdk)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20-internal-2022-10-14-0606155.robbin.ehn.dev-jdk, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# V [libjvm.so+0xbf3ce0] SharedRuntime::complete_monitor_locking_C(oopDesc*, JavaThread*)+0x4c
The tests are still running; when they are done I will re-verify the source and re-run a third time.
Can you merge latest head?
@shipilev : After applying this on today's jdk master, the linux-riscv64 fastdebug build fails on HiFive Unmatched: I see a JVM crash during the build process. I suppose you carried out the test with a release build, right?
Have you applied the whole PR, or only the patch that @shipilev provided? Because only the patch, without the rest of the PR, is bound to fail.
Yes, the whole PR: https://patch-diff.githubusercontent.com/raw/openjdk/jdk/pull/10590.diff
The PR reports a merge conflict in the RISC-V code when applied against the latest tip. Have you resolved that? GHA (which includes RISC-V) is happy otherwise.
@rkennke : I did see some "Hunk succeeded" messages for the RISC-V part when applying the change with $ patch -p1 < ~/10590.diff, but I didn't check whether that will cause a problem here:
patching file src/hotspot/cpu/riscv/c1_CodeStubs_riscv.cpp
patching file src/hotspot/cpu/riscv/c1_LIRAssembler_riscv.cpp
patching file src/hotspot/cpu/riscv/c1_LIRGenerator_riscv.cpp
patching file src/hotspot/cpu/riscv/c1_MacroAssembler_riscv.cpp
Hunk #1 succeeded at 58 (offset -1 lines).
Hunk #2 succeeded at 67 (offset -1 lines).
patching file src/hotspot/cpu/riscv/c1_Runtime1_riscv.cpp
patching file src/hotspot/cpu/riscv/interp_masm_riscv.cpp
patching file src/hotspot/cpu/riscv/macroAssembler_riscv.cpp
Hunk #1 succeeded at 2499 (offset 324 lines).
Hunk #2 succeeded at 4474 (offset 330 lines).
patching file src/hotspot/cpu/riscv/macroAssembler_riscv.hpp
Hunk #1 succeeded at 869 with fuzz 2 (offset 313 lines).
Hunk #2 succeeded at 1252 (offset 325 lines).
patching file src/hotspot/cpu/riscv/riscv.ad
Hunk #1 succeeded at 2385 (offset 7 lines).
Hunk #2 succeeded at 2407 (offset 7 lines).
Hunk #3 succeeded at 2433 (offset 7 lines).
Hunk #4 succeeded at 10403 (offset 33 lines).
Hunk #5 succeeded at 10417 (offset 33 lines).
patching file src/hotspot/cpu/riscv/sharedRuntime_riscv.cpp
Hunk #1 succeeded at 975 (offset 21 lines).
Hunk #2 succeeded at 1030 (offset 21 lines).
Hunk #3 succeeded at 1042 (offset 21 lines).
Hunk #4 succeeded at 1058 (offset 21 lines).
Hunk #5 succeeded at 1316 (offset 24 lines).
Hunk #6 succeeded at 1416 (offset 24 lines).
Hunk #7 succeeded at 1492 (offset 24 lines).
Hunk #8 succeeded at 1517 (offset 24 lines).
Hunk #9 succeeded at 1621 (offset 24 lines).
If you take the latest code from this PR, it would already have the patch applied. No need to patch it again.
@rkennke : Could you please add this follow-up fix for RISC-V? I can build fastdebug on the HiFive Unmatched board with this fix now, and run non-trivial benchmark workloads. I will carry out more tests.
/contributor fyang
@rkennke This pull request cannot be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request, you can run the following commands in the local repository for your personal fork:
git checkout fast-locking
git fetch https://git.openjdk.org/jdk master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
@rkennke Syntax: /contributor (add|remove) [@user | openjdk-user | Full Name <email@address>]. For example:
- /contributor add @openjdk-bot
- /contributor add duke
- /contributor add J. Duke <[email protected]>
User names can only be used for users in the census associated with this repository. For other contributors you need to supply the full name and email address.
/contributor add fyang
@rkennke Contributor Fei Yang <[email protected]> successfully added.
First, the "SharedRuntime::complete_monitor_locking_C" crash does not reproduce.
Secondly, a question/suggestion: many recursive cases do not interleave locks, meaning the recursive enter will happen with that lock/oop already on top of the lock-stack. Why not peek at the top lock/oop in the lock-stack: if it is the current one, just push it again and the locking is done (instead of inflating)? (Exit would need to check whether this is the last one and then do a proper exit.)
Worried about the size of the lock-stack?
This PR has been in the "merge-conflict" state for about 10 days. When do you plan to merge again with the jdk/jdk repo?
Update: Ignore this. For some reason the "removed merge-conflict" from four days ago didn't show up until I refreshed the browser (again)...
Secondly, a question/suggestion: many recursive cases do not interleave locks, meaning the recursive enter will happen with that lock/oop already on top of the lock-stack. Why not peek at the top lock/oop in the lock-stack: if it is the current one, just push it again and the locking is done (instead of inflating)? (Exit would need to check whether this is the last one and then do a proper exit.)
The CJM paper (Dice/Kogan 2021) mentions a "nesting" counter for this purpose. I suspect that a real counter is overkill, and the "unary" representation Robbin mentions would be fine, especially if there were a point (when the per-thread stack gets too big) at which we go and inflate anyway.
The CJM paper suggests a full search of the per-thread array to detect the recursive condition, but again I like Robbin's idea of checking only the most recent lock record.
So the data structure for lock records (per thread) could consist of a series of distinct values [ A B C ] and each of the values could be repeated, but only adjacently: [ A A A B C C ] for example. And there could be a depth limit as well. Any sequence of held locks not expressible within those limitations could go to inflation as a backup.
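To make that concrete, here is a sketch of the adjacency-based scheme, reusing the simplified lock-stack shape from the sketches in the PR description; the bound and names are illustrative:

```c++
struct ObjectHeader;                            // as in the earlier sketches
struct LockStack { ObjectHeader* elems[8]; int top; };

// Recursive enter is accepted only if the same lock is already on top, so
// entries for one lock stay adjacent, e.g. [ A A A B C C ].
bool try_recursive_enter(ObjectHeader* obj, LockStack* ls) {
  if (ls->top > 0 && ls->top < 8 && ls->elems[ls->top - 1] == obj) {
    ls->elems[ls->top++] = obj;                 // 'unary' recursion count
    return true;                                // locking done, no inflation
  }
  return false;                                 // interleaved or too deep: inflate
}

// Recursive exit: only the last entry for this lock does the real unlock.
bool recursive_exit(ObjectHeader* obj, LockStack* ls) {
  ls->top--;
  if (ls->top > 0 && ls->elems[ls->top - 1] == obj) {
    return true;                                // still held recursively
  }
  // last entry: CAS the header tag back to 'neutral' (01) here
  return false;
}
```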