8291555: Replace stack-locking with fast-locking

Open rkennke opened this issue 3 years ago • 25 comments

This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths) while avoiding the overloading of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check for, and deal with, this situation. And because of the very racy nature, this turns out to be very complex and involves a variant of the inflation protocol to ensure that the object header is stable.

What the original stack-locking does is basically push a stack-lock, which consists only of the displaced header, onto the stack, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicates 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock.

This change basically reverses stack-locking: it still CASes the lowest two header bits to 00 to indicate 'fast-locked', but does not overload the upper bits with a stack pointer. Instead, it pushes the object reference onto a thread-local lock stack. This is a new structure which is basically a small array of oops associated with each thread. Experience shows that this array typically remains very small (3-5 elements). Using this lock stack, it is possible to query which thread owns which locks. Most importantly, the most common question, 'does the current thread own me?', is answered very quickly by a scan of the array. More complex queries like 'which thread owns X?' are only performed in paths that are not very performance-critical (usually in code like JVMTI or deadlock detection), where it is ok to do more complex operations.

In contrast to stack-locking, fast-locking does not support recursive locking (yet). When recursive locking is attempted, the fast-lock gets inflated to a full monitor. It is not clear whether it is worth adding support for recursive fast-locking.

One complication is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread and record it in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is record a special marker, ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, thereby handing it over to the contending thread.

As an alternative, I considered removing stack-locking altogether and only using heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. But alas, such code exists, and we probably don't want to punish it if we can avoid it.

This change enables us to simplify (and speed up!) a lot of code:

  • The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header.
  • Accessing the hashcode could now always be done in the fast path, provided the hashcode has been installed. Fast-locked headers can be used directly; for monitor-locked objects we can easily reach through to the displaced header. This is safe because Java threads participate in the monitor deflation protocol. This would be implemented in a separate PR.

Progress

  • [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue

Issue

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9680/head:pull/9680
$ git checkout pull/9680

Update a local copy of the PR:
$ git checkout pull/9680
$ git pull https://git.openjdk.org/jdk pull/9680/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 9680

View PR using the GUI difftool:
$ git pr show -t 9680

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9680.diff

rkennke avatar Jul 28 '22 19:07 rkennke

:wave: Welcome back rkennke! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Jul 28 '22 20:07 bridgekeeper[bot]

@rkennke this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout fast-locking
git fetch https://git.openjdk.org/jdk master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

openjdk[bot] avatar Jul 28 '22 20:07 openjdk[bot]

@rkennke The following labels will be automatically applied to this pull request:

  • hotspot
  • serviceability
  • shenandoah

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Jul 28 '22 20:07 openjdk[bot]

When I run the renaissance philosophers benchmark (no arguments, just the default settings) on my 12-core machine, the VM intermittently hangs after the benchmark is done. Always, two threads keep running at 100% CPU.

I have been able to attach gdb once and we were in a tight loop in

(gdb) bt
#0  Atomic::PlatformLoad<8ul>::operator()<void*> (dest=0x7f991c119e80, this=<optimized out>) at src/hotspot/share/runtime/atomic.hpp:614
#1  Atomic::LoadImpl<void*, Atomic::PlatformLoad<8ul>, void>::operator() (dest=0x7f991c119e80, this=<optimized out>) at src/hotspot/share/runtime/atomic.hpp:392
#2  Atomic::load<void*> (dest=0x7f991c119e80) at src/hotspot/share/runtime/atomic.hpp:615
#3  ObjectMonitor::owner_raw (this=0x7f991c119e40) at src/hotspot/share/runtime/objectMonitor.inline.hpp:66
#4  ObjectMonitor::owner (this=0x7f991c119e40) at src/hotspot/share/runtime/objectMonitor.inline.hpp:61
#5  ObjectSynchronizer::monitors_iterate (thread=0x7f9a30027230, closure=<synthetic pointer>) at src/hotspot/share/runtime/synchronizer.cpp:983
#6  ObjectSynchronizer::release_monitors_owned_by_thread (current=current@entry=0x7f9a30027230) at src/hotspot/share/runtime/synchronizer.cpp:1492
#7  0x00007f9a351bc320 in JavaThread::exit (this=this@entry=0x7f9a30027230, destroy_vm=destroy_vm@entry=false, exit_type=exit_type@entry=JavaThread::jni_detach) at src/hotspot/share/runtime/javaThread.cpp:851
#8  0x00007f9a352445ca in jni_DetachCurrentThread (vm=<optimized out>) at src/hotspot/share/prims/jni.cpp:3962
#9  0x00007f9a35f9ac7e in JavaMain (_args=<optimized out>) at src/java.base/share/native/libjli/java.c:555
#10 0x00007f9a35f9e30d in ThreadJavaMain (args=<optimized out>) at src/java.base/unix/native/libjli/java_md.c:650
#11 0x00007f9a35d47609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#12 0x00007f9a35ea3133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

in one thread, which points to a malformed monitor list. I tried to reproduce it with a debug build, but no such luck. I was able to reproduce it once again with a release build. I'll see if I can find out more.

tstuefe avatar Aug 02 '22 15:08 tstuefe

Happens when the main thread detaches itself upon VM exit. VM attempts to release OMs that are owned by the finished main thread (side note: if that is the sole surviving thread, maybe that step could be skipped?). That happens before DestroyVM, so OM final audit did not yet run.

Problem here is the OM in use list is circular (and very big, ca 11mio entries).

I was able to reproduce it with a fastdebug build in 1 out of 5-6 runs. Also with less benchmark cycles (-r 3).

tstuefe avatar Aug 03 '22 07:08 tstuefe

Offlist questions from Roman:

-"Does it really not happen with Stock?" no, I could not reproduce it with stock VM (built from f5d1b5bda27c798347ae278cbf69725ed4be895c, the commit preceding the PR)

-"Do we now have more OMs than before?" I cannot see that effect. Running philosophers with -r 3 causes the VM in the end to have between 800k and ~2mio open OMs if the error does not happen, no difference between stock and PR VM. In cases where the PR-VM hangs we have a lot more, as I wrote, about 11-12mio OMs.

tstuefe avatar Aug 03 '22 13:08 tstuefe

Happens when the main thread detaches itself upon VM exit. VM attempts to release OMs that are owned by the finished main thread (side note: if that is the sole surviving thread, maybe that step could be skipped?). That happens before DestroyVM, so OM final audit did not yet run.

Problem here is the OM in use list is circular (and very big, ca 11mio entries).

I was able to reproduce it with a fastdebug build in 1 out of 5-6 runs. Also with less benchmark cycles (-r 3).

Hi Thomas, thanks for testing and reporting the issue. I just pushed an improvement (and simplification) of the monitor-enter-inflate path, and cannot seem to reproduce the problem anymore. Can you please try again with the latest change?

rkennke avatar Aug 07 '22 12:08 rkennke

Happens when the main thread detaches itself upon VM exit. VM attempts to release OMs that are owned by the finished main thread (side note: if that is the sole surviving thread, maybe that step could be skipped?). That happens before DestroyVM, so OM final audit did not yet run. Problem here is the OM in use list is circular (and very big, ca 11mio entries). I was able to reproduce it with a fastdebug build in 1 out of 5-6 runs. Also with less benchmark cycles (-r 3).

Hi Thomas, thanks for testing and reporting the issue. I just pushed an improvement (and simplification) of the monitor-enter-inflate path, and cannot seem to reproduce the problem anymore. Can you please try again with the latest change?

New version ran for 30 mins without crashing. Not a solid proof, but it's better :-)

tstuefe avatar Aug 07 '22 16:08 tstuefe

The bar for acceptance for a brand new locking scheme with no fallback is extremely high and needs a lot of bake time and broad performance measurements, to watch for pathologies. That bar is lower if the scheme can be reverted to the old code if needed; and even lower still if the scheme is opt-in in the first place. For Java Object Monitors I made the new mechanism opt-in so the same could be done here. Granted it is not a trivial effort to do that, but I think a phased approach to transition to the new scheme is essential. It could be implemented as an experimental feature initially.

dholmes-ora avatar Aug 08 '22 12:08 dholmes-ora

The bar for acceptance for a brand new locking scheme with no fallback is extremely high and needs a lot of bake time and broad performance measurements, to watch for pathologies. That bar is lower if the scheme can be reverted to the old code if needed; and even lower still if the scheme is opt-in in the first place. For Java Object Monitors I made the new mechanism opt-in so the same could be done here. Granted it is not a trivial effort to do that, but I think a phased approach to transition to the new scheme is essential. It could be implemented as an experimental feature initially.

Reverting a change should not be difficult. (Unless maybe another major change arrived in the meantime, which makes reverse-applying a patch non-trivial.)

I'm skeptical about implementing an opt-in runtime switch, though.

  • Keeping the old paths side-by-side with the new paths is an engineering effort in itself, as you point out. It means that it, too, introduces significant risks of breaking locking, one way or the other (or both).
  • Making the new path opt-in means that we achieve almost nothing by it: testing code would still normally run the old paths (hopefully we didn't break them by making the change) and only use the new paths when explicitly told to, and I don't expect that many people will voluntarily do that. It may be more useful to make it opt-out, as a quick fix if anybody experiences trouble with it.
  • Do we need a runtime-switchable opt-in or opt-out flag for the initial testing and baking? I wouldn't think so: it seems better and cleaner to take the Git branch of this PR and put it through all relevant testing before the change goes in.
  • For how long do you think the runtime switch should stay? Because if it's anything but temporary, it means we had better test both paths thoroughly and in an automated way. And it may also mean extra maintenance work (with extra avenues for bugs, see above), too.

rkennke avatar Aug 08 '22 13:08 rkennke

The bar for acceptance for a brand new locking scheme with no fallback is extremely high and needs a lot of bake time and broad performance measurements, to watch for pathologies. That bar is lower if the scheme can be reverted to the old code if needed; and even lower still if the scheme is opt-in in the first place. For Java Object Monitors I made the new mechanism opt-in so the same could be done here. Granted it is not a trivial effort to do that, but I think a phased approach to transition to the new scheme is essential. It could be implemented as an experimental feature initially.

I fully agree that we have to be careful, but I share Roman's viewpoint. If this work is something we want to happen, and it is not in doubt in principle, then we also want the broadest possible test front.

In my experience, opt-in code is tested poorly. A runtime switch is fine as an emergency measure when you have customer problems, but then both standard and fallback code paths need to be very well tested. With something as ubiquitous as locking, this would mean running almost the full test set with and without the new fast-locking mechanism, and that is not feasible. Or even if it is, not practical: the cycles are better invested in hardening the new locking mechanism.

And arguably, we already have an opt-out mechanism in the form of UseHeavyMonitors. It's not ideal, but as Roman wrote, in most scenarios, this does not show any regression. So in a pinch, it could serve as a short-term solution if the new fast lock mechanism is broken.

In my opinion, the best time for such an invasive change is the beginning of the development cycle for a non-LTS-release, like now. And we don't have to push the PR in a rush, we can cook it in its branch and review it very thoroughly.

Cheers, Thomas

tstuefe avatar Aug 08 '22 15:08 tstuefe

I ran some test locally, 4 JDI fails and 3 JVM TI, all seems to fail in:

#7  0x00007f7cefc5c1ce in Thread::is_lock_owned (this=this@entry=0x7f7ce801dd90, adr=adr@entry=0x1 <error: Cannot access memory at address 0x1>) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/thread.cpp:549
#8  0x00007f7cef22c062 in JavaThread::is_lock_owned (this=0x7f7ce801dd90, adr=0x1 <error: Cannot access memory at address 0x1>) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/javaThread.cpp:979
#9  0x00007f7cefc79ab0 in Threads::owning_thread_from_monitor_owner (t_list=<optimized out>, owner=owner@entry=0x1 <error: Cannot access memory at address 0x1>)
    at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/threads.cpp:1382

robehn avatar Aug 08 '22 15:08 robehn

I ran some test locally, 4 JDI fails and 3 JVM TI, all seems to fail in:

#7  0x00007f7cefc5c1ce in Thread::is_lock_owned (this=this@entry=0x7f7ce801dd90, adr=adr@entry=0x1 <error: Cannot access memory at address 0x1>) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/thread.cpp:549
#8  0x00007f7cef22c062 in JavaThread::is_lock_owned (this=0x7f7ce801dd90, adr=0x1 <error: Cannot access memory at address 0x1>) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/javaThread.cpp:979
#9  0x00007f7cefc79ab0 in Threads::owning_thread_from_monitor_owner (t_list=<optimized out>, owner=owner@entry=0x1 <error: Cannot access memory at address 0x1>)
    at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/threads.cpp:1382

Thanks, Robbin! That was a bug in JvmtiBase::get_owning_thread() where an anonymous owner must be converted to the oop address before passing down to Threads::owning_thread_from_monitor_owner(). I pushed a fix. Can you re-test? Testing com/sun/jdi passes for me, now.

rkennke avatar Aug 08 '22 18:08 rkennke

I am not aware, please refresh my memory if you know different, of any core hotspot subsystem just being replaced in one fell swoop in one single release. Yes this needs a lot of testing but customers are not beta-testers. If this goes into a release on by default then there must be a way for customers to turn it off. UseHeavyMonitors is not a fallback as it is not for production use itself. So the new code has to co-exist along-side the old code as we make a transition across 2-3 releases. And yes that means a double-up on some testing as we already do for many things.

dholmes-ora avatar Aug 09 '22 05:08 dholmes-ora

I ran some test locally, 4 JDI fails and 3 JVM TI, all seems to fail in:

#7  0x00007f7cefc5c1ce in Thread::is_lock_owned (this=this@entry=0x7f7ce801dd90, adr=adr@entry=0x1 <error: Cannot access memory at address 0x1>) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/thread.cpp:549
#8  0x00007f7cef22c062 in JavaThread::is_lock_owned (this=0x7f7ce801dd90, adr=0x1 <error: Cannot access memory at address 0x1>) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/javaThread.cpp:979
#9  0x00007f7cefc79ab0 in Threads::owning_thread_from_monitor_owner (t_list=<optimized out>, owner=owner@entry=0x1 <error: Cannot access memory at address 0x1>)
    at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/threads.cpp:1382

Thanks, Robbin! That was a bug in JvmtiBase::get_owning_thread() where an anonymous owner must be converted to the oop address before passing down to Threads::owning_thread_from_monitor_owner(). I pushed a fix. Can you re-test? Testing com/sun/jdi passes for me, now.

Yes, that fixed it. I'm running more tests also.

I got this build problem on aarch64:

open/src/hotspot/share/asm/assembler.hpp:168), pid=3387376, tid=3387431
#  assert(is_bound() || is_unused()) failed: Label was never bound to a location, but it was used as a jmp target

V  [libjvm.so+0x4f4788]  Label::~Label()+0x48
V  [libjvm.so+0x424a44]  cmpFastLockNode::emit(CodeBuffer&, PhaseRegAlloc*) const+0x764
V  [libjvm.so+0x1643888]  PhaseOutput::fill_buffer(CodeBuffer*, unsigned int*)+0x538
V  [libjvm.so+0xa85fcc]  Compile::Code_Gen()+0x3bc

robehn avatar Aug 09 '22 09:08 robehn

I am not aware, please refresh my memory if you know different, of any core hotspot subsystem just being replaced in one fell swoop in one single release. Yes this needs a lot of testing but customers are not beta-testers. If this goes into a release on by default then there must be a way for customers to turn it off. UseHeavyMonitors is not a fallback as it is not for production use itself. So the new code has to co-exist along-side the old code as we make a transition across 2-3 releases. And yes that means a double-up on some testing as we already do for many things.

I believe the least risky path overall is to make UseHeavyMonitors a production flag. Then it can act as a kill-switch for the new locking code, should anything go bad. I even considered removing stack-locking altogether, and could only show minor performance impact, and only in code that uses obsolete synchronized Java collections like Vector, Stack and StringBuffer. If you'd argue that it's too risky to use UseHeavyMonitors for that, then certainly you understand that the risk of introducing a new flag and managing two stack-locking subsystems would be even higher.

There's a lot of code where keeping both paths would be risky in itself. For example, I needed to change register allocation in the C2 .ad declarations and also in the interpreter/generated assembly code. It's hard enough to see that it is correct for one of the implementations, and much harder to implement and verify this correctly for two.

rkennke avatar Aug 09 '22 09:08 rkennke

I got this build problem on aarch64:

Thanks for giving this PR a spin. I pushed a fix for the aarch64 build problem (seems weird that GHA did not catch it).

rkennke avatar Aug 09 '22 10:08 rkennke

Thanks for giving this PR a spin. I pushed a fix for the aarch64 build problem (seems weird that GHA did not catch it).

NP, thanks. I notice some other user of owning_thread_from_monitor_owner() such as DeadlockCycle::print_on_with() which asserts on "assert(adr != reinterpret_cast<void*>(1)) failed: must convert to lock object".

robehn avatar Aug 09 '22 11:08 robehn

I am not aware, please refresh my memory if you know different, of any core hotspot subsystem just being replaced in one fell swoop in one single release. Yes this needs a lot of testing but customers are not beta-testers. If this goes into a release on by default then there must be a way for customers to turn it off. UseHeavyMonitors is not a fallback as it is not for production use itself. So the new code has to co-exist along-side the old code as we make a transition across 2-3 releases. And yes that means a double-up on some testing as we already do for many things.

Maybe it's worth stepping back a little and discussing whether or not we actually want stack-locking (or a replacement) at all. My measurements seem to indicate that a majority of modern workloads (i.e. properly synchronized, not using legacy collections) actually benefit from running without stack-locking (or the fast-locking replacement). The workloads that suffer seem to be only those which make heavy use of always-synchronized collections, code that we'd nowadays probably not consider 'idiomatic Java' anymore. This means that support for faster legacy code costs modern Java code actual performance points. Do we really want this? It may be wiser overall to simply drop stack-locking without replacement, and go and fix the identified locations where the use of legacy collections affects performance negatively in the JDK (I found a few places in XML/XSLT code, for example).

I am currently re-running my benchmarks to show this.

rkennke avatar Aug 09 '22 11:08 rkennke

Any fast locking scheme benefits the uncontended sync case. So if you have a lot of contention and therefore a lot of inflation, the fast locking won't show any benefit. What "modern workloads" are you using to measure this? We eventually got rid of biased-locking because it no longer showed any benefit, so it is possible that fast locking (of whichever form) could go the same way. And we may have moved past heavy use of synchronized in general for that matter, especially as Loom instigated many changes over to java.util.concurrent locks.

Is UseHeavyMonitors in good enough shape to reliably be used for benchmark comparisons?

dholmes-ora avatar Aug 09 '22 11:08 dholmes-ora

Any fast locking scheme benefits the uncontended sync case. So if you have a lot of contention and therefore a lot of inflation, the fast locking won't show any benefit.

Not only that. As far as I can tell, 'heavy monitors' would only be worse off in workloads that 1. use uncontended sync and 2. churn monitors. Lots of uncontended sync on the same monitor object is not actually worse than fast-locking (it boils down to a single CAS in both cases). It only gets bad when code keeps allocating short-lived objects and syncs on them once or a few times only, and then moves on to the next new sync objects.

What "modern workloads" are you using to measure this?

So far I tested with SPECjbb and SPECjvm-workloads-transplanted-into-JMH, DaCapo and Renaissance. I could only measure regressions with heavy monitors in workloads that use XML/XSLT, which I found out is because the XSLT compiler generates code that uses StringBuffer for (single-threaded) parsing. I also found a few other places in XML where usage of Stack and Vector has some impact. I can provide fixes for those, if needed (but I'm not sure whether this should go into the JDK, upstream Xalan/Xerces, or both).

We eventually got rid of biased-locking because it no longer showed any benefit, so it is possible that fast locking (of whichever form) could go the same way. And we may have moved past heavy use of synchronized in general for that matter, especially as Loom instigated many changes over to java.util.concurrent locks.

Yup.

Is UseHeavyMonitors in good enough shape to reliably be used for benchmark comparisons?

Yes, except that the flag would have to be made product. Also, it is useful to use this PR instead of upstream JDK, because it simplifies the inflation protocol pretty much like it would be simplified without any stack-locking. I can make a standalone PR that gets rid of stack-locking altogether, if that is useful.

Also keep in mind that both this fast-locking PR and total removal of stack-locking would enable some follow-up improvements: we'd no longer have to inflate monitors in order to install or read an i-hashcode. And GC code may similarly benefit from easier reads/writes of object age bits. This might benefit generational concurrent GC efforts.

rkennke avatar Aug 09 '22 11:08 rkennke

Thanks for giving this PR a spin. I pushed a fix for the aarch64 build problem (seems weird that GHA did not catch it).

NP, thanks. I notice some other user of owning_thread_from_monitor_owner() such as DeadlockCycle::print_on_with() which asserts on "assert(adr != reinterpret_cast<void*>(1)) failed: must convert to lock object".

Do you know by any chance which tests trigger this?

rkennke avatar Aug 11 '22 11:08 rkennke

Thanks for giving this PR a spin. I pushed a fix for the aarch64 build problem (seems weird that GHA did not catch it).

NP, thanks. I notice some other user of owning_thread_from_monitor_owner() such as DeadlockCycle::print_on_with() which asserts on "assert(adr != reinterpret_cast<void*>(1)) failed: must convert to lock object".

Do you know by any chance which tests trigger this?

Yes, there are a couple to choose from; I think the jstack cmd may be easiest: jstack/DeadlockDetectionTest.java

robehn avatar Aug 11 '22 11:08 robehn

I added implementations for arm, ppc and s390 blindly. @shipilev, @tstuefe, maybe you could sanity-check them? Most likely they are buggy. I also haven't checked riscv at all, yet.

rkennke avatar Aug 12 '22 20:08 rkennke

Thanks for giving this PR a spin. I pushed a fix for the aarch64 build problem (seems weird that GHA did not catch it).

NP, thanks. I notice some other user of owning_thread_from_monitor_owner() such as DeadlockCycle::print_on_with() which asserts on "assert(adr != reinterpret_cast<void*>(1)) failed: must convert to lock object".

Do you know by any chance which tests trigger this?

Yes, there are a couple to choose from; I think the jstack cmd may be easiest: jstack/DeadlockDetectionTest.java

I pushed a refactoring and fixes to the relevant code, and all users should now work correctly. It's passing test tiers 1-3, and tier 4 is running while I write this.

rkennke avatar Aug 12 '22 20:08 rkennke

@robehn or @dholmes-ora I believe one of you mentioned somewhere (can't find the comment, though) that we might need to support the bytecode sequence monitorenter A; monitorenter B; monitorexit A; monitorexit B; properly. I have now made a testcase that checks this, and it does indeed fail with this PR, while passing with upstream. Also, the JVM spec doesn't mention anywhere that it is required that monitorenter/exit are properly nested. I'll have to fix this in the interpreter (JIT compilers refuse to compile not-properly-nested monitorenter/exit anyway).

See https://github.com/rkennke/jdk/blob/fast-locking/test/hotspot/jtreg/runtime/locking/TestUnstructuredLocking.jasm

rkennke avatar Aug 16 '22 13:08 rkennke

@robehn or @dholmes-ora I believe one of you mentioned somewhere (can't find the comment, though) that we might need to support the bytecode sequence monitorenter A; monitorenter B; monitorexit A; monitorexit B; properly. I have now made a testcase that checks this, and it does indeed fail with this PR, while passing with upstream. Also, the JVM spec doesn't mention anywhere that it is required that monitorenter/exit are properly nested. I'll have to fix this in the interpreter (JIT compilers refuse to compile not-properly-nested monitorenter/exit anyway).

See https://github.com/rkennke/jdk/blob/fast-locking/test/hotspot/jtreg/runtime/locking/TestUnstructuredLocking.jasm

jvms-2.11.10

Structured locking is the situation when, during a method invocation, every exit on a given monitor matches a preceding entry on that monitor. Since there is no assurance that all code submitted to the Java Virtual Machine will perform structured locking, implementations of the Java Virtual Machine are permitted but not required to enforce both of the following two rules guaranteeing structured locking. Let T be a thread and M be a monitor. Then:

The number of monitor entries performed by T on M during a method invocation must equal the number of monitor exits performed by T on M during the method invocation whether the method invocation completes normally or abruptly.

At no point during a method invocation may the number of monitor exits performed by T on M since the method invocation exceed the number of monitor entries performed by T on M since the method invocation.

Note that the monitor entry and exit automatically performed by the Java Virtual Machine when invoking a synchronized method are considered to occur during the calling method's invocation.

I think the intent of above was to allow enforcing structured locking.

In other relevant projects, we support only structured locking in Java, but permit some unstructured locking when done via JNI. There, JNI monitor enter/exit does not use the lock stack.

I don't think we today fully support unstructured locking either:

void foo_lock() {
  monitorenter(this);
  // If VM abruptly returns here 'this' will be unlocked
  // Because VM assumes structured locking.
  // see e.g. remove_activation(...)
}

I scratched this, as it was a bit off topic.

robehn avatar Aug 16 '22 15:08 robehn

@robehn or @dholmes-ora I believe one of you mentioned somewhere (can't find the comment, though) that we might need to support the bytecode sequence monitorenter A; monitorenter B; monitorexit A; monitorexit B; properly. I have now made a testcase that checks this, and it does indeed fail with this PR, while passing with upstream. Also, the JVM spec doesn't mention anywhere that it is required that monitorenter/exit are properly nested. I'll have to fix this in the interpreter (JIT compilers refuse to compile not-properly-nested monitorenter/exit anyway). See https://github.com/rkennke/jdk/blob/fast-locking/test/hotspot/jtreg/runtime/locking/TestUnstructuredLocking.jasm

jvms-2.11.10

Structured locking is the situation when, during a method invocation, every exit on a given monitor matches a preceding entry on that monitor. Since there is no assurance that all code submitted to the Java Virtual Machine will perform structured locking, implementations of the Java Virtual Machine are permitted but not required to enforce both of the following two rules guaranteeing structured locking. Let T be a thread and M be a monitor. Then: The number of monitor entries performed by T on M during a method invocation must equal the number of monitor exits performed by T on M during the method invocation whether the method invocation completes normally or abruptly. At no point during a method invocation may the number of monitor exits performed by T on M since the method invocation exceed the number of monitor entries performed by T on M since the method invocation. Note that the monitor entry and exit automatically performed by the Java Virtual Machine when invoking a synchronized method are considered to occur during the calling method's invocation.

I think the intent of above was to allow enforcing structured locking.

TBH, I don't see how this affects the scenario that I'm testing. The scenario monitorenter A; monitorenter B; monitorexit A; monitorexit B;

violates neither of the two conditions:

  • the number of monitorenters and -exits during the execution always matches
  • the number of monitorexits for each monitor never exceeds the number of monitorenters for the same monitor

Strictly speaking, I believe the conditions check for the (weaker) balanced property, but not for the (stronger) structured property.

In other relevant projects, we support only structured locking in Java, but permit some unstructured locking when done via JNI. In those projects, JNI monitor enter/exit does not use the lock-stack.

Yeah, JNI locking always inflates and uses full monitors. My proposal hasn't changed this.

I don't think we today fully support unstructured locking either:

void foo_lock() {
  monitorenter(this);
  // If VM abruptly returns here 'this' will be unlocked
  // Because VM assumes structured locking.
  // see e.g. remove_activation(...)
}

I scratch this as it was a bit off topic.

Hmm yeah, this is required for properly handling exceptions. I have seen this making a bit of a mess in C1 code.

That said, unstructured locking today only ever works in the interpreter; the JIT compilers refuse to compile unstructured locking code. So if somebody came up with a language and compiler that emits unstructured (e.g. hand-over-hand) locks, it would run, but only very slowly.

I think I know how to make my proposal handle unstructured locking properly: in the interpreter's monitorexit, I can check the top of the lock-stack, and if it doesn't match, call into the runtime, where it's easy to implement the unstructured scenario.

rkennke avatar Aug 16 '22 16:08 rkennke

Strictly speaking, I believe the conditions check for the (weaker) balanced property, but not for the (stronger) structured property.

I know but the text says:

  • "every exit on a given monitor matches a preceding entry on that monitor."
  • "implementations of the Java Virtual Machine are permitted but not required to enforce both of the following two rules guaranteeing structured locking"

I read this as: if the rules do not guarantee structured locking, then the rules are not correct. The VM is allowed to enforce it. But that's just my take on it.

EDIT: Maybe I'm reading too much into it. Maybe lock A,B then unlock A,B is considered structured locking?

But then again what if:

void foo_lock() {
  monitorenter(A);
  monitorenter(B);
  // If VM abruptly returns here
  // VM can unlock them in reverse order first B and then A ?
  monitorexit(A);
  monitorexit(B);
}

robehn avatar Aug 17 '22 07:08 robehn

Do you think there would be any chance to clarify the spec there? Or even outright disallow unstructured/not-properly-nested locking altogether (and maybe allow the verifier to check it)? That would certainly be the right thing to do. And, afaict, it would do no harm, because no compiler of any language would ever emit unstructured locking anyway; if it did, the resulting code would crawl along interpreted-only.

rkennke avatar Aug 17 '22 15:08 rkennke