jdk icon indicating copy to clipboard operation
jdk copied to clipboard

8366659: ObjectMonitor::wait() can deadlock with a suspension request

Open toxaart opened this issue 4 months ago • 33 comments

Hi, please consider the following changes:

If suspension is allowed when a thread is re-entering an object monitor (OM), then a deadlock is possible:

The waiting thread is made to be a successor and is unparked. Upon a suspension request, the thread will suspend itself whilst clearing the successor. The OM will be left unlocked (not grabbed by any thread), while the other threads are parked until a thread grabs the OM and the exits it. The suspended thread is on the entry-list and can be selected as a successor again. None of other threads can be woken up to grab the OM until the suspended thread has been resumed and successfully releases the OM.

This can happen in two places where the successor could be suspended: 1: https://github.com/openjdk/jdk/blob/6322aaba63b235cb6c73d23a932210af318404ec/src/hotspot/share/runtime/objectMonitor.cpp#L1897

2: https://github.com/openjdk/jdk/blob/6322aaba63b235cb6c73d23a932210af318404ec/src/hotspot/share/runtime/objectMonitor.cpp#L1149

The issues are addressed by not allowing suspension in case 1, and by handling the suspension request at a later stage, after the thread has grabbed the OM in reenter_internal() in case 2. In case of a suspension request, the thread exits the OM and enters it again once resumed.

The JVMTI waited event posting (2nd one) is postponed until the suspended thread is resumed and has entered the OM again. The enter to the OM (in case ExitOnSuspend did exit) is done without posting any events.

Tests are added for both scenarios.

Tested in tiers 1 - 7.


Progress

  • [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue

Issue

  • JDK-8366659: ObjectMonitor::wait() can deadlock with a suspension request (Bug - P4)

Reviewers

Contributors

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/27040/head:pull/27040
$ git checkout pull/27040

Update a local copy of the PR:
$ git checkout pull/27040
$ git pull https://git.openjdk.org/jdk.git pull/27040/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 27040

View PR using the GUI difftool:
$ git pr show -t 27040

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/27040.diff

Using Webrev

Link to Webrev Comment

toxaart avatar Sep 02 '25 08:09 toxaart

:wave: Welcome back toxaart! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Sep 02 '25 08:09 bridgekeeper[bot]

@toxaart This change is no longer ready for integration - check the PR body for details.

openjdk[bot] avatar Sep 02 '25 08:09 openjdk[bot]

@toxaart The following labels will be automatically applied to this pull request:

  • hotspot-runtime
  • serviceability

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Sep 02 '25 08:09 openjdk[bot]

/contributor add @pchilano

toxaart avatar Sep 02 '25 08:09 toxaart

@toxaart Contributor Patricio Chilano Mateo <[email protected]> successfully added.

openjdk[bot] avatar Sep 02 '25 08:09 openjdk[bot]

The transaction diagram in SuspendWithObjectMonitorWait.java on L56 -> L77 is for the doWork1 test so the comment should be modifed to make that clear by adding this above L56:

//
// doWork1 algorithm:

dcubed-ojdk avatar Nov 13 '25 20:11 dcubed-ojdk

I've created a transaction diagram for doWork2:

//
// doWork2 algorithm:
//
// main               waiter              resumer
// =================  ==================  ===================
// launch waiter
// <launch returns>   waiter running
// launch resumer     enter threadLock
// <launch returns>   threadLock.wait()   resumer running
// enter threadLock   :                   wait for notify
// threadLock.notify  wait finishes       :
// :                  reenter blocks      :
// suspend waiter     <suspended>         :
// <ready to test>    :                   :
// :                  :                   :
// notify resumer     :                   wait finishes
// delay 1-second     :                   :
// exit threadLock    :                   :
// join resumer       :                   enter threadLock
// :                  <resumed>           resume waiter
// :                  :                   exit threadLock
// :                  reenter threadLock  :
// <join returns>     :                   resumer exits
// join waiter        :
// <join returns>     waiter exits
//
// Note: The sleep(1-second) in main along with the delayed exit
//       of threadLock in main forces the resumer thread to reach
//       "enter threadLock" and block. This difference from doWork1
//       forces the resumer thread to be contending for threadLock
//       while the waiter thread is in threadLock.wait() increasing
//       stress on the monitor sub-system.
//

dcubed-ojdk avatar Nov 13 '25 20:11 dcubed-ojdk

I've created a transaction diagram for doWork3:

//
// doWork3 algorithm:
//
// main                 waiter                  resumer
// ===================  ======================  ===================
// launch waiter
// <launch returns>     waiter running
// launch resumer       enter threadLock
// <launch returns>     while !READY_TO_NOTIFY  resumer running
// delay 1-second         threadLock.wait(1)    wait for notify
// enter threadLock     :                       :
// set READY_TO_NOTIFY  :
// threadLock.notify    wait finishes           :
// :                    reenter blocks          :
// suspend waiter       <suspended>             :
// <ready to test>      :                       :
// :                    :                       :
// notify resumer       :                       wait finishes
// delay 1-second       :                       :
// exit threadLock      :                       :
// join resumer         :                       enter threadLock
// :                    <resumed>               resume waiter
// :                    :                       exit threadLock
// :                    reenter threadLock      :
// <join returns>       :                       resumer exits
// join waiter          :
// <join returns>       waiter exits
//
// Note: The sleep(1-second) in main along with the delayed exit
//       of threadLock in main forces the resumer thread to reach
//       "enter threadLock" and block. This difference from doWork1
//       forces the resumer thread to be contending for threadLock
//       while the waiter thread is in the threadLock.wait(1) tight
//       loop increasing stress on the monitor sub-system.
//
// Note: The first sleep(1-second) in main and the wait(1) in the waiter
//       thread allows the waiter thread to loop tightly here:
//         while !READY_TO_NOTIFY
//           threadLock.wait(1)
//

dcubed-ojdk avatar Nov 13 '25 20:11 dcubed-ojdk

So the bug report talks about two different deadlocks and we have added two new test cases to SuspendWithObjectMonitorWait.java.

I think the new doWork2 test case is added to catch deadlock-1 when we have a suspended thread made the successor over and over again so that none of the other contending threads ever get the monitor even though it is unlocked.

I think the new doWork3 test case is added to catch deadlock-2 where the waiting thread manages to re-enter the monitor and then gets suspended while on its way back to Java.

@toxaart and/or @pchilano - Please verify my understanding of this mapping from the two new test cases to the two deadlocks. Thanks!

dcubed-ojdk avatar Nov 13 '25 20:11 dcubed-ojdk

So the bug report talks about two different deadlocks and we have added two new test cases to SuspendWithObjectMonitorWait.java.

I think the new doWork2 test case is added to catch deadlock-1 when we have a suspended thread made the successor over and over again so that none of the other contending threads ever get the monitor even though it is unlocked.

I think the new doWork3 test case is added to catch deadlock-2 where the waiting thread manages to re-enter the monitor and then gets suspended while on its way back to Java.

@toxaart and/or @pchilano - Please verify my understanding of this mapping from the two new test cases to the two deadlocks. Thanks!

The PR description of the cases isn’t quite accurate. The explanation in 1. applies to both deadlocks, the only difference is where the successor could be suspended: [1] https://github.com/openjdk/jdk/blob/6322aaba63b235cb6c73d23a932210af318404ec/src/hotspot/share/runtime/objectMonitor.cpp#L1897 (exercised by doWork2) [2] https://github.com/openjdk/jdk/blob/6322aaba63b235cb6c73d23a932210af318404ec/src/hotspot/share/runtime/objectMonitor.cpp#L1149 (exercised by doWork3)

pchilano avatar Nov 13 '25 21:11 pchilano

Thanks @dcubed-ojdk and @pchilano, I addressed all the points. The diagrams are now added to the test. The PR description is updated as well.

toxaart avatar Nov 14 '25 11:11 toxaart

/contributor add @dcubed-ojdk

toxaart avatar Nov 17 '25 13:11 toxaart

@toxaart Contributor Daniel D. Daugherty <[email protected]> successfully added.

openjdk[bot] avatar Nov 17 '25 13:11 openjdk[bot]

@sspitsyn the proposed changes do have an effect on the order of jvmti events posted. For instance, in the timed-out case:

Master branch, events posted: wait -> waited -> contended enter -> contended entered This PR: events posted: wait -> contended enter -> contended entered -> waited

I have checked the specs, and it is not really clear: is (re)entering the OM considered to be a part of wait or not?

toxaart avatar Nov 18 '25 14:11 toxaart

Thank you for refactoring the test into several tests which share a common part. It is nice! May I ask you about more refactoring? I'll inline my comments with the refactoring suggestions.

sspitsyn avatar Nov 19 '25 04:11 sspitsyn

@sspitsyn the proposed changes do have an effect on the order of jvmti events posted. For instance, in the timed-out case:

Master branch, events posted: wait -> waited -> contended enter -> contended entered This PR: events posted: wait -> contended enter -> contended entered -> waited

I have checked the specs, and it is not really clear: is (re)entering the OM considered to be a part of wait or not?

@sspitsyn could you please comment on the order of events and possible issues we can get if the order is changed?

toxaart avatar Nov 19 '25 15:11 toxaart

@sspitsyn the proposed changes do have an effect on the order of jvmti events posted. For instance, in the timed-out case: Master branch, events posted: wait -> waited -> contended enter -> contended entered This PR: events posted: wait -> contended enter -> contended entered -> waited

I have checked the specs, and it is not really clear: is (re)entering the OM considered to be a part of wait or not?

I also do not see anything in the specs (JLS, JVMS and JVMTI) but I feel it is better to keep the original order if possible. It is going to be an impact/change on the thread state: https://docs.oracle.com/en/java/javase/25/docs/specs/jvmti.html#GetThreadState

AFAIU, with this change new combination of thread state bits will be observable: JVMTI_THREAD_STATE_WAITING + JVMTI_THREAD_STATE_BLOCKED_ON_MONITOR_ENTER + (JVMTI_THREAD_STATE_WAITING_INDEFINITELY | JVMTI_THREAD_STATE_WAITING_WITH_TIMEOUT)

Also, it is a change in behavior with a compatibility risk involved. So, a CSR will be needed for this change. @dholmes-ora may have some concerns here.

sspitsyn avatar Nov 19 '25 19:11 sspitsyn

@lmesnik are the JVMTI thread state transitions performed in the event posting code? If so the different order, and thus different states, would be a concern. That said we have noticed that only the timeout case seems to operate in a way that the monitor reentry can post contended_enter and contended_entered events, which seems very odd in itself. Though I will also note that the way the suspension code drops and then re-acquires the monitor, any contended_enter* events could get posted multiple times which would also be surprising.

Maintaining the event order is problematic as, with the fix to the deadlock issue, we only allow suspension during reentry, and that would mean the monitor_waited event would be posted whilst the thread is still suspended. To be clear(er) in the old code we would do:

wait -> suspend if needed -> resumed -> post monitor_waited -> renter monitor

whereas the new code, if we kept the placement of the event, would do

wait -> post monitor_waited -> reenter monitor -> suspend if needed and drop monitor -> resumed -> reenter monitor

and what the proposed code actually does is

wait -> reenter monitor -> suspend if needed and drop monitor -> resumed -> reenter monitor -> post monitor_waited 

Given the lack of specificity in the JVM TI spec around these different events I think any assumptions in a TCK test could be challenged, but it would still be a change in behaviour.

dholmes-ora avatar Nov 20 '25 05:11 dholmes-ora

@lmesnik are the JVMTI thread state transitions performed in the event posting code? If so the different order, and thus different states, would be a concern. That said we have noticed that only the timeout case seems to operate in a way that the monitor reentry can post contended_enter and contended_entered events, which seems very odd in itself. Though I will also note that the way the suspension code drops and then re-acquires the monitor, any contended_enter* events could get posted multiple times which would also be surprising.

I think this comment was supposed to be addressed to @sspitsyn

toxaart avatar Nov 20 '25 09:11 toxaart

/help

toxaart avatar Nov 20 '25 12:11 toxaart

/touch

toxaart avatar Nov 20 '25 12:11 toxaart

@toxaart Available commands:

  • approval - request for maintainer's approval
  • approve - null
  • author - sets an overriding author to be used in the commit when the PR is integrated
  • backport - create a backport
  • cc - add or remove an additional classification label
  • clean - Mark the backport pull request as a clean backport
  • contributor - adds or removes additional contributors for a PR
  • covered - used when employer has signed the OCA
  • csr - require a compatibility and specification request (CSR) for this pull request
  • help - shows this text
  • integrate - performs integration of the changes in the PR
  • issue - edit the list of issues that this PR solves
  • jep - require a JDK Enhancement Proposal (JEP) for this pull request
  • keepalive - Re-evaluates the pull request and resets the inactivity timeout.
  • label - add or remove an additional classification label
  • open - Set the pull request state to "open"
  • reviewer - manage additional reviewers for a PR
  • reviewers - set the number of additional required reviewers for this PR
  • signed - used after signing the OCA
  • solves - edit the list of issues that this PR solves
  • sponsor - performs integration of a PR that is authored by a non-committer
  • summary - updates the summary in the commit message
  • test - used to run tests
  • touch - Re-evaluates the pull request and resets the inactivity timeout.

openjdk[bot] avatar Nov 20 '25 13:11 openjdk[bot]

@toxaart The pull request is being re-evaluated and the inactivity timeout has been reset.

openjdk[bot] avatar Nov 20 '25 13:11 openjdk[bot]

Maintaining the event order is problematic as, with the fix to the deadlock issue, we only allow suspension during reentry, and that would mean the monitor_waited event would be posted whilst the thread is still suspended. To be clear(er) in the old code we would do:

wait -> suspend if needed -> resumed -> post monitor_waited -> renter monitor

whereas the new code, if we kept the placement of the event, would do

wait -> post monitor_waited -> reenter monitor -> suspend if needed and drop monitor -> resumed -> reenter monitor

and what the proposed code actually does is

wait -> reenter monitor -> suspend if needed and drop monitor -> resumed -> reenter monitor -> post monitor_waited 

Given the lack of specificity in the JVM TI spec around these different events I think any assumptions in a TCK test could be challenged, but it would still be a change in behaviour.

@dholmes-ora I have done a bit of refactoring and introduced a private method post_waited_event(), which we can call in different places depending on the scenario.

For instance, for the timed-out case, the sequence of events is now unchanged, i.e.: wait -> waited -> contended enter -> contended entered

For the notified case, using your extended notation, it is now behaving like this: wait -> reenter monitor -> suspend if needed and drop monitor -> resumed -> post monitor_waited -> reenter monitor

I think this is what we eventually want. @sspitsyn could you confirm/refute that it does not change thread state bits? The JVMTI tests pass, the extended testing is in progress.

toxaart avatar Nov 20 '25 15:11 toxaart

@dholmes-ora David, thank you for your input!

@dholmes-ora I have done a bit of refactoring and introduced a private method post_waited_event(), which we can call in different places depending on the scenario.

I like the refactoring.

For instance, for the timed-out case, the sequence of events is now unchanged, i.e.: wait -> waited -> contended enter -> contended entered

For the notified case, using your extended notation, it is now behaving like this: wait -> reenter monitor -> suspend if needed and drop monitor -> resumed -> post monitor_waited -> reenter monitor

My understanding is that we need both timed-out and notified cases to do the same as David suggested above: wait -> post monitor_waited -> reenter monitor -> suspend if needed and drop monitor -> resumed -> reenter monitor It would give us two advantages:

  • timed-out and notified cases are unified
  • avoid incompatibility risk, e.g.:
    • the object monitor events order is kept same as was before (is it true?)

Is there a problem to make it unified? I'm just thinking if there is a way to fix this in a better way. Please, let me think on this a little bit more.

I think this is what we eventually want. @sspitsyn could you confirm/refute that it does not change thread state bits?

I'm looking at the OM code. It looks like I can be wrong that it introduces new combinations of thread state bits. in your new tests we could print the thread states returned by JVMTI GetThreadState called from MonitorContendedEnter and MonitorWaited callbacks. It is interesting to ensure the state bits with your fix are the same as before.

sspitsyn avatar Nov 20 '25 23:11 sspitsyn

@lmesnik are the JVMTI thread state transitions performed in the event posting code? If so the different order, and thus different states, would be a concern.

Yes, JVMTI event posting code performs thread state transitions. But I was talking about the JVMTI thread state bits specified here: https://docs.oracle.com/en/java/javase/25/docs/specs/jvmti.html#GetThreadState

That said we have noticed that only the timeout case seems to operate in a way that the monitor reentry can post contended_enter and contended_entered events, which seems very odd in itself.

Right. We have this oddity.

Though I will also note that the way the suspension code drops and then re-acquires the monitor, any contended_enter* events could get posted multiple times which would also be surprising.

Right. There have to be some ways to fix it. Most likely we already have a bug against this issue.

Maintaining the event order is problematic as, with the fix to the deadlock issue, we only allow suspension during reentry, and that would mean the monitor_waited event would be posted whilst the thread is still suspended.

Sorry, I'm not sure I fully understand this: "would mean the monitor_waited event would be posted whilst the thread is still suspended." I understand the problematic part in general. The issue is that the post_monitor_waited() has a suspend point in a back transition from Native to Java.

sspitsyn avatar Nov 21 '25 00:11 sspitsyn

I'm reading this bug description in JBS:

If suspension is allowed when a thread is re-entering an object monitor (OM), then a deadlock is possible in two cases:

  1. The waiting thread is made to be a successor and is unparked. Upon a suspension request, the thread will suspend itself whilst clearing the successor. The OM will be left unlocked (not grabbed by any thread), while the other threads are parked until a thread grabs the OM and then exits it. The suspended thread is on the entry-list and can be selected as a successor again. None of other threads can be woken up to grab the OM until the suspended thread has been resumed and successfully releases the OM.

  2. The race between suspension and retry: the thread could reacquire the OM and complete the wait() code in full, but then on return to Java it will be suspended while holding the OM.

I'm not sure the described scenarios above are real deadlocks. The debugger has to be careful with threads suspension and can behave differently. For instance, it can suspend all threads, so no thread can progress with entering OM's. We do not consider it to be a deadlock, right? In a case just a single thread is being suspended by the debugger, other threads can depend on the suspended thread. But we do not consider it to be a deadlock. It is because everything should come to normal after the debugger resumes the suspended thread. So, could you, please, explain a little bit more on the deadlocks above? How the described in the bug scenarios are different from just listed ones? Do I miss anything?

sspitsyn avatar Nov 21 '25 07:11 sspitsyn

I'm looking at the two new tests scenarios. They look non-real and very artificial because the application plays a role of the debugger at the same time and constructs the artificial deadlocks. The threads sync on an application lock, one of them suspends and another resumes the target thread. These deadlocks are impossible to have with any normal debugger which is decoupled from application it is debugging. Sorry for saying it but it feels like there is no sense to fix this artificial issue.

sspitsyn avatar Nov 21 '25 07:11 sspitsyn

Thanks @sspitsyn for diving into the issue.

With that definition of the deadlock and suspension logic I agree that it might not be a problem at all. With this being said, is the existing test SuspendWithObjectMonitorWait demonstrating a real-world scenario? @dcubed-ojdk, what do you think?

toxaart avatar Nov 21 '25 09:11 toxaart