openj9 icon indicating copy to clipboard operation
openj9 copied to clipboard

Timedout at Continuation.yield() - java/lang/Thread/virtual/ThreadAPI.java

Open JasonFengJ9 opened this issue 5 months ago • 3 comments

Copied from https://github.com/eclipse-openj9/openj9/issues/21683#issuecomment-2916412694

https://openj9-jenkins.osuosl.org/job/Test_openjdk24_j9_sanity.openjdk_s390x_linux_Nightly_testList_2/86/consoleFull

00:50:20  Running test jdk_lang_j9_1 ...

00:50:21  variation: -Xdump:system:none -Xdump:heap:none -Xdump:system:events=gpf+abort+traceassert+corruptcache -XX:-JITServerTechPreviewMessage Mode501 -XXgc:fvtest_forceCopyForwardHybridMarkCompactRatio=10
00:50:21  JVM_OPTIONS:  -Xdump:system:none -Xdump:heap:none -Xdump:system:events=gpf+abort+traceassert+corruptcache -XX:-JITServerTechPreviewMessage -Xjit -Xgcpolicy:balanced -Xnocompressedrefs -XXgc:fvtest_forceCopyForwardHybridMarkCompactRatio=10 -Xverbosegclog 

01:38:42  TEST: java/lang/Thread/virtual/Parking.java#Xint

01:38:42  TEST RESULT: Error. Program `/home/jenkins/workspace/Test_openjdk24_j9_sanity.openjdk_s390x_linux_Nightly_testList_2/jdkbinary/j2sdk-image/bin/java' timed out (timeout set to 960000ms, elapsed time including timeout handling was 986157ms).
01:38:42  --------------------------------------------------
01:38:42  Test results: passed: 925; error: 1
01:38:49  jdk_lang_j9_1_FAILED

50x internal grinder jdk_lang_j9 s390x_linux - still in progress, the timeout wasn't reproduced yet

https://openj9-jenkins.osuosl.org/job/Test_openjdk24_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/92/consoleFull

03:25:09  Running test jdk_lang_0 ...

03:25:09  variation: -Xdump:system:none -Xdump:heap:none -Xdump:system:events=gpf+abort+traceassert+corruptcache -XX:-JITServerTechPreviewMessage Mode150
03:25:09  JVM_OPTIONS:  -Xdump:system:none -Xdump:heap:none -Xdump:system:events=gpf+abort+traceassert+corruptcache -XX:-JITServerTechPreviewMessage -XX:+UseCompressedOops -Xverbosegclog 

04:34:03  TEST: java/lang/Thread/virtual/ThreadAPI.java#default

04:34:03  TEST RESULT: Error. Program `/home/jenkins/workspace/Test_openjdk24_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/jdkbinary/j2sdk-image/bin/java' timed out (timeout set to 960000ms, elapsed time including timeout handling was 1023791ms).
04:34:03  --------------------------------------------------
04:46:56  Test results: passed: 919; error: 1

04:47:00  jdk_lang_0_FAILED

50x internal grinder jdk_lang ppc64_aix - still in progress, multiple timeouts have been reproduced

Copied from https://github.com/eclipse-openj9/openj9/issues/21683#issuecomment-2959494567

I don't think the Parking test is related, the test is stuck yielding, not waiting on a monitor.

!continuationstack 0x000003fe80005b60
<3fe80005b60> 	!j9method 0x000003FF98322220   jdk/internal/vm/Continuation.yieldImpl(Z)Z
<3fe80005b60> 	!j9method 0x000003FF983220C0   jdk/internal/vm/Continuation.yield0()Z
<3fe80005b60> 	!j9method 0x000003FF983220A0   jdk/internal/vm/Continuation.yield(Ljdk/internal/vm/ContinuationScope;)Z
<3fe80005b60> 	!j9method 0x000003FF98311B90   java/lang/VirtualThread.yieldContinuation()Z
<3fe80005b60> 	!j9method 0x000003FF98311C70   java/lang/VirtualThread.park()V
<3fe80005b60> 	!j9method 0x000003FF98337E68   java/lang/Access.parkVirtualThread()V
<3fe80005b60> 	!j9method 0x000003FF9830B7C8   java/util/concurrent/locks/LockSupport.park()V
<3fe80005b60> 	!j9method 0x000003FF985DA668   Parking.lambda$testPark7$0()V
<3fe80005b60> 	!j9method 0x000003FE84470138   Parking$$Lambda/0x000000008443ebb8.run()V
<3fe80005b60> 	!j9method 0x000003FF982F38F0   java/lang/Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V
<3fe80005b60> 	!j9method 0x000003FF98311B30   java/lang/VirtualThread.run(Ljava/lang/Runnable;)V
<3fe80005b60> 	!j9method 0x000003FE8441A050   java/lang/VirtualThread$VThreadContinuation$1.run()V
<3fe80005b60> 	!j9method 0x000003FF98322060   jdk/internal/vm/Continuation.enter(Ljdk/internal/vm/Continuation;)V
<3fe80005b60> 	                        JNI call-in frame
<3fe80005b60> 	                        Native method frame

The ThreadAPI Timeout looks quite similar and may be related. The only difference I see is that the last waiting vthread is in a different state than CancelTimerWithContention:

!vthreads
!continuationstack 0x0000010023bf4330 !j9vmcontinuation 0x0000010023bf4330 !j9object 0x00000000FFEE7960 (Continuation) !j9object 0x00000000FFEE7898 (VThread) -
!j9object 0x00000000FFEE7898
...
I state = 0x0000000D (offset = 112) (java/lang/VirtualThread)
Z parkPermit = 0x00000001 (offset = 160) (java/lang/VirtualThread)
Z blockPermit = 0x00000000 (offset = 164) (java/lang/VirtualThread)

JasonFengJ9 avatar Jun 10 '25 15:06 JasonFengJ9

java/lang/Thread/virtual/Parking.java#Xint timeout

The failing job was too old and had already been deleted.

I ran a 200x grinder for Parking.java#Xint on zLinux: https://openj9-jenkins.osuosl.org/job/Grinder/4395

The timeout in Parking.java#Xint was not reproducible with the latest nightly build.

babsingh avatar Jun 12 '25 17:06 babsingh

The timeout in Parking.java#Xint was not reproducible with the latest nightly build.

java/lang/Thread/virtual/Parking.java#Xint timeout wasn't seen in recent builds. It wasn't excluded.

Updated the issue title.

JasonFengJ9 avatar Jun 12 '25 17:06 JasonFengJ9

java/lang/Thread/virtual/ThreadAPI.java#default timeout

There are two issues.

=======

Issue 1

We likely missed a code path in VirtualThread.afterYield() due to a race between the thread transitioning to a WAIT state and lock.notify() being called:

if (notified && compareAndSetState(newState, BLOCKED)) {
    if (blockPermit && compareAndSetState(BLOCKED, UNBLOCKED)) {
        submitRunContinuation();
    }
    return;
}

The issue happens when notifyVirtualThread() is called while the virtual thread is still in TIMED_WAITING. In this case:

  • The thread is removed from waitingContinuations and added to blockedContinuations,
  • But its state is never set to BLOCKED.

Later calls to notifyVirtualThread() won’t find the thread in waitingContinuations, so:

  • The thread’s state remains unchanged,
  • omrthread_monitor_notify(vm->blockedVirtualThreadsMutex) isn’t called.

Solution for Issue 1

In takeVirtualThreadListToUnblock(), when processing WAIT or TIMED_WAIT threads that were notified, we should set their state to BLOCKED and add them to the unblockedList.

In the test case,

synchronized (lock) {
    lock.notify();  // <-- Will do nothing since the vthread has been removed from the waitingContinuations list
} <-- But, objectMonitorExit will notify blockedVirtualThreadsMutex since virtualThreadWaitCount > 0

=======

Issue 2

There’s a potential race condition with VirtualThread.afterYield():

  • afterYield() may update the state through the sequence: TIMED_WAITING → TIMED_WAIT → BLOCKED → UNBLOCKED
  • If notifyVirtualThread() reads a stale state (TIMED_WAIT) while afterYield() has already moved the thread to UNBLOCKED,
  • Then notifyVirtualThread() may incorrectly overwrite the state: UNBLOCKED → BLOCKED

The virtual thread would never resume execution, and this would lead to a hang.

Solution for Issue 2

We should be using a compare-and-swap (CAS) when updating the virtual thread’s state to BLOCKED in notifyVirtualThread() to avoid races:

U_32 state = J9VMJAVALANGVIRTUALTHREAD_STATE(vmThread, vthread);
if ((JAVA_LANG_VIRTUALTHREAD_WAIT == state) || (JAVA_LANG_VIRTUALTHREAD_TIMED_WAIT == state)) {
    // Use CAS to update state to BLOCKED
    J9VMJAVALANGVIRTUALTHREAD_SET_STATE(vmThread, vthread, JAVA_LANG_VIRTUALTHREAD_BLOCKED);
    // CAS ensures state is only changed if still in WAIT or TIMED_WAIT
}

=======

@theresa-m Since you're trying the solution for issue 1 to resolve #21683, could you also try the solution for issue 2?

babsingh avatar Jun 12 '25 18:06 babsingh

I tested the proposed solution for the second issue and the timeout still occurs. @fengxue-IS can you double check if this looks like it would solve the problem? https://github.com/theresa-m/openj9/tree/threadapi

theresa-m avatar Jun 18 '25 19:06 theresa-m

The code change looks correct to me, do you have the log output to see what is still causing the hang?

fengxue-IS avatar Jun 18 '25 21:06 fengxue-IS

It looks very similar to the failure seen in https://github.com/eclipse-openj9/openj9/issues/21683 except that the state is in BLOCKED. This core failed in testGetState6, the failure is sometimes seen in testGetState7 as well. The running virtual thread is in Object.wait and the MainThread is in Unsafe.park from Thread.join. The virtual thread is in vm->blockedContinuations.

> !vthreads
!continuationstack 0x000001002505b2d0 !j9vmcontinuation 0x000001002505b2d0 !j9object 0x00000000FFE52B60 (Continuation) !j9object 0x00000000FFE52A98 (VThread) - 
!continuationstack 0x0000000000000000 !j9vmcontinuation 0x0000000000000000 !j9object 0x00000000FFE37038 (Continuation) !j9object 0x0000000000000000 (VThread) - <N/A>
!continuationstack 0x0000000000000000 !j9vmcontinuation 0x0000000000000000 !j9object 0x00000000FFE13770 (Continuation) !j9object 0x0000000000000000 (VThread) - <N/A>
!continuationstack 0x0000000000000000 !j9vmcontinuation 0x0000000000000000 !j9object 0x00000000FFDEAF70 (Continuation) !j9object 0x0000000000000000 (VThread) - <N/A>
!continuationstack 0x0000000000000000 !j9vmcontinuation 0x0000000000000000 !j9object 0x00000000FFDF44B0 (Continuation) !j9object 0x00000000FFDF43D0 (VThread) - 
> !j9object 0x00000000FFE52A98 (VThread)
!J9Object 0x00000000FFE52A98 {
	...
	I state = 0x0000000D (offset = 112) (java/lang/VirtualThread) <-- state BLOCKED
	Z parkPermit = 0x00000001 (offset = 160) (java/lang/VirtualThread)
	Z blockPermit = 0x00000000 (offset = 164) (java/lang/VirtualThread) 
	Z onWaitingList = 0x00000000 (offset = 168) (java/lang/VirtualThread)
	Ljava/lang/VirtualThread; next = !fj9object 0x0 (offset = 144) (java/lang/VirtualThread)
	Z notified = 0x00000001 (offset = 172) (java/lang/VirtualThread) <-- notifyVirtualThread was invoked
	B timedWaitSeqNo = 0x00000000 (offset = 176) (java/lang/VirtualThread)
	J timeout = 0x0000000000000000 (offset = 116) (java/lang/VirtualThread)
	...
}

> !j9object 0x00000000FFE52B60 (Continuation)
!J9Object 0x00000000FFE52B60 {
	struct J9Class* clazz = !j9class 0x304E1E00 // java/lang/VirtualThread$VThreadContinuation
	Object flags = 0x00000000;
	I lockword = 0x00000008 (offset = 0) (java/lang/Object) <hidden>
	J vmRef = 0x000001002505B2D0 (offset = 4) (jdk/internal/vm/Continuation)
	Ljava/lang/Thread; vthread = !fj9object 0xffe52a98 (offset = 20) (jdk/internal/vm/Continuation)
	Ljava/lang/Object; blocker = !fj9object 0xe02ceb68 (offset = 24) (jdk/internal/vm/Continuation)
	[Ljava/lang/Object; scopedValueCache = !fj9object 0x0 (offset = 28) (jdk/internal/vm/Continuation)
	Ljdk/internal/vm/ContinuationScope; scope = !fj9object 0xe0237430 (offset = 32) (jdk/internal/vm/Continuation)
	Ljava/lang/Runnable; runnable = !fj9object 0xffe52b98 (offset = 36) (jdk/internal/vm/Continuation)
	Ljdk/internal/vm/Continuation; parent = !fj9object 0x0 (offset = 40) (jdk/internal/vm/Continuation)
	J state = 0x0000000000000001 (offset = 12) (jdk/internal/vm/Continuation)
	Z isAccessible = 0x00000001 (offset = 44) (jdk/internal/vm/Continuation)
	I continuationLink = 0xFFE37038 (offset = 48) (jdk/internal/vm/Continuation) <hidden>
}

> !fj9object 0xe02ceb68
!J9Object 0x00000000E02CEB68 {
	struct J9Class* clazz = !j9class 0x30047700 // java/lang/Object
	Object flags = 0x00000008;
	I lockword = 0x30019D71 (offset = 0) (java/lang/Object) <hidden>
}

> !j9vmcontinuation 0x000001002505b2d0
J9VMContinuation at 0x1002505b2d0 {
  Fields for J9VMContinuation:
	0x0: UDATA* arg0EA = !j9x 0x0000000030606960
	0x8: UDATA* bytecodes = !j9x 0x0000000000000000
	0x10: UDATA* sp = !j9x 0x0000000030606920
	0x18: U8* pc = !j9x 0x0000000000000003
	0x20: struct J9Method* literals = !j9method 0x0000000000000000
	0x28: UDATA* stackOverflowMark = !j9x 0x00000000306022D8
	0x30: UDATA* stackOverflowMark2 = !j9x 0x00000000306022D8
	0x38: struct J9JavaStack* stackObject = !j9javastack 0x00000000306012D0
	0x40: struct J9JITDecompilationInfo* decompilationStack = !j9jitdecompilationinfo 0x0000000000000000
	0x48: UDATA* j2iFrame = !j9x 0x0000000000000000
	0x50: struct J9JITGPRSpillArea jitGPRs = !j9jitgprspillarea 0x000001002505B320
	0x160: struct J9I2JState i2jState = !j9i2jstate 0x000001002505B430
	0x180: struct J9VMEntryLocalStorage* oldEntryLocalStorage = !j9vmentrylocalstorage 0x0000000000000000
	0x188: UDATA dropFlags = 0x0000000000000000 (0)
	0x190: UDATA returnState = 0x0000000000000002 (2)
	0x198: IDATA waitingMonitorEnterCount = 0x0000000000000001 (1)
	0x1a0: UDATA ownedMonitorCount = 0x0000000000000000 (0)
	0x1a8: struct J9Pool* monitorEnterRecordPool = !j9pool 0x00000100252B6870
	0x1b0: struct J9MonitorEnterRecord* monitorEnterRecords = !j9monitorenterrecord 0x00000100252C1460
	0x1b8: struct J9MonitorEnterRecord* jniMonitorEnterRecords = !j9monitorenterrecord 0x0000000000000000
	0x1c0: struct J9Object* vthread = !j9object 0x00000000FFE52A98 // java/lang/VirtualThread
	0x1c8: struct J9VMContinuation* nextWaitingContinuation = !j9vmcontinuation 0x0000000000000000
	0x1d0: struct J9ObjectMonitor* objectWaitMonitor = !j9objectmonitor 0x0000000030019D70
	0x1d8: struct J9ObjectMonitor* enteredMonitors = !j9objectmonitor 0x0000000000000000
	0x1e0: UDATA runtimeFlags = 0x0000000000000000 (0)
	0x1e8: I64 startTicks = 0x002583A11EFAB4C7 (10559302171014343)
	0x1f0: struct J9VMThread* previousOwner = !j9vmthread 0x0000000000000000
}

> !j9objectmonitor 0x0000000030019D70
J9ObjectMonitor at 0x30019d70 {
  Fields for J9ObjectMonitor:
	0x0: struct J9ThreadMonitor* monitor = !j9threadmonitor 0x0000010023DB5FC8
	0x8: UDATA proDeflationCount = 0x0000000000000015 (21)
	0x10: UDATA antiDeflationCount = 0x0000000000000001 (1)
	0x18: j9objectmonitor_t alternateLockword = !j9x 0x30019d88
	0x20: U32 hash = 0xA9AA5132 (2846511410)
	0x24: volatile U32 virtualThreadWaitCount = 0x00000001 (1)
	0x28: volatile U32 platformThreadWaitCount = 0x00000000 (0)
	0x30: struct J9VMContinuation* ownerContinuation = !j9vmcontinuation 0x0000000000000000
	0x38: struct J9VMContinuation* waitingContinuations = !j9vmcontinuation 0x0000000000000000
	0x40: struct J9ObjectMonitor* next = !j9objectmonitor 0x0000000000000000
}

theresa-m avatar Jun 19 '25 14:06 theresa-m

Ran a 200x grinder on AIX: https://openj9-jenkins.osuosl.org/job/Grinder/4424. Also verified the test locally 150x on my xLinux machine. The timeout is no longer reproducible.

Opened https://github.com/adoptium/aqa-tests/pull/6387 to re-enable the test.

babsingh avatar Jul 04 '25 20:07 babsingh

Closing as https://github.com/adoptium/aqa-tests/pull/6387 has been merged to re-enable the test.

babsingh avatar Jul 04 '25 21:07 babsingh