kotlinx.coroutines
Crash on GraalVM at `1.9.0-RC`
Describe the bug
When building a GraalVM native image against coroutines 1.9.0-RC, things mostly work, but the following crash occurs under some conditions. In our case it happens when using Mosaic and the new Kotlin-built-in Compose compiler:
java.lang.ClassCastException
at java.base@23/java.util.concurrent.atomic.AtomicReferenceFieldUpdater$AtomicReferenceFieldUpdaterImpl.throwAccessCheckException(AtomicReferenceFieldUpdater.java:418)
at java.base@23/java.util.concurrent.atomic.AtomicReferenceFieldUpdater$AtomicReferenceFieldUpdaterImpl.accessCheck(AtomicReferenceFieldUpdater.java:409)
at java.base@23/java.util.concurrent.atomic.AtomicReferenceFieldUpdater$AtomicReferenceFieldUpdaterImpl.get(AtomicReferenceFieldUpdater.java:466)
at kotlinx.coroutines.CancellableContinuationImpl.getParentHandle(CancellableContinuationImpl.kt:103)
at kotlinx.coroutines.CancellableContinuationImpl.detachChild$kotlinx_coroutines_core(CancellableContinuationImpl.kt:569)
at kotlinx.coroutines.CancellableContinuationImpl.detachChildIfNonResuable(CancellableContinuationImpl.kt:562)
at kotlinx.coroutines.CancellableContinuationImpl.resumeImpl$kotlinx_coroutines_core(CancellableContinuationImpl.kt:503)
at kotlinx.coroutines.CancellableContinuationImpl.resumeImpl$kotlinx_coroutines_core$default(CancellableContinuationImpl.kt:493)
at kotlinx.coroutines.CancellableContinuationImpl.resumeUndispatched(CancellableContinuationImpl.kt:596)
at kotlinx.coroutines.EventLoopImplBase$DelayedResumeTask.run(EventLoop.common.kt:497)
at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:263)
at kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:95)
at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:69)
at kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source)
at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:47)
at kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source)
at com.jakewharton.mosaic.BlockingKt.runMosaicBlocking(blocking.kt:6)
at elide.tool.cli.AbstractToolCommand.call(AbstractToolCommand.kt:221)
at elide.tool.cli.AbstractToolCommand.call(AbstractToolCommand.kt:29)
at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
at picocli.CommandLine.access$1500(CommandLine.java:148)
at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
at picocli.CommandLine.execute(CommandLine.java:2174)
at elide.tool.cli.ElideTool$Companion.exec$runtime(ElideTool.kt:232)
at elide.tool.cli.ElideTool$Companion.main(ElideTool.kt:195)
at elide.tool.cli.ElideTool.main(ElideTool.kt)
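For context, here is a minimal sketch of the one situation in which the JDK raises a `ClassCastException` from `AtomicReferenceFieldUpdater.get` on a regular JVM: the receiver is not an instance of the class the updater was created for. `Holder`, `Unrelated`, `handle`, and `readThroughUpdater` are hypothetical names for illustration and are not taken from kotlinx.coroutines or atomicfu. Whether the native image hits the same check for the same reason is not established here.

```kotlin
import java.util.concurrent.atomic.AtomicReferenceFieldUpdater

// Hypothetical stand-in for a class with a volatile field accessed through an updater.
class Holder {
    @Volatile
    @JvmField
    var handle: Any? = null
}

// Hypothetical unrelated class, used only to fail the updater's receiver check.
class Unrelated

// Reads the field through the updater and reports which exception, if any, was thrown.
fun readThroughUpdater(receiver: Any): String {
    val updater = AtomicReferenceFieldUpdater.newUpdater(Holder::class.java, Any::class.java, "handle")
    // Erase the receiver type so a non-Holder object can be passed, as raw Java code could.
    @Suppress("UNCHECKED_CAST")
    val erased = updater as AtomicReferenceFieldUpdater<Any, Any?>
    return try {
        erased.get(receiver)
        "none"
    } catch (e: ClassCastException) {
        "ClassCastException"
    }
}

fun main() {
    println(readThroughUpdater(Holder()))    // prints none
    println(readThroughUpdater(Unrelated())) // prints ClassCastException
}
```

In the trace above the receiver is the `CancellableContinuationImpl` instance itself, so on a regular JVM this check would be expected to pass.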
Provide a Reproducer
See here for a partner issue with detailed tracing. A reproducer is available.
@sgammon could you please provide an instruction on how to run a reproducer?
@fzhinkin Yes, I have a reproducer. Here is one that works with a very simple native image build. It includes Coroutines 1.9.0-RC and Mosaic, where I have filed a related issue.
It seems like most of the stacktraces I can produce point to coroutines and/or atomicfu. I have much more detailed tracing in the following issues:
- Tracking issue (main, on our codebase): elide-develide#972
- Mosaic: JakeWharton/mosaic#386
- (Pending: GraalVM)
Reproducer: coroutines-crash-reproducer-972.zip
I may need to file a bug with Micronaut as well if it relates to their code, but so far I don't see any hint that it does. I'm using the Micronaut project generator because it gets really close to our dependencies anyway (ours is a Micronaut/Picocli command line application).
There seem to be multiple exceptions or crashes, or maybe multiple bugs interplaying, so if I come up with other reproducers I will post them here and on the tracking ticket.
As far as I can tell, though, most hints point to either coroutines or atomicfu. Specifically, the exception shown above surfaces when the image is built with -O2. With -Ob, the exception goes away. So maybe this is related to some optimization GraalVM is doing, but the class cast exception could still be valid; I don't know.
@sgammon, thank you for sharing the reproducer!
I managed to reproduce the crash locally on macos-aarch64 with Graal EE 21, 22, and 23. However, on Linux (both aarch64 and x86_64), the issue did not show up.
For me, the issue also went away after switching the optimization level to -O1.
As a side note, the commands to run the reproducer are (taken from the Graal GitHub issue):
./gradlew nativeCompile
./build/native/nativeCompile/demo
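To compare the optimization levels discussed above, the level can be passed to `native-image` through the build arguments. The fragment below is a sketch that assumes the reproducer uses the GraalVM Native Build Tools Gradle plugin with a Kotlin DSL build file and the default `main` binary; the reproducer's actual build script may set this differently.

```kotlin
// build.gradle.kts (fragment): assumes the org.graalvm.buildtools.native plugin is applied.
graalvmNative {
    binaries {
        named("main") {
            // "-O1" and "-Ob" reportedly avoid the crash; "-O2" is the level that reproduces it.
            buildArgs.add("-O1")
        }
    }
}
```

After changing the flag, rebuild with `./gradlew nativeCompile` and run the binary again.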
For the record, the stack trace I ended up with is:
Stacktrace for the failing thread 0x0000000126e05140 (A=AOT compiled, J=JIT compiled, D=deoptimized, i=inlined):
A SP 0x000000016b0525f0 IP 0x0000000105abe680 size=96 java.lang.Class.getTypeName(DynamicHub.java)
A SP 0x000000016b052650 IP 0x00000001051f4148 size=768 com.oracle.svm.core.snippets.ImplicitExceptions.throwNewClassCastExceptionWithArgs(ImplicitExceptions.java:311)
A SP 0x000000016b052950 IP 0x00000001064f4164 size=96 kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:28)
A SP 0x000000016b0529b0 IP 0x00000001069aaaf4 size=96 kotlinx.coroutines.DispatchedTaskKt.resume(DispatchedTask.kt:229)
i SP 0x000000016b052a10 IP 0x00000001069a1480 size=80 kotlinx.coroutines.DispatchedTaskKt.dispatch(DispatchedTask.kt:162)
A SP 0x000000016b052a10 IP 0x00000001069a1480 size=80 kotlinx.coroutines.CancellableContinuationImpl.dispatchResume(CancellableContinuationImpl.kt:470)
i SP 0x000000016b052a60 IP 0x00000001069abee4 size=80 kotlinx.coroutines.CancellableContinuationImpl.resumeImpl$kotlinx_coroutines_core(CancellableContinuationImpl.kt:504)
i SP 0x000000016b052a60 IP 0x00000001069abee4 size=80 kotlinx.coroutines.CancellableContinuationImpl.resumeImpl$kotlinx_coroutines_core$default(CancellableContinuationImpl.kt:493)
i SP 0x000000016b052a60 IP 0x00000001069abee4 size=80 kotlinx.coroutines.CancellableContinuationImpl.resumeUndispatched(CancellableContinuationImpl.kt:596)
A SP 0x000000016b052a60 IP 0x00000001069abee4 size=80 kotlinx.coroutines.EventLoopImplBase$DelayedResumeTask.run(EventLoop.common.kt:497)
A SP 0x000000016b052ab0 IP 0x00000001069ae57c size=48 kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:263)
A SP 0x000000016b052ae0 IP 0x000000010699d8f0 size=80 kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:95)
i SP 0x000000016b052b30 IP 0x0000000105062528 size=112 kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:69)
i SP 0x000000016b052b30 IP 0x0000000105062528 size=112 kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source)
i SP 0x000000016b052b30 IP 0x0000000105062528 size=112 kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:47)
i SP 0x000000016b052b30 IP 0x0000000105062528 size=112 kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source)
A SP 0x000000016b052b30 IP 0x0000000105062528 size=112 com.jakewharton.mosaic.BlockingKt.runMosaicBlocking(blocking.kt:6)
A SP 0x000000016b052ba0 IP 0x0000000104f9f068 size=48 com.example.DemoCommand.run(DemoCommand.kt:24)
...
I tried to debug the issue a bit more, and some outcomes so far:
- the crash from https://github.com/Kotlin/kotlinx.coroutines/issues/4146#issuecomment-2149476543 happens with coroutines 1.8.1 too;
- as I mentioned in https://github.com/Kotlin/kotlinx.coroutines/issues/4146#issuecomment-2149375598, the issue does not show up for binaries compiled with -O1, but the method where the crash happens compiles the same way with both -O1 and -O2.
The crash is reproducible with Kotlin 1.9.24, Mosaic 0.12.0, and an older Compose version: 972-w-1924.zip
It is also reproducible with Graal EE 17. In all crash occurrences, it seems like we're hitting an NPE:
(lldb) disassemble
demo`CombinedContext_get_210292045824580ad9d3693d539e87b522b69d40:
0x10191c210 <+0>: sub x8, sp, #0x40
0x10191c214 <+4>: ldr x9, [x28, #0x8]
0x10191c218 <+8>: cmp x8, x9
0x10191c21c <+12>: b.ls 0x10191c364 ; <+340>
0x10191c220 <+16>: stp x29, x30, [sp, #-0x10]
0x10191c224 <+20>: sub x29, sp, #0x10
0x10191c228 <+24>: mov sp, x8
0x10191c22c <+28>: str x1, [sp, #0x28]
0x10191c230 <+32>: cmp x1, x27
0x10191c234 <+36>: b.eq 0x10191c2e8 ; <+216>
0x10191c238 <+40>: b 0x10191c2bc ; <+172>
0x10191c23c <+44>: nop
0x10191c240 <+48>: nop
0x10191c244 <+52>: nop
0x10191c248 <+56>: nop
0x10191c24c <+60>: nop
0x10191c250 <+64>: str x0, [sp, #0x20]
0x10191c254 <+68>: mov w30, w30
0x10191c258 <+72>: add x30, x27, x30, lsl #3
0x10191c25c <+76>: ldr w2, [x30]
0x10191c260 <+80>: lsr w2, w2, #5
0x10191c264 <+84>: mov w2, w2
0x10191c268 <+88>: add x2, x27, x2, lsl #3
-> 0x10191c26c <+92>: ldr x2, [x2, #0xf0]
0x10191c270 <+96>: mov x0, x30
0x10191c274 <+100>: mov x4, x1
0x10191c278 <+104>: mov x30, x2
0x10191c27c <+108>: blr x30
0x10191c280 <+112>: nop
0x10191c284 <+116>: cmp x0, x27
0x10191c288 <+120>: b.ne 0x10191c2cc ; <+188>
0x10191c28c <+124>: ldr x0, [sp, #0x20]
0x10191c290 <+128>: ldr w1, [x0, #0x4]
0x10191c294 <+132>: mov w30, w1
0x10191c298 <+136>: add x30, x27, x30, lsl #3
0x10191c29c <+140>: cbz w1, 0x10191c30c ; <+252>
0x10191c2a0 <+144>: ldr w0, [x30]
0x10191c2a4 <+148>: lsr w0, w0, #5
0x10191c2a8 <+152>: mov w2, #0x965f
0x10191c2ac <+156>: movk w2, #0x18, lsl #16
0x10191c2b0 <+160>: cmp w0, w2
0x10191c2b4 <+164>: b.ne 0x10191c30c ; <+252>
0x10191c2b8 <+168>: mov x0, x30
0x10191c2bc <+172>: ldr x1, [sp, #0x28]
0x10191c2c0 <+176>: ldr w30, [x0, #0x8]
0x10191c2c4 <+180>: cbnz w30, 0x10191c250 ; <+64>
0x10191c2c8 <+184>: b 0x10191c354 ; <+324>
0x10191c2cc <+188>: ldp x29, x30, [sp, #0x30]
0x10191c2d0 <+192>: add sp, sp, #0x40
0x10191c2d4 <+196>: ldr w8, [x28, #0x10]
0x10191c2d8 <+200>: subs w8, w8, #0x1
0x10191c2dc <+204>: str w8, [x28, #0x10]
0x10191c2e0 <+208>: b.le 0x10191c368 ; <+344>
0x10191c2e4 <+212>: ret
0x10191c2e8 <+216>: str x0, [sp, #0x20]
0x10191c2ec <+220>: mov w2, #0x4f90
0x10191c2f0 <+224>: movk w2, #0x11, lsl #16
0x10191c2f4 <+228>: add x2, x27, x2, lsl #3
0x10191c2f8 <+232>: mov x0, x2
0x10191c2fc <+236>: bl 0x101925da0 ; Intrinsics_throwParameterIsNullNPE_d91c2ab47c1f26bdc56e3112fa34c2d996ec1c47
0x10191c300 <+240>: nop
0x10191c304 <+244>: ldr x0, [sp, #0x20]
0x10191c308 <+248>: b 0x10191c2bc ; <+172>
0x10191c30c <+252>: cbz w1, 0x10191c35c ; <+332>
0x10191c310 <+256>: ldr w0, [x30]
0x10191c314 <+260>: lsr w0, w0, #5
0x10191c318 <+264>: mov w0, w0
0x10191c31c <+268>: add x0, x27, x0, lsl #3
0x10191c320 <+272>: ldr x2, [x0, #0xf0]
0x10191c324 <+276>: mov x0, x30
0x10191c328 <+280>: ldr x1, [sp, #0x28]
0x10191c32c <+284>: mov x30, x2
0x10191c330 <+288>: blr x30
0x10191c334 <+292>: nop
0x10191c338 <+296>: ldp x29, x30, [sp, #0x30]
0x10191c33c <+300>: add sp, sp, #0x40
0x10191c340 <+304>: ldr w8, [x28, #0x10]
0x10191c344 <+308>: subs w8, w8, #0x1
0x10191c348 <+312>: str w8, [x28, #0x10]
0x10191c34c <+316>: b.le 0x10191c368 ; <+344>
0x10191c350 <+320>: ret
0x10191c354 <+324>: bl 0x1006ad180 ; ImplicitExceptions_throwNewNullPointerException_83e8a13d7d211b9efc4f752fd9d24059e9defb61
0x10191c358 <+328>: nop
0x10191c35c <+332>: bl 0x1006ad180 ; ImplicitExceptions_throwNewNullPointerException_83e8a13d7d211b9efc4f752fd9d24059e9defb61
0x10191c360 <+336>: nop
0x10191c364 <+340>: b 0x1005b3b70 ; StackOverflowCheckImpl_throwNewStackOverflowError_31341960d080a71e3dff8d322e20c16c7dc860eb
0x10191c368 <+344>: b 0x1006c5780 ; aq_enterSlowPathSafepointCheckObject_4ef7cb08b9d10a255d4fb63cbcd79b0bd0e7a19c
0x10191c36c <+348>: .long 0xcccccccc ; unknown opcode
(lldb) register read
General Purpose Registers:
x0 = 0x0000000282f315d8
x1 = 0x00000002815b61e0
x2 = 0x0000000280000000
x3 = 0x00000000ffffffd9
x4 = 0x0000000282f31610
x5 = 0x0000000282f5e3e0
x6 = 0x00000000005e6291
x7 = 0x00000000a0000802
x8 = 0x000000016fdfe620
x9 = 0x000000016f60e000
x10 = 0x0000000000000000
x11 = 0x0000000282f5e3e8
x12 = 0x0000000000000001
x13 = 0x0000000000000000
x14 = 0x0000000282f5e278
x15 = 0x0000000280f31b10
x16 = 0x0000000282f5e3d0
x17 = 0x00000002814ac170
x18 = 0x0000000000000000
x19 = 0x00000000005e7ebb
x20 = 0x0000000282f3f5d8
x21 = 0x00000000005e7e46
x22 = 0x0000000283000000
x23 = 0x0000000282f3f248
x24 = 0x0000000000000000
x25 = 0x0000000000000000
x26 = 0x0000000282f31270
x27 = 0x0000000280000000
x28 = 0x0000000156704b80
fp = 0x000000016fdfe650
lr = 0x0000000282f2f458
sp = 0x000000016fdfe620
pc = 0x000000010191c26c demo`CombinedContext_get_210292045824580ad9d3693d539e87b522b69d40 + 92
cpsr = 0x20001000
At the time of the crash, the memory referenced by x30 contains 0.
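The arithmetic around the faulting instruction can be replayed in a few lines. This is a sketch that assumes the sequence from `<+76>` to `<+92>` decodes a compressed class pointer (header word shifted right by 5, scaled by 8, added to the heap base in x27) and then loads a dispatch-table entry at offset 0xf0. That reading of the instructions and the name `dispatchSlotAddress` are assumptions; only the constants come from the dump.

```kotlin
// Replays: ldr w2, [x30] ; lsr w2, w2, #5 ; mov w2, w2 ; add x2, x27, x2, lsl #3 ; ldr x2, [x2, #0xf0]
fun dispatchSlotAddress(headerWord: Int, heapBase: Long): Long {
    // lsr w2, w2, #5 followed by mov w2, w2 (zero-extension to 64 bits)
    val index = (headerWord ushr 5).toLong() and 0xFFFF_FFFFL
    // add x2, x27, x2, lsl #3
    val base = heapBase + (index shl 3)
    // address used by the faulting ldr x2, [x2, #0xf0]
    return base + 0xF0
}

fun main() {
    val heapBase = 0x0000000280000000L // x27 in the register dump
    // A zero header word, as observed at the crash:
    println("0x%x".format(dispatchSlotAddress(0, heapBase))) // prints 0x2800000f0
}
```

With a zero header word the computed base equals the heap base itself, which matches x2 = x27 = 0x0000000280000000 in the register dump at the faulting instruction.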
@sgammon have you succeeded filing an issue against GraalVM?
From where we stand, it looks more like a Graal issue than not.
@qwwdfsad Yes, there is an issue filed with GraalVM. I'll try to cross tag it.
I would have thought the same thing, but I just recently witnessed this bug surface with Proguard too.
It seems this issue can surface through bytecode optimization as well as AOT compilation.
@sgammon, https://github.com/oracle/graal/issues/9046 was closed because the assigned engineer is missing a reproducer published on GitHub. Could you please take a look at https://github.com/oracle/graal/issues/9046#issuecomment-2194811266?
@fzhinkin I didn't get the tag! Yes, I will respond there now, and ping the GraalVM team on their Slack. That is the best place to collaborate with them. Maybe you guys could join there as well, since Kotlin uses Slack.
You can join here. I'll ping them now. Thanks for letting me know
@qwwdfsad Unfortunately I am still seeing this crash. I can replicate it in an easier way now. Should I file a new bug, or has it been determined that this can't be fixed from the Coroutines side? It still seems to be related to AtomicFU.
We are all for applying the workaround if there is one, but the core issue is on the Graal side and worth addressing here. I believe it still awaits the reproducer there.