kotlinx.coroutines

Port a bugfix for the coroutine scheduler

Open dkhalanskyjb opened this issue 4 months ago • 4 comments

https://github.com/Kotlin/kotlinx.coroutines/pull/4132/commits/cc1ad3d33dc787a7402f1f442e256e5029e04cd2 describes a bug in the coroutine scheduler. We should look at it, write tests confirming the bug, and port the fix.

dkhalanskyjb • Jul 31 '25 11:07

What are the possible consequences of this bug?

LouisCAD • Aug 20 '25 08:08

@vsalavatov could you provide some context for your commit, e.g. how you discovered the bug? I am also interested in this, as I might work on bringing it in.

murfel • Aug 20 '25 16:08

The comment in the commit explains the race; it's very similar to https://github.com/Kotlin/kotlinx.coroutines/issues/3660 (which also seems to be fixed by this commit, though that still needs to be checked).

Given how the scheduler is implemented (and polished) at this point, all of its concurrency bugs have the same nature:

Under very specific circumstances, there is a so-called "liveness" miss -- there is a task in some queue, there are idle threads, but these threads are not looking for the task. The moment the scheduler state is disturbed (i.e. another task is submitted), the problem disappears. Both bugs (the one described in the commit's comment and the one in the issue) are of this nature.

You might want to look at the scheduler stress tests, as they are written in a very specific manner to catch bugs like that: the scheduler is loaded, then all threads halt (with a latch or a barrier), and with a timeout we either detect a lack of progress or reset and try again.
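For illustration only, the shape of such a test could be sketched roughly as below. This is not the actual kotlinx.coroutines stress test: a plain `java.util.concurrent.Executor` stands in for the coroutine scheduler, and the names `LivenessStressSketch`, `runRound` and `firstStalledRound` are made up for this sketch. The key point is that after a burst of tasks nothing else is submitted, so a stranded task cannot be rescued by new work disturbing the scheduler state, and only the timeout can reveal it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a generic Executor plays the role of the scheduler under test.
final class LivenessStressSketch {

    /**
     * Submits a burst of tasks and then only waits. Returns true if every task ran
     * within the timeout, false if some task was never picked up (a liveness miss).
     */
    static boolean runRound(Executor scheduler, int tasks, long timeoutMs) {
        CountDownLatch done = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++) {
            scheduler.execute(done::countDown);
        }
        // Deliberately no further submissions: extra work could wake an idle
        // worker and hide the bug we are trying to observe.
        try {
            return done.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Repeats the round many times; returns the index of the first stalled round, or -1. */
    static int firstStalledRound(int rounds, int threads, int tasksPerRound, long timeoutMs) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (int round = 0; round < rounds; round++) {
                if (!runRound(pool, tasksPerRound, timeoutMs)) {
                    return round; // lack of progress detected
                }
                // Otherwise the pool is quiescent again: reset and try again.
            }
            return -1;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Since such a stall may show up only once in hundreds or thousands of rounds, the round count would need to be large and the timeout generous enough that a slow machine is not mistaken for a stall.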

qwwdfsad • Aug 20 '25 16:08

@murfel, sure! I've had a lot of fun with this bug (literally)! :)

I discovered it while I was working on #4132. After writing a patch, I wrote a liveness stress test for parallelism compensation (here). It sometimes stalled, I believe in a few runs out of 1000. I decided to investigate this stall using logging. At first I tried plain printlns right in the coroutine scheduler code, but that made the problem worse. I suspect that the syscalls inside prints do not play well when you expect certain guarantees about who consumes LockSupport.park/unpark requests and when. So I decided to work around it: instead of using prints, I put message strings into a cyclic buffer that was dumped after a certain timeout by another thread (here). After adding a bunch of log points and reviewing the logs, this scenario became apparent. This approach still changes the coroutine scheduler behavior somewhat (atomic writes may affect memory ordering), but I guess there is no better way (I'll be glad if someone proves me wrong) :)
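To make the idea concrete, here is a minimal sketch of an in-memory cyclic log of that kind. It is not the code from the linked patch; the class `RingLogSketch` and its methods `log`, `dump` and `dumpAfter` are hypothetical names chosen for this example. Log points do no I/O and take no locks, and a separate daemon thread prints the most recent entries once a timeout expires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical sketch of a cyclic log buffer; not taken from kotlinx.coroutines.
final class RingLogSketch {
    private final AtomicReferenceArray<String> slots;
    private final AtomicLong next = new AtomicLong();

    RingLogSketch(int capacity) {
        slots = new AtomicReferenceArray<>(capacity);
    }

    /** Log point for hot paths: one atomic increment and one atomic store, no I/O. */
    void log(String message) {
        long seq = next.getAndIncrement();
        String entry = seq + " [" + Thread.currentThread().getName() + "] " + message;
        slots.set((int) (seq % slots.length()), entry);
    }

    /** Best-effort snapshot of the most recent entries, oldest first. */
    List<String> dump() {
        long end = next.get();
        long start = Math.max(0, end - slots.length());
        List<String> out = new ArrayList<>();
        for (long seq = start; seq < end; seq++) {
            String entry = slots.get((int) (seq % slots.length()));
            if (entry != null) {
                out.add(entry);
            }
        }
        return out;
    }

    /** Starts a daemon thread that prints the buffer after delayMs unless interrupted first. */
    Thread dumpAfter(long delayMs) {
        Thread dumper = new Thread(() -> {
            try {
                Thread.sleep(delayMs);
            } catch (InterruptedException e) {
                return; // cancelled: the run finished without stalling
            }
            dump().forEach(System.err::println);
        }, "ring-log-dumper");
        dumper.setDaemon(true);
        dumper.start();
        return dumper;
    }
}
```

A test would call `dumpAfter` before starting the workload and interrupt the returned thread if the run completes normally. The snapshot is only approximate while writers are still active, and the atomic operations are themselves synchronization points, so this kind of logging can still perturb the timing it is meant to observe.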

vsalavatov • Aug 20 '25 17:08