domainslib
domainslib copied to clipboard
Bug in channel implementation using mutex and condition variables
There seem to be a bug in channel implementation using mutex+condition variable as implemented in PR #17 which causes task_exn test to deadlock or loop infintely. The bug manifest itself only in some runs otherwise it works fine. I was able to find this subtle bug using ParaFuzz.
Steps to reproduce
- Install ParaFuzz with its dependencies.
- Install domainslib based on PR #17 & modified to work with parafuzz.
- Compile domainslib's test_exn.ml:
ocamlfind opt -linkpkg -package parafuzz-lib,domainslib task_exn.ml
- Execute binary few times to catch the bug.
Root cause
I had debugged the test program and suspect that this has to do with channel implementation using single mutex+condvar for all channels. Exact root cause is when two receivers are waiting for a message here and when sender sends a message here, it causes message to get lost. Receivers are waiting on same condition variable for the sender to write to a ref unique per waiting receiver. But the problem occurs when sender writes to ref r1(waited upon by receiver 1) and Condition_variable.signal signals wakes up receiver 2 which finds ref r2 not updated, so it waits again. So even after sending a message, both receiver are still waiting and the message is lost. The bug is mainly caused by the underlying assumption that waiting threads are woken up in the order in which they wait on the condition variable which is incorrect according to pthread_cond_signal documentation.
Fix
Fix is to use a unique mutex+condition_variable for each waiting receiver.
Thanks for fuzzing domainslib, it is useful.
I'm a bit confused how your proposed fix will work. Isn't there a unique mutex+condition_variable for each waiting receiver in recv' already:
https://github.com/ocaml-multicore/domainslib/blob/3d5f79c5cf6a295b8c0652fb7470211b33b05b81/lib/chan.ml#L150
I can see there is the problem if domainslib is used with systhreads.
With systhreads and continuations/fibers it is possible for multiple execution contexts (aka 'threads') to share the same Domain.DLS.get mutex_condvar_key since Domain.DLS is specific at the domain level.
I make fixing the problem with systhreads and continuations/fibers being to replace Condition.signal with Condition.broadcast.
With systhreads and continuations/fibers it is possible for multiple execution contexts (aka 'threads') to share the same
Domain.DLS.get mutex_condvar_keysinceDomain.DLSis specific at the domain level.I make fixing the problem with systhreads and continuations/fibers being to replace
Condition.signalwithCondition.broadcast.
I agree on the fix of replacing Condition.signal with Condition.broadcast as waiting receivers will still check for ref Msg_slot to be written, thereby waking up a single waiting receiver.