domainslib icon indicating copy to clipboard operation
domainslib copied to clipboard

Bug in channel implementation using mutex and condition variables

Open SumitPadhiyar opened this issue 4 years ago • 3 comments

There seem to be a bug in channel implementation using mutex+condition variable as implemented in PR #17 which causes task_exn test to deadlock or loop infintely. The bug manifest itself only in some runs otherwise it works fine. I was able to find this subtle bug using ParaFuzz.

Steps to reproduce

  1. Install ParaFuzz with its dependencies.
  2. Install domainslib based on PR #17 & modified to work with parafuzz.
  3. Compile domainslib's test_exn.ml:
ocamlfind opt -linkpkg -package parafuzz-lib,domainslib task_exn.ml
  1. Execute binary few times to catch the bug.

Root cause

I had debugged the test program and suspect that this has to do with channel implementation using single mutex+condvar for all channels. Exact root cause is when two receivers are waiting for a message here and when sender sends a message here, it causes message to get lost. Receivers are waiting on same condition variable for the sender to write to a ref unique per waiting receiver. But the problem occurs when sender writes to ref r1(waited upon by receiver 1) and Condition_variable.signal signals wakes up receiver 2 which finds ref r2 not updated, so it waits again. So even after sending a message, both receiver are still waiting and the message is lost. The bug is mainly caused by the underlying assumption that waiting threads are woken up in the order in which they wait on the condition variable which is incorrect according to pthread_cond_signal documentation.

Fix

Fix is to use a unique mutex+condition_variable for each waiting receiver.

SumitPadhiyar avatar May 04 '21 06:05 SumitPadhiyar

Thanks for fuzzing domainslib, it is useful.

I'm a bit confused how your proposed fix will work. Isn't there a unique mutex+condition_variable for each waiting receiver in recv' already: https://github.com/ocaml-multicore/domainslib/blob/3d5f79c5cf6a295b8c0652fb7470211b33b05b81/lib/chan.ml#L150

I can see there is the problem if domainslib is used with systhreads.

ctk21 avatar May 04 '21 11:05 ctk21

With systhreads and continuations/fibers it is possible for multiple execution contexts (aka 'threads') to share the same Domain.DLS.get mutex_condvar_key since Domain.DLS is specific at the domain level.

I make fixing the problem with systhreads and continuations/fibers being to replace Condition.signal with Condition.broadcast.

ctk21 avatar May 04 '21 12:05 ctk21

With systhreads and continuations/fibers it is possible for multiple execution contexts (aka 'threads') to share the same Domain.DLS.get mutex_condvar_key since Domain.DLS is specific at the domain level.

I make fixing the problem with systhreads and continuations/fibers being to replace Condition.signal with Condition.broadcast.

I agree on the fix of replacing Condition.signal with Condition.broadcast as waiting receivers will still check for ref Msg_slot to be written, thereby waking up a single waiting receiver.

SumitPadhiyar avatar May 06 '21 09:05 SumitPadhiyar