chanl icon indicating copy to clipboard operation
chanl copied to clipboard

SEND/RECV deadlock...

Open nklein opened this issue 9 years ago • 8 comments

Someone posted a program to Reddit that deadlocks when using CHANL. I spent a bit of time looking into it. I think I see why it deadlocks.

The deadlock looks like a CHANL bug to me. Specifically, the use of RECV-GRABBED-VALUE-P can get into a situation where multiple SEND calls are blocking waiting for it to become true. If, while multiple SEND calls are blocking waiting for it to become true, multiple readers sneak in and all of them set it to true, then only one of the SEND calls will complete.

It seems to me this could happen even without the extra SEMAPHORE in the sample code. I think the semaphore just keeps the send threads from happening almost entirely serially.

(ql:quickload :chanl)  
(ql:quickload :bordeaux-threads)  
#+sbcl (ql:quickload :sb-concurrency)  

(defun test-chanl (n)  
  (loop for i from 0 below n do  
       (format t "~&Test: ~a~%" i)  
       (let* ((size 1024)  
              (test-repeat 4)  
              (chanl (make-instance 'chanl:unbounded-channel))  
              (semaphore #+sbcl (sb-thread:make-semaphore)  
                         #+ccl (make-semaphore)))  
         (loop repeat test-repeat do  
              (bt:make-thread  
               (lambda ()
                 #+sbcl (sb-thread:wait-on-semaphore semaphore)  
                 #+ccl (wait-on-semaphore semaphore)
                 (loop repeat size  
                    do (chanl:send chanl 1)))))  
         (loop repeat test-repeat  
            do #+sbcl (sb-thread:signal-semaphore semaphore)  
              #+ccl (signal-semaphore semaphore))  
         (loop repeat (* size test-repeat)  
            for i from 0  
            do (chanl:recv chanl))))  
  (format t "~&done.~%"))

I'll see if I can create some simpler code to trigger it. It looks like the fix would be switching to a counter (or semaphore) rather than a boolean for the number of senders waiting.

nklein avatar Nov 06 '15 17:11 nklein

This code isn't any shorter, but it does pretty reliably reproduce the issue for thread-count of 5 or 10.

(ql:quickload :chanl)  
(ql:quickload :bordeaux-threads)  

(defun test-chanl (thread-count)
  (let ((lock (bt:make-lock "counter-lock"))
        (receivers 0)
        (senders 0)
        (channel (make-instance 'chanl:unbounded-channel))
        (start nil))
    (macrolet ((with-counter ((place) &body body)
                 `(unwind-protect
                       (progn
                         (bt:with-lock-held (lock)
                           (incf ,place))
                         ,@body)
                    (bt:with-lock-held (lock)
                      (decf ,place)))))
      (flet ((recv-func ()
               (with-counter (receivers)
                 (chanl:recv channel)))

             (send-func ()
               (with-counter (senders)
                 (loop :until start :do (bt:thread-yield))
                 (chanl:send channel :foo))))

        (let ((threads (loop :for s :from 0 :below thread-count
                          :collect (bt:make-thread #'recv-func
                                                   :name (format nil
                                                                 "READER-~D"
                                                                 s))
                          :collect (bt:make-thread #'send-func
                                                   :name (format nil
                                                                 "SENDER-~D"
                                                                 s)))))

          ;; wait until all readers have started trying to read
          (loop :until (= receivers thread-count))

          ;; wait until all writers have started
          (loop :until (= senders thread-count))

          (format t "Starting SENDs.")
          (setf start t)

          ;; wait for all threads to finish
          (loop :for thread :in threads
             :do (bt:join-thread thread)))))))

nklein avatar Nov 06 '15 18:11 nklein

On 15-11-06 01:21 PM, Patrick Stein wrote:

This code isn't any shorter, but it does pretty reliably reproduce the issue for |thread-count| of 5 or 10.

|(ql:quickload :chanl) (ql:quickload :bordeaux-threads) (defun test-chanl (thread-count) (let ((lock (bt:make-lock "counter-lock")) (receivers 0) (senders 0) (channel (make-instance 'chanl:unbounded-channel)) (start nil)) (macrolet ((with-counter ((place) &body body) `(unwind-protect (progn (bt:with-lock-held (lock) (incf ,place)) ,@body) (bt:with-lock-held (lock) (decf ,place))))) (flet ((recv-func () (with-counter (receivers) (chanl:recv channel))) (send-func () (with-counter (senders) (loop :until start :do (bt:thread-yield)) (loop :repeat repeat-count :do (chanl:send channel :foo))))) (let ((threads (loop :for s :from 0 :below thread-count :collect (bt:make-thread #'recv-func :name (format nil "READER-~D" s)) :collect (bt:make-thread #'send-func :name (format nil "SENDER-~D" s))))) ;; wait until all readers have started trying to read (loop :until (= receivers thread-count)) ;; wait until all writers have started (loop :until (= senders thread-count)) (format t "Starting SENDs.") (setf start t) ;; wait for all threads to finish (loop :for thread :in threads :do (bt:join-thread thread))))))) |

— Reply to this email directly or view it on GitHub https://github.com/zkat/chanl/issues/11#issuecomment-154491785.

There's a bug in your test function. You are looping for repeat-count :repeats, which is an undefined variable.

B

fade avatar Nov 06 '15 18:11 fade

See the edit to my comment. I had repeat-count in and then took it out. Then, I accidentally undid my last modification while trying to indent everything four-spaces to paste in as Markdown code. There is no longer any repeat-count in the comment.

nklein avatar Nov 06 '15 18:11 nklein

The deadlock looks like a CHANL bug to me. Specifically, the use of RECV-GRABBED-VALUE-P can get into a situation where multiple SEND calls are blocking waiting for it to become true. If, while multiple SEND calls are blocking waiting for it to become true, multiple readers sneak in and all of them set it to true, then only one of the SEND calls will complete.

This scenario should be invalidated for unbuffered channels by the fact that the first action in their methods is requesting the channel lock; since the bug occurs even for unbuffered channels, there must be another scenario (or the issue is due to a problem with the locking implementation).

adlai avatar Mar 30 '19 14:03 adlai

The send/recv for unbuffered channel causes deadlock also. You can refer to comment:https://github.com/zkat/chanl/issues/13#issuecomment-541347280

imnisen avatar Oct 15 '19 03:10 imnisen

You can refer to comment:#13 (comment)

Thank you for bringing that comment to my attention; I'd likely have not noticed your message otherwise, due to the issue duplication.

The send/recv for unbuffered channel causes deadlock also.

Indeed, the Omega-th Law states that unfound bugs outnumber unimplemented tests.

adlai avatar Oct 16 '19 06:10 adlai

This should be fixed by commit adlai/chanl@a030296 although I am leaving the issue open until the fix is confirmed.

adlai avatar Jul 10 '20 16:07 adlai

Consideration of factors outside the immediately imported package have led to this issue's abandonment, without a confirmed fix.

adlai avatar Oct 09 '20 01:10 adlai