zfs kernel deadlock when `zfs recv` is interrupted by a signal

The user thread doing the ioctl for zfs recv eventually gets (on the kernel side) to dmu_recv_stream which calls bqueue_enqueue for each block. bqueue_enqueue has:

	while (q->bq_size + item_size > q->bq_maxsize) {
		cv_wait_sig(&q->bq_add_cv, &q->bq_lock);
	}

cv_wait_sig is defined by (debug ifdefs elided)

#define cv_wait_sig(cvp, mp)        \
	spl_cv_wait((cvp), (mp), PRIBIO|PCATCH, #cvp)

void
spl_cv_wait(kcondvar_t *cvp, kmutex_t *mp, int flags, const char *msg)
{
    if (msg != NULL && msg[0] == '&')
        ++msg;  /* skip over '&' prefixes */

	mp->m_owner = NULL;
    (void) msleep(cvp, (lck_mtx_t *)&mp->m_lock, flags, msg, 0);
    mp->m_owner = current_thread();
}

This percolates down through xnu until it reaches lck_mtx_sleep (e.g. https://opensource.apple.com/source/xnu/xnu-2050.18.24/osfmk/kern/locks.c.auto.html) which does

	res = assert_wait(event, interruptible);
	if (res == THREAD_WAITING) {
		lck_mtx_unlock(lck);
		res = thread_block(THREAD_CONTINUE_NULL);
		if (!(lck_sleep_action & LCK_SLEEP_UNLOCK)) {
			if ((lck_sleep_action & LCK_SLEEP_SPIN))
				lck_mtx_lock_spin(lck);
			else
				lck_mtx_lock(lck);
		}
	}
	else
	if (lck_sleep_action & LCK_SLEEP_UNLOCK)
		lck_mtx_unlock(lck);

	KERNEL_DEBUG(MACHDBG_CODE(DBG_MACH_LOCKS, LCK_MTX_SLEEP_CODE) | DBG_FUNC_END, (int)res, 0, 0, 0, 0);

	return res;

If the current thread has already received a signal, then assert_wait returns THREAD_INTERRUPTED, lck_mtx_sleep never releases the mutex and returns, eventually all the way back to bqueue_enqueue's call to cv_wait_sig. But, since the mutex was never released, no bqueue_dequeue calls will have been able to make progress, and bqueue_enqueue will repeat this process forever.

I think the correct fix here would be to modify bqueue_enqueue (and bqueue_dequeue, which has the same problem) to use the uninterruptible versions; since there's nothing they can do (without changing their API) except retry, they might as well use the uninterruptible versions.

I have not looked to see if there are other uses of cv_wait_interruptible/cv_timedwait_interruptible/cv_wait_sig/cv_timedwait_sig which have issues.

Feb 08 '21 23:02 jesboat

Thanks for the issue and the details of it. We pass PCATCH to msleep() so we can detect the user hitting ^C for the signal, and have the kernel parts undo. I don't recall specifically which part detects that, but I don't think it was bqueue_dequeue(), so potentially what you suggest could work.

Potentially we could also detect THREAD_INTERRUPTED and do a little manual mutex release for progress.

Feb 09 '21 07:02 lundman

OK, so yes, you can definitely just change it to cv_wait() (without the _sig). But presumably upstream picked the _sig for a reason relevant to their platform.

I can also add, at the end of spl_cv_wait():

+    if (result == EINTR &&
+        (mp->m_waiters > 0 || mp->m_sleepers > 0)) {
+        printf("avoiding hang\n");
+        mutex_exit(mp);
+        (void) thread_block(THREAD_CONTINUE_NULL);
+        mutex_enter(mp);
+    }

    /*
     * 1 - condvar got cv_signal()/cv_broadcast()
     * 0 - received signal (kill -signal)
     */
    return (result == EINTR ? 0 : 1);

I added waiters, and sleepers just to avoid doing it everytime. Now to decide which fix to go with.

Mar 12 '21 08:03 lundman

zfs zfs copied to clipboard

kernel deadlock when `zfs recv` is interrupted by a signal

zfs
zfs copied to clipboard