zfs
zfs copied to clipboard
kernel deadlock when `zfs recv` is interrupted by a signal
The user thread doing the ioctl for zfs recv eventually gets (on the kernel side) to dmu_recv_stream which calls bqueue_enqueue for each block. bqueue_enqueue has:
while (q->bq_size + item_size > q->bq_maxsize) {
cv_wait_sig(&q->bq_add_cv, &q->bq_lock);
}
cv_wait_sig is defined by (debug ifdefs elided)
#define cv_wait_sig(cvp, mp) \
spl_cv_wait((cvp), (mp), PRIBIO|PCATCH, #cvp)
void
spl_cv_wait(kcondvar_t *cvp, kmutex_t *mp, int flags, const char *msg)
{
if (msg != NULL && msg[0] == '&')
++msg; /* skip over '&' prefixes */
mp->m_owner = NULL;
(void) msleep(cvp, (lck_mtx_t *)&mp->m_lock, flags, msg, 0);
mp->m_owner = current_thread();
}
This percolates down through xnu until it reaches lck_mtx_sleep (e.g. https://opensource.apple.com/source/xnu/xnu-2050.18.24/osfmk/kern/locks.c.auto.html) which does
res = assert_wait(event, interruptible);
if (res == THREAD_WAITING) {
lck_mtx_unlock(lck);
res = thread_block(THREAD_CONTINUE_NULL);
if (!(lck_sleep_action & LCK_SLEEP_UNLOCK)) {
if ((lck_sleep_action & LCK_SLEEP_SPIN))
lck_mtx_lock_spin(lck);
else
lck_mtx_lock(lck);
}
}
else
if (lck_sleep_action & LCK_SLEEP_UNLOCK)
lck_mtx_unlock(lck);
KERNEL_DEBUG(MACHDBG_CODE(DBG_MACH_LOCKS, LCK_MTX_SLEEP_CODE) | DBG_FUNC_END, (int)res, 0, 0, 0, 0);
return res;
If the current thread has already received a signal, then assert_wait returns THREAD_INTERRUPTED, lck_mtx_sleep never releases the mutex and returns, eventually all the way back to bqueue_enqueue's call to cv_wait_sig. But, since the mutex was never released, no bqueue_dequeue calls will have been able to make progress, and bqueue_enqueue will repeat this process forever.
I think the correct fix here would be to modify bqueue_enqueue (and bqueue_dequeue, which has the same problem) to use the uninterruptible versions; since there's nothing they can do (without changing their API) except retry, they might as well use the uninterruptible versions.
I have not looked to see if there are other uses of cv_wait_interruptible/cv_timedwait_interruptible/cv_wait_sig/cv_timedwait_sig which have issues.
Thanks for the issue and the details of it. We pass PCATCH to msleep() so we can detect the user hitting ^C for the signal, and have the kernel parts undo. I don't recall specifically which part detects that, but I don't think it was bqueue_dequeue(), so potentially what you suggest could work.
Potentially we could also detect THREAD_INTERRUPTED and do a little manual mutex release for progress.
OK, so yes, you can definitely just change it to cv_wait() (without the _sig). But presumably upstream picked the _sig for a reason relevant to their platform.
I can also add, at the end of spl_cv_wait():
+ if (result == EINTR &&
+ (mp->m_waiters > 0 || mp->m_sleepers > 0)) {
+ printf("avoiding hang\n");
+ mutex_exit(mp);
+ (void) thread_block(THREAD_CONTINUE_NULL);
+ mutex_enter(mp);
+ }
/*
* 1 - condvar got cv_signal()/cv_broadcast()
* 0 - received signal (kill -signal)
*/
return (result == EINTR ? 0 : 1);
I added waiters, and sleepers just to avoid doing it everytime. Now to decide which fix to go with.