criu
criu copied to clipboard
Signals: non-safe usage of pr_err() in handlers
We have several handlers for signals for different cases.
At least two of them (others I didn't check) use pr_err()
function.
This is not safe not only in means of messages corruptions (which is not neat, but a minor thing), but also in means of locking.
sigchld_handler()
does:
if (exit)
pr_err("%d exited, status=%d\n", pid, status);
else
pr_err("%d killed by signal %d: %s\n",
pid, status, strsignal(status));
And usernsd_handler()
does:
if (!ns)
pr_err("Spurious pid ns helper: pid=%d\n", pid);
pr_err("%d finished unexpected: status=%d\n", pid, status);
But it looks like we have such a great feature as keeping first error message for RPC.
It's done with log_note_err()
, which does this:
mutex_lock(&first_err->l);
if (first_err->s[0] == '\0')
strlcpy(first_err->s, msg, sizeof(first_err->s));
mutex_unlock(&first_err->l);
TLDR; We may lockup in case of delivered signal while we've taken mutex in log_note_err()
as the function also is called from inside pr_err() in signal handlers.
A friendly reminder that this issue had no activity for 30 days.
A friendly reminder that this issue had no activity for 30 days.
A friendly reminder that this issue had no activity for 30 days.
My understanding of the issue is that when we call pr_err()
, we will enter the log_note_err()
function, and if log_note_err()
encounters a signal, it will try to call pr_err()
again to acquire a lock that has already been acquired. However, the first call to log_note_err()
has already been interrupted by the signal and cannot release the lock. My idea is that perhaps we can use a spin lock to modify log_note_err
. Does my idea make sense?
I want to try to solve it. However, I am concerned that this may not be an easy bug to test. What tools should I use to debug the deadlock issue in CRIU?
My understanding of the issue is that when we call
pr_err()
, we will enter thelog_note_err()
function, and iflog_note_err()
encounters a signal, it will try to callpr_err()
again to acquire a lock that has already been acquired. However, the first call tolog_note_err()
has already been interrupted by the signal and cannot release the lock. My idea is that perhaps we can use a spin lock to modifylog_note_err
. Does my idea make sense?
TLDR; Is this a 'Self deadlock' problem? And can it be solved using spin locks + disabling interrupts?
My understanding of the issue is that when we call
pr_err()
, we will enter thelog_note_err()
function, and iflog_note_err()
encounters a signal, it will try to callpr_err()
again to acquire a lock that has already been acquired. However, the first call tolog_note_err()
has already been interrupted by the signal and cannot release the lock. My idea is that perhaps we can use a spin lock to modifylog_note_err
. Does my idea make sense?TLDR; Is this a 'Self deadlock' problem? And can it be solved using spin locks + disabling interrupts?
CRIU is just a userspace process, we have no interrupts in userspace, but we have signals and preemption. More precisely, we have interrupts everywhere, but in userspace we have no control over them. So you can't disable preemption or interrupts in userspace. https://man7.org/linux/man-pages/man7/signal-safety.7.html
I want to try to solve it. However, I am concerned that this may not be an easy bug to test. What tools should I use to debug the deadlock issue in CRIU?
Here we have no reproducer, it's just an theoretical problem and we know precisely that calling pr_err inside signal handler is not safe, but we haven't got any real visible problems from that.
My understanding of the issue is that when we call
pr_err()
, we will enter thelog_note_err()
function, and iflog_note_err()
encounters a signal, it will try to callpr_err()
again to acquire a lock that has already been acquired. However, the first call tolog_note_err()
has already been interrupted by the signal and cannot release the lock. My idea is that perhaps we can use a spin lock to modifylog_note_err
. Does my idea make sense?TLDR; Is this a 'Self deadlock' problem? And can it be solved using spin locks + disabling interrupts?
CRIU is just a userspace process, we have no interrupts in userspace, but we have signals and preemption. More precisely, we have interrupts everywhere, but in userspace we have no control over them. So you can't disable preemption or interrupts in userspace. https://man7.org/linux/man-pages/man7/signal-safety.7.html
I want to try to solve it. However, I am concerned that this may not be an easy bug to test. What tools should I use to debug the deadlock issue in CRIU?
Here we have no reproducer, it's just an theoretical problem and we know precisely that calling pr_err inside signal handler is not safe, but we haven't got any real visible problems from that.
Thanks for your reply. I'm sorry I got mistakes with some concepts, but now I understand. Your explanation was very detailed and helpful.
My understanding of the issue is that when we call
pr_err()
, we will enter thelog_note_err()
function, and iflog_note_err()
encounters a signal, it will try to callpr_err()
again to acquire a lock that has already been acquired. However, the first call tolog_note_err()
has already been interrupted by the signal and cannot release the lock. My idea is that perhaps we can use a spin lock to modifylog_note_err
. Does my idea make sense?TLDR; Is this a 'Self deadlock' problem? And can it be solved using spin locks + disabling interrupts?
CRIU is just a userspace process, we have no interrupts in userspace, but we have signals and preemption. More precisely, we have interrupts everywhere, but in userspace we have no control over them. So you can't disable preemption or interrupts in userspace. https://man7.org/linux/man-pages/man7/signal-safety.7.html
I want to try to solve it. However, I am concerned that this may not be an easy bug to test. What tools should I use to debug the deadlock issue in CRIU?
Here we have no reproducer, it's just an theoretical problem and we know precisely that calling pr_err inside signal handler is not safe, but we haven't got any real visible problems from that.
Thanks for your reply. I'm sorry I got mistakes with some concepts, but now I understand. Your explanation was very detailed and helpful.
You are always welcome! :)