criu Signals: non-safe usage of pr

We have several handlers for signals for different cases. At least two of them (others I didn't check) use pr_err() function. This is not safe not only in means of messages corruptions (which is not neat, but a minor thing), but also in means of locking. sigchld_handler()does:

	if (exit)
		pr_err("%d exited, status=%d\n", pid, status);
	else
		pr_err("%d killed by signal %d: %s\n",
			pid, status, strsignal(status));

And usernsd_handler() does:

		if (!ns)
			pr_err("Spurious pid ns helper: pid=%d\n", pid);

		pr_err("%d finished unexpected: status=%d\n", pid, status);

But it looks like we have such a great feature as keeping first error message for RPC. It's done with log_note_err(), which does this:

		mutex_lock(&first_err->l);
		if (first_err->s[0] == '\0')
			strlcpy(first_err->s, msg, sizeof(first_err->s));
		mutex_unlock(&first_err->l);

TLDR; We may lockup in case of delivered signal while we've taken mutex in log_note_err() as the function also is called from inside pr_err() in signal handlers.

Jul 05 '17 20:07 0x7f454c46

A friendly reminder that this issue had no activity for 30 days.

Jan 20 '21 00:01 github-actions[bot]

A friendly reminder that this issue had no activity for 30 days.

Feb 20 '21 00:02 github-actions[bot]

A friendly reminder that this issue had no activity for 30 days.

Apr 03 '21 00:04 github-actions[bot]

My understanding of the issue is that when we call pr_err(), we will enter the log_note_err() function, and if log_note_err() encounters a signal, it will try to call pr_err() again to acquire a lock that has already been acquired. However, the first call to log_note_err() has already been interrupted by the signal and cannot release the lock. My idea is that perhaps we can use a spin lock to modify log_note_err. Does my idea make sense?

Mar 18 '23 07:03 Daz-3ux

I want to try to solve it. However, I am concerned that this may not be an easy bug to test. What tools should I use to debug the deadlock issue in CRIU?

Mar 18 '23 07:03 Daz-3ux

My understanding of the issue is that when we call pr_err(), we will enter the log_note_err() function, and if log_note_err() encounters a signal, it will try to call pr_err() again to acquire a lock that has already been acquired. However, the first call to log_note_err() has already been interrupted by the signal and cannot release the lock. My idea is that perhaps we can use a spin lock to modify log_note_err. Does my idea make sense?

TLDR; Is this a 'Self deadlock' problem? And can it be solved using spin locks + disabling interrupts?

Mar 18 '23 07:03 Daz-3ux

My understanding of the issue is that when we call pr_err(), we will enter the log_note_err() function, and if log_note_err() encounters a signal, it will try to call pr_err() again to acquire a lock that has already been acquired. However, the first call to log_note_err() has already been interrupted by the signal and cannot release the lock. My idea is that perhaps we can use a spin lock to modify log_note_err. Does my idea make sense?

TLDR; Is this a 'Self deadlock' problem? And can it be solved using spin locks + disabling interrupts?

CRIU is just a userspace process, we have no interrupts in userspace, but we have signals and preemption. More precisely, we have interrupts everywhere, but in userspace we have no control over them. So you can't disable preemption or interrupts in userspace. https://man7.org/linux/man-pages/man7/signal-safety.7.html

I want to try to solve it. However, I am concerned that this may not be an easy bug to test. What tools should I use to debug the deadlock issue in CRIU?

Here we have no reproducer, it's just an theoretical problem and we know precisely that calling pr_err inside signal handler is not safe, but we haven't got any real visible problems from that.

Mar 18 '23 09:03 mihalicyn

My understanding of the issue is that when we call pr_err(), we will enter the log_note_err() function, and if log_note_err() encounters a signal, it will try to call pr_err() again to acquire a lock that has already been acquired. However, the first call to log_note_err() has already been interrupted by the signal and cannot release the lock. My idea is that perhaps we can use a spin lock to modify log_note_err. Does my idea make sense?

TLDR; Is this a 'Self deadlock' problem? And can it be solved using spin locks + disabling interrupts?

CRIU is just a userspace process, we have no interrupts in userspace, but we have signals and preemption. More precisely, we have interrupts everywhere, but in userspace we have no control over them. So you can't disable preemption or interrupts in userspace. https://man7.org/linux/man-pages/man7/signal-safety.7.html

I want to try to solve it. However, I am concerned that this may not be an easy bug to test. What tools should I use to debug the deadlock issue in CRIU?

Here we have no reproducer, it's just an theoretical problem and we know precisely that calling pr_err inside signal handler is not safe, but we haven't got any real visible problems from that.

Thanks for your reply. I'm sorry I got mistakes with some concepts, but now I understand. Your explanation was very detailed and helpful.

Mar 18 '23 13:03 Daz-3ux

My understanding of the issue is that when we call pr_err(), we will enter the log_note_err() function, and if log_note_err() encounters a signal, it will try to call pr_err() again to acquire a lock that has already been acquired. However, the first call to log_note_err() has already been interrupted by the signal and cannot release the lock. My idea is that perhaps we can use a spin lock to modify log_note_err. Does my idea make sense?

TLDR; Is this a 'Self deadlock' problem? And can it be solved using spin locks + disabling interrupts?

CRIU is just a userspace process, we have no interrupts in userspace, but we have signals and preemption. More precisely, we have interrupts everywhere, but in userspace we have no control over them. So you can't disable preemption or interrupts in userspace. https://man7.org/linux/man-pages/man7/signal-safety.7.html

I want to try to solve it. However, I am concerned that this may not be an easy bug to test. What tools should I use to debug the deadlock issue in CRIU?

Here we have no reproducer, it's just an theoretical problem and we know precisely that calling pr_err inside signal handler is not safe, but we haven't got any real visible problems from that.

Thanks for your reply. I'm sorry I got mistakes with some concepts, but now I understand. Your explanation was very detailed and helpful.

You are always welcome! :)

Mar 19 '23 17:03 mihalicyn

criu
criu copied to clipboard

Signals: non-safe usage of pr_err() in handlers

criu criu copied to clipboard

Signals: non-safe usage of pr_err() in handlers

criu
criu copied to clipboard