ppp icon indicating copy to clipboard operation
ppp copied to clipboard

chat.c deadlocks

Open anolivetree opened this issue 3 years ago • 10 comments

chat.c deadlocks when following happens.

  1. calls syslog()
  2. syslog() takes mutex (syslog_lock)
  3. While that mutex is held, signal handler is called
  4. signal handlers in chat.c calls syslog()
  5. syslog() takes mutex (syslog_lock)
  6. deadlock

It seems pppd/main.c circumvent the same problem by not calling syslog(). https://github.com/ppp-project/ppp/commit/7bb5e2cdca9445e1c9ebf3a29efc80ef9d3cbed4

anolivetree avatar Oct 14 '21 08:10 anolivetree

Interesting. Do you have a way to trigger that scenario at will? It would help in testing a fix if you can supply such a method.

paulusmack avatar Oct 16 '21 03:10 paulusmack

I'm afraid I don't have a way to trigger the problem at will.

The moble dongle I use is huawei 604HW. https://consumer.huawei.com/jp/mobile-broadband/604hw/specs/

This one notifies the signal strength asynchronously. It seems that when the timeout of AT command and receiving signal strength notification happen at the same time, deadlock occurs.

  1. send AT command
  2. receives signal strength notification. try to syslog() the signal strength
  3. alarm expires because of timeout of AT command. alarm handler calls syslog()
  4. deadlock

anolivetree avatar Oct 16 '21 14:10 anolivetree

Calling any function that is not listed as async-signal-safe, i.e. see "man 7 signal-safety" on a Linux system should not be used from a signal handler. I've encountered number of permutations where weird things happens to the program and it's stuck waiting for a mutex down in libc. Could be syslog like this instance here, could be fwrite(), tzset() (via e.g. syslog), etc. Trying to do much beyond setting a flag, or using write(fd, ..) is kind of moot from a signal handler.

enaess avatar Oct 21 '21 19:10 enaess

@anolivetree could you try to execute the chat application with a -S (capital "S"), that should turn off the "to syslog" feature. You could specify "-s" -- lowercase "s" to write to stderr and redirect that to a file. However, it will invoke the fprintf() function which is still not going to fix it for you.

Reviewing the code, it seems like making it completely fail-proof from any async-signal-unsafe function is going to be difficult. falal() calls terminate(), and it may flush the terminal and use calls to fprintf() and terminal related functions.

enaess avatar Oct 21 '21 19:10 enaess

@enaess it seems that it doesn't deadlock with -S option. I've tried connecting periorically with and without -S option. With -S, chat didn't deadlock in one hour trial. Without -S, chat deadlocked two times in 30 minutes trial.

anolivetree avatar Nov 04 '21 05:11 anolivetree

Following is a workaround which blocks signals in msgf() and fatal(). I know it's not optimal but it circumvent the deadlock.

diff --git a/chat/chat.c b/chat/chat.c
index 710dba9..eff7097 100644
--- a/chat/chat.c
+++ b/chat/chat.c
@@ -492,6 +492,23 @@ Usage: %s [-e] [-E] [-v] [-V] [-t timeout] [-r report-file]\n\

 char line[1024];

+void block_catching_signals(sigset_t *oldset)
+{
+    sigset_t set;
+
+    sigemptyset(&set);
+    sigaddset(&set, SIGALRM);
+    sigaddset(&set, SIGINT);
+    sigaddset(&set, SIGTERM);
+    sigaddset(&set, SIGHUP);
+    sigprocmask(SIG_BLOCK, &set, oldset);
+}
+
+void unblock_catching_signals(const sigset_t *oldset)
+{
+    sigprocmask(SIG_SETMASK, oldset, NULL);
+}
+
 /*
  * Send a message to syslog and/or stderr.
  */
@@ -507,11 +524,16 @@ void msgf __V((const char *fmt, ...))
     fmt = va_arg(args, char *);
 #endif

+    sigset_t oldset;
+    block_catching_signals(&oldset);
+
     vfmtmsg(line, sizeof(line), fmt, args);
     if (to_log)
    syslog(LOG_INFO, "%s", line);
     if (to_stderr)
    fprintf(stderr, "%s\n", line);
+
+    unblock_catching_signals(&oldset);
 }

 /*
@@ -532,11 +554,17 @@ void fatal __V((int code, const char *fmt, ...))
     fmt = va_arg(args, char *);
 #endif

+    sigset_t oldset;
+    block_catching_signals(&oldset);
+
     vfmtmsg(line, sizeof(line), fmt, args);
     if (to_log)
    syslog(LOG_ERR, "%s", line);
     if (to_stderr)
    fprintf(stderr, "%s\n", line);
+
+    unblock_catching_signals(&oldset);
+
     terminate(code);
 }

anolivetree avatar Nov 04 '21 07:11 anolivetree

Your patch would prevent syslog from being called again even from a signal handler given that msgf and fatal are the two functions that actually calls syslog.

  • In the unlikely event that SIGALRM or SIGINT is triggered when chat is inside a call to fprintf() in libc from signal handler is still going to deadlock, right?
  • terminate() still uses functions that are not async-signal-safe
  • if someone commits a change to call syslog a different location of the code in the future, he/she lacks the context in which he also must block these signals.

I wonder if there would be worthwhile to set a global flag and process this in the loop of the main() or do_file() function to indicate it's time to bail. Continue to use these functions from within the signal handler is still going to mask a potential future problem even if it blocks call to syslog.

Note that I am not in a position to test this. I don't have a way to reproduce this problem. Though, I appreciate you taking the time to produce a patch for it.

enaess avatar Nov 04 '21 16:11 enaess

  • In the unlikely event that SIGALRM or SIGINT is triggered when chat is inside a call to fprintf() in libc from signal handler is still going to deadlock, right?

If don't know if fpritnf takes the same futex lock as syslog(), but if it does it should deadlock.

  • if someone commits a change to call syslog a different location of the code in the future, he/she lacks the context in which he also must block these signals.

True. That's why I didn't make a pull request because my patch should not be merged into official source code.

I wonder if there would be worthwhile to set a global flag and process this in the loop of the main() or do_file() function to indicate it's time to bail.

It would be a correct way to solve the problem.

I appreciate you taking the time to produce a patch for it.

I don't know if I can prepare a patch anytime soon. If you create a patch, please let me know. I'll test it.

anolivetree avatar Nov 05 '21 00:11 anolivetree

@anolivetree: Have you looked?

Neustradamus avatar Mar 29 '22 00:03 Neustradamus

@Neustradamus I'm afraid not. I've been busy recently. I hope someone would make a fix for it.

anolivetree avatar Mar 29 '22 23:03 anolivetree

@anolivetree: Maybe you have looked since some months?

Neustradamus avatar Oct 25 '22 17:10 Neustradamus

I pushed a commit on the 'chatsig' branch in this repository that should fix this problem. Review and/or testing would be welcome.

paulusmack avatar Nov 28 '22 05:11 paulusmack

I reviewed the 1 change in the branch and it looks good to me. Hopefully if @anolivetree is able to give it a swing to test it then we should be good.

enaess avatar Nov 28 '22 17:11 enaess

@paulusmack has done a PR here:

  • https://github.com/ppp-project/ppp/pull/375

What do you think?

Neustradamus avatar Dec 06 '22 04:12 Neustradamus

I must have replied his email or something. Did review the change a while back and it looked good to me. Don't have a way to test it though, was hoping that the issue it closes has a collaborator who could help with that.

enaess avatar Dec 06 '22 05:12 enaess