ppp
ppp copied to clipboard
chat.c deadlocks
chat.c deadlocks when following happens.
- calls syslog()
- syslog() takes mutex (syslog_lock)
- While that mutex is held, signal handler is called
- signal handlers in chat.c calls syslog()
- syslog() takes mutex (syslog_lock)
- deadlock
It seems pppd/main.c circumvent the same problem by not calling syslog(). https://github.com/ppp-project/ppp/commit/7bb5e2cdca9445e1c9ebf3a29efc80ef9d3cbed4
Interesting. Do you have a way to trigger that scenario at will? It would help in testing a fix if you can supply such a method.
I'm afraid I don't have a way to trigger the problem at will.
The moble dongle I use is huawei 604HW. https://consumer.huawei.com/jp/mobile-broadband/604hw/specs/
This one notifies the signal strength asynchronously. It seems that when the timeout of AT command and receiving signal strength notification happen at the same time, deadlock occurs.
- send AT command
- receives signal strength notification. try to syslog() the signal strength
- alarm expires because of timeout of AT command. alarm handler calls syslog()
- deadlock
Calling any function that is not listed as async-signal-safe, i.e. see "man 7 signal-safety" on a Linux system should not be used from a signal handler. I've encountered number of permutations where weird things happens to the program and it's stuck waiting for a mutex down in libc. Could be syslog like this instance here, could be fwrite(), tzset() (via e.g. syslog), etc. Trying to do much beyond setting a flag, or using write(fd, ..) is kind of moot from a signal handler.
@anolivetree could you try to execute the chat application with a -S (capital "S"), that should turn off the "to syslog" feature. You could specify "-s" -- lowercase "s" to write to stderr and redirect that to a file. However, it will invoke the fprintf() function which is still not going to fix it for you.
Reviewing the code, it seems like making it completely fail-proof from any async-signal-unsafe function is going to be difficult. falal() calls terminate(), and it may flush the terminal and use calls to fprintf() and terminal related functions.
@enaess it seems that it doesn't deadlock with -S option. I've tried connecting periorically with and without -S option. With -S, chat didn't deadlock in one hour trial. Without -S, chat deadlocked two times in 30 minutes trial.
Following is a workaround which blocks signals in msgf() and fatal(). I know it's not optimal but it circumvent the deadlock.
diff --git a/chat/chat.c b/chat/chat.c
index 710dba9..eff7097 100644
--- a/chat/chat.c
+++ b/chat/chat.c
@@ -492,6 +492,23 @@ Usage: %s [-e] [-E] [-v] [-V] [-t timeout] [-r report-file]\n\
char line[1024];
+void block_catching_signals(sigset_t *oldset)
+{
+ sigset_t set;
+
+ sigemptyset(&set);
+ sigaddset(&set, SIGALRM);
+ sigaddset(&set, SIGINT);
+ sigaddset(&set, SIGTERM);
+ sigaddset(&set, SIGHUP);
+ sigprocmask(SIG_BLOCK, &set, oldset);
+}
+
+void unblock_catching_signals(const sigset_t *oldset)
+{
+ sigprocmask(SIG_SETMASK, oldset, NULL);
+}
+
/*
* Send a message to syslog and/or stderr.
*/
@@ -507,11 +524,16 @@ void msgf __V((const char *fmt, ...))
fmt = va_arg(args, char *);
#endif
+ sigset_t oldset;
+ block_catching_signals(&oldset);
+
vfmtmsg(line, sizeof(line), fmt, args);
if (to_log)
syslog(LOG_INFO, "%s", line);
if (to_stderr)
fprintf(stderr, "%s\n", line);
+
+ unblock_catching_signals(&oldset);
}
/*
@@ -532,11 +554,17 @@ void fatal __V((int code, const char *fmt, ...))
fmt = va_arg(args, char *);
#endif
+ sigset_t oldset;
+ block_catching_signals(&oldset);
+
vfmtmsg(line, sizeof(line), fmt, args);
if (to_log)
syslog(LOG_ERR, "%s", line);
if (to_stderr)
fprintf(stderr, "%s\n", line);
+
+ unblock_catching_signals(&oldset);
+
terminate(code);
}
Your patch would prevent syslog from being called again even from a signal handler given that msgf and fatal are the two functions that actually calls syslog.
- In the unlikely event that SIGALRM or SIGINT is triggered when chat is inside a call to fprintf() in libc from signal handler is still going to deadlock, right?
- terminate() still uses functions that are not async-signal-safe
- if someone commits a change to call syslog a different location of the code in the future, he/she lacks the context in which he also must block these signals.
I wonder if there would be worthwhile to set a global flag and process this in the loop of the main() or do_file() function to indicate it's time to bail. Continue to use these functions from within the signal handler is still going to mask a potential future problem even if it blocks call to syslog.
Note that I am not in a position to test this. I don't have a way to reproduce this problem. Though, I appreciate you taking the time to produce a patch for it.
- In the unlikely event that SIGALRM or SIGINT is triggered when chat is inside a call to fprintf() in libc from signal handler is still going to deadlock, right?
If don't know if fpritnf takes the same futex lock as syslog(), but if it does it should deadlock.
- if someone commits a change to call syslog a different location of the code in the future, he/she lacks the context in which he also must block these signals.
True. That's why I didn't make a pull request because my patch should not be merged into official source code.
I wonder if there would be worthwhile to set a global flag and process this in the loop of the main() or do_file() function to indicate it's time to bail.
It would be a correct way to solve the problem.
I appreciate you taking the time to produce a patch for it.
I don't know if I can prepare a patch anytime soon. If you create a patch, please let me know. I'll test it.
@anolivetree: Have you looked?
@Neustradamus I'm afraid not. I've been busy recently. I hope someone would make a fix for it.
@anolivetree: Maybe you have looked since some months?
I pushed a commit on the 'chatsig' branch in this repository that should fix this problem. Review and/or testing would be welcome.
I reviewed the 1 change in the branch and it looks good to me. Hopefully if @anolivetree is able to give it a swing to test it then we should be good.
@paulusmack has done a PR here:
- https://github.com/ppp-project/ppp/pull/375
What do you think?
I must have replied his email or something. Did review the change a while back and it looked good to me. Don't have a way to test it though, was hoping that the issue it closes has a collaborator who could help with that.