dlt-daemon
dlt-daemon copied to clipboard
Fix all processes using libdlt on Android terminating with SIGKILL
Android does not support pthread_cancel(), and never will, see https://android.googlesource.com/platform/bionic/+/master/docs/status.md . dlt_stop_threads() therefor tries to use pthread_kill() with SIGKILL to stop the housekeeping thread. This does not work as intended though. It forcefully kills the whole process, resulting in all processes using DLT on Android being terminated with SIGKILL.
Fix by using a pipe instead to wake housekeeping thread from poll() when it is time to exit. There are better mechanisms available on Linux, e.g. eventfd(), but use a pipe to limit differences between different platforms. pipe2() is used on Linux instead of pipe() to prevent O_CLOEXEC/execv() race in forking processes (_GNU_SOURCE needs to be defined on Android).
Even though DLT_NETWORK_TRACE_ENABLE is not supported on Android (no message queue implementation), make it so pthread_cancel() is not used there either by using a simple boolean variable. Allows for complete removal of dlt_user_cleanup_handler().
Modifications in dlt_user_log_check_user_message() look larger than they are due to removal of "if (fd != DLT_FD_INIT.." block. Want to call poll regardless now to always poll pipe. pollfd events with fd < 0 are ignored, so it is OK to always pass the pollfd entry for incoming messages.
Should also note that prior to this commit, poll() was called with DLT_USER_RECEIVE_NDELAY which is in ns while poll() expects ms. So timeout passed to poll() was (500 * 10^(6-3) / 3600)s ~= 138.8h when building with FIFO for local IPC. Might as well always pass -1.
Sorry for the many updates. I should have checked .travis.yml earlier to see what was actually done when checking pr, would have minimized updates a bit. All tests should pass now though.
Rebased to latest master.
@martin-ejdestig Sorry for my late response. Actually we have seen the same issue and have a solution but with different way to terminate thread via SIGUSR1. Can you take a look into following commit if this also solves? https://github.com/GENIVI/dlt-daemon/pull/279/commits/fae6bf7608adcf61f71c3240d2e96c8f64135642
Sure, I can have a look.
I just checked quickly and wonder a bit if using SIGUSR1 is safe. E.g. if the application uses SIGUSR1 for something it can still be delivered to DLT's housekeeping thread?
Hello @martin-ejdestig ,
I would agree with you that it could be a case if dlt application could also define SIGUSR1 itself. I think this implementation should be stated clearly in documentation so that dlt application could be aware of that.
And for your information, there are some links that I referred to: https://stackoverflow.com/questions/24891552/how-to-kill-a-child-thread-in-c-android-ndk https://gist.github.com/dram/5683646
Thanh
- I think it would be better if a library does not "occupy"
SIGUSR1. Nothing for application authors to care about. No documentation to read. :) And what do you do if you need to useSIGUSR1, say e.g. another library uses it. House keeping thread can be unexplicably stopped "randomly". There is no way to catch this until you notice problems or happen to read documentation/code. (SO post you linked to even says Dalvik usesSIGUSR1, soSIGUSR2needs to be used for problem poster had. Not relevant here perhaps, but just as an example of "problem" when lib/framework useSIGUSR1. :) - Exiting is also not "immediate" with a
poll()timeout. (Perhaps not relevant in practice.) pthread_cancel()is really not a good API (see e.g. the reasons why it will never be implemented in Android).- Different code path on Android compared to other Linux systems, so it will not be exercised as much.
For those reasons, I do not think using SIGUSR1, nor pthread_cancel() for that matter, is a good idea. But in practice, it maybe does not matter, since DLT on Android is not widely used and perhaps never will be except for as in dlt-logd-converter.
But I do not know. I "vote" for not using SIGUSR1 but it is up to @ssugiura to decide, of course. :)
@martin-ejdestig Yes, I feel your points as valid. Though as you mentioned, the library shouldn't be used from Android apps in principle; only dlt-logd-converter should be interested.
To keep the side effects (let's say Linux, QNX, or other variants) minimum, I would like to merge https://github.com/GENIVI/dlt-daemon/commit/fae6bf7608adcf61f71c3240d2e96c8f64135642 for now. I hope that's okay for you. Thanks a lot for your contribution anyways!
OK. Do you want me to close this pr or do you want me to rebase it after https://github.com/GENIVI/dlt-daemon/commit/fae6bf7608adcf61f71c3240d2e96c8f64135642 has been merged?
I am thinking that it perhaps makes sense to remove use of pthread_cancel() (and SIGUSR1) and align thread cleanup on all platforms after a release has been done or something (if you are worried about side effects).
@martin-ejdestig I feel it's reasonable to rebase this change after new release (I plan for this week) so that we have common solution for various variants.
Hello @martin-ejdestig , do you still have interest in getting this one merged? If so, please rebase it and we will work on merging it from there. Otherwise we will close your PR.
Kind regards, Michael
Well, I do not care personally. (And I no longer work for company for which patch was made.)
Feel free to rebase and merge yourself. Or close and keep hogging SIGUSR1... or whatever is done now on master.