dlt-daemon icon indicating copy to clipboard operation
dlt-daemon copied to clipboard

Fix all processes using libdlt on Android terminating with SIGKILL

Open martin-ejdestig opened this issue 5 years ago • 10 comments
trafficstars

Android does not support pthread_cancel(), and never will, see https://android.googlesource.com/platform/bionic/+/master/docs/status.md . dlt_stop_threads() therefor tries to use pthread_kill() with SIGKILL to stop the housekeeping thread. This does not work as intended though. It forcefully kills the whole process, resulting in all processes using DLT on Android being terminated with SIGKILL.

Fix by using a pipe instead to wake housekeeping thread from poll() when it is time to exit. There are better mechanisms available on Linux, e.g. eventfd(), but use a pipe to limit differences between different platforms. pipe2() is used on Linux instead of pipe() to prevent O_CLOEXEC/execv() race in forking processes (_GNU_SOURCE needs to be defined on Android).

Even though DLT_NETWORK_TRACE_ENABLE is not supported on Android (no message queue implementation), make it so pthread_cancel() is not used there either by using a simple boolean variable. Allows for complete removal of dlt_user_cleanup_handler().

Modifications in dlt_user_log_check_user_message() look larger than they are due to removal of "if (fd != DLT_FD_INIT.." block. Want to call poll regardless now to always poll pipe. pollfd events with fd < 0 are ignored, so it is OK to always pass the pollfd entry for incoming messages.

Should also note that prior to this commit, poll() was called with DLT_USER_RECEIVE_NDELAY which is in ns while poll() expects ms. So timeout passed to poll() was (500 * 10^(6-3) / 3600)s ~= 138.8h when building with FIFO for local IPC. Might as well always pass -1.

martin-ejdestig avatar Nov 10 '20 15:11 martin-ejdestig

Sorry for the many updates. I should have checked .travis.yml earlier to see what was actually done when checking pr, would have minimized updates a bit. All tests should pass now though.

martin-ejdestig avatar Nov 12 '20 13:11 martin-ejdestig

Rebased to latest master.

martin-ejdestig avatar Dec 14 '20 11:12 martin-ejdestig

@martin-ejdestig Sorry for my late response. Actually we have seen the same issue and have a solution but with different way to terminate thread via SIGUSR1. Can you take a look into following commit if this also solves? https://github.com/GENIVI/dlt-daemon/pull/279/commits/fae6bf7608adcf61f71c3240d2e96c8f64135642

ssugiura avatar Dec 15 '20 00:12 ssugiura

Sure, I can have a look.

I just checked quickly and wonder a bit if using SIGUSR1 is safe. E.g. if the application uses SIGUSR1 for something it can still be delivered to DLT's housekeeping thread?

martin-ejdestig avatar Dec 15 '20 08:12 martin-ejdestig

Hello @martin-ejdestig ,

I would agree with you that it could be a case if dlt application could also define SIGUSR1 itself. I think this implementation should be stated clearly in documentation so that dlt application could be aware of that.

And for your information, there are some links that I referred to: https://stackoverflow.com/questions/24891552/how-to-kill-a-child-thread-in-c-android-ndk https://gist.github.com/dram/5683646

Thanh

thanhbnq avatar Dec 15 '20 09:12 thanhbnq

  • I think it would be better if a library does not "occupy" SIGUSR1. Nothing for application authors to care about. No documentation to read. :) And what do you do if you need to use SIGUSR1, say e.g. another library uses it. House keeping thread can be unexplicably stopped "randomly". There is no way to catch this until you notice problems or happen to read documentation/code. (SO post you linked to even says Dalvik uses SIGUSR1, so SIGUSR2 needs to be used for problem poster had. Not relevant here perhaps, but just as an example of "problem" when lib/framework use SIGUSR1. :)
  • Exiting is also not "immediate" with a poll() timeout. (Perhaps not relevant in practice.)
  • pthread_cancel() is really not a good API (see e.g. the reasons why it will never be implemented in Android).
  • Different code path on Android compared to other Linux systems, so it will not be exercised as much.

For those reasons, I do not think using SIGUSR1, nor pthread_cancel() for that matter, is a good idea. But in practice, it maybe does not matter, since DLT on Android is not widely used and perhaps never will be except for as in dlt-logd-converter.

But I do not know. I "vote" for not using SIGUSR1 but it is up to @ssugiura to decide, of course. :)

martin-ejdestig avatar Dec 15 '20 10:12 martin-ejdestig

@martin-ejdestig Yes, I feel your points as valid. Though as you mentioned, the library shouldn't be used from Android apps in principle; only dlt-logd-converter should be interested. To keep the side effects (let's say Linux, QNX, or other variants) minimum, I would like to merge https://github.com/GENIVI/dlt-daemon/commit/fae6bf7608adcf61f71c3240d2e96c8f64135642 for now. I hope that's okay for you. Thanks a lot for your contribution anyways!

ssugiura avatar Dec 18 '20 01:12 ssugiura

OK. Do you want me to close this pr or do you want me to rebase it after https://github.com/GENIVI/dlt-daemon/commit/fae6bf7608adcf61f71c3240d2e96c8f64135642 has been merged?

martin-ejdestig avatar Dec 18 '20 08:12 martin-ejdestig

I am thinking that it perhaps makes sense to remove use of pthread_cancel() (and SIGUSR1) and align thread cleanup on all platforms after a release has been done or something (if you are worried about side effects).

martin-ejdestig avatar Dec 18 '20 09:12 martin-ejdestig

@martin-ejdestig I feel it's reasonable to rebase this change after new release (I plan for this week) so that we have common solution for various variants.

ssugiura avatar Dec 21 '20 08:12 ssugiura

Hello @martin-ejdestig , do you still have interest in getting this one merged? If so, please rebase it and we will work on merging it from there. Otherwise we will close your PR.

Kind regards, Michael

michael-methner avatar Oct 12 '22 08:10 michael-methner

Well, I do not care personally. (And I no longer work for company for which patch was made.)

Feel free to rebase and merge yourself. Or close and keep hogging SIGUSR1... or whatever is done now on master.

martin-ejdestig avatar Oct 12 '22 15:10 martin-ejdestig