backward-cpp

Seems it's not working in the tbb task.

Open wendajiang opened this issue 1 year ago • 12 comments

My sample code is like this:

#include <cstdio>  // printf
#include <iostream>
#include "backward.hpp"
#include <oneapi/tbb.h>

int main() {
  backward::SignalHandling sh;

  int sum = oneapi::tbb::parallel_reduce(
      oneapi::tbb::blocked_range<int>(1,101), 0,
      [](oneapi::tbb::blocked_range<int> const& r, int init) -> int {
        for (int v = r.begin(); v != r.end(); v++) {
          init += v;
        }
        char *nn = nullptr;
        nn[3] = 'a';
        return init;
      },
      [](int lhs, int rhs) -> int {
        return lhs + rhs;
      }
  );

  printf("sum: %d\n", sum);


  std::cout << "hello\n";
  return 0;
}

Note the lines char *nn = nullptr; nn[3] = 'a'; in the code: they write through a null pointer, yet backward-cpp does not output the stack.

wendajiang avatar Jul 10 '23 12:07 wendajiang

What do you mean by cannot output? Please expand. What was the output you expected and what did you end up seeing instead?

Could it be that oneapi is setting its own signal handler?

bombela avatar Jul 10 '23 13:07 bombela

https://github.com/bombela/backward-cpp/issues/244#issuecomment-1209318978 I think it's likely this problem. When one thread receives the signal from the kernel, the other TBB threads continue executing and then receive the signal too (they also crash on the null pointer access). So the program exits directly.

What I expected is the normal stack output: image However, with the code above the console outputs nothing. image

Maybe I should write a custom SignalHandling class adapted to the multi-threaded scenario: at the beginning of sig_handler, stop all other threads in the process, and only then handle the signal.

wendajiang avatar Jul 11 '23 01:07 wendajiang

https://www.man7.org/linux/man-pages/man7/signal.7.html

Signal mask and pending signals

Each thread in a process has an independent signal mask, which indicates the set of signals that the thread is currently blocking. A thread can manipulate its signal mask using pthread_sigmask(3). In a traditional single-threaded application, sigprocmask(2) can be used to manipulate the signal mask.

   A thread-directed signal is one that is targeted at a specific
   thread.  A signal may be thread-directed because it was generated
   as a consequence of executing a specific machine-language
   instruction that triggered a hardware exception (e.g., SIGSEGV
   for an invalid memory access, or SIGFPE for a math error), or
   because it was targeted at a specific thread using interfaces
   such as [tgkill(2)](https://www.man7.org/linux/man-pages/man2/tgkill.2.html) or [pthread_kill(3)](https://www.man7.org/linux/man-pages/man3/pthread_kill.3.html).

   A thread can obtain the set of signals that it currently has
   pending using [sigpending(2)](https://www.man7.org/linux/man-pages/man2/sigpending.2.html).  This set will consist of the union
   of the set of pending process-directed signals and the set of
   signals pending for the calling thread.

By default threads accept all signals. The library is most likely setting the signal mask per thread.

bombela avatar Jul 11 '23 01:07 bombela
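To make the per-thread signal mask from the man page excerpt concrete, here is a minimal sketch (not from the thread). The helper names `is_blocked_in_this_thread`, `block_in_worker_only`, and `MaskReport` are hypothetical; only `pthread_sigmask`, `sigemptyset`, `sigaddset`, and `sigismember` are real POSIX calls. It blocks a signal inside one worker thread and shows that the calling thread's mask is unaffected. It assumes the caller has not already blocked the signal.

```cpp
#include <pthread.h>
#include <signal.h>
#include <thread>

// Returns true if `signo` is blocked in the calling thread's signal mask.
bool is_blocked_in_this_thread(int signo) {
  sigset_t current;
  sigemptyset(&current);
  // With a null "new set" pointer, pthread_sigmask only reports the mask.
  pthread_sigmask(SIG_BLOCK, nullptr, &current);
  return sigismember(&current, signo) == 1;
}

// Hypothetical result type: what each thread observed about its own mask.
struct MaskReport {
  bool worker_blocked;
  bool caller_blocked;
};

// Blocks `signo` in a worker thread only, then inspects both masks.
MaskReport block_in_worker_only(int signo) {
  MaskReport report{false, false};
  std::thread worker([&report, signo] {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, signo);
    // Affects only this worker thread's mask.
    pthread_sigmask(SIG_BLOCK, &set, nullptr);
    report.worker_blocked = is_blocked_in_this_thread(signo);
  });
  worker.join();
  report.caller_blocked = is_blocked_in_this_thread(signo);
  return report;
}
```

Calling `block_in_worker_only(SIGUSR1)` from a thread that has not blocked SIGUSR1 should report the signal as blocked in the worker and unblocked in the caller. Whether TBB actually changes the mask of its worker threads is not established in this thread.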

> By default threads accept all signals. The library is most likely setting the signal mask per thread.

I understand this. I tried replacing sig_handler with a simple one built on the system APIs backtrace and backtrace_symbols, and it works. I also ran the code above under gdb, single-stepping through it, and it works there too.

So I think the problem is what I described in my comment: when a multi-threaded program receives a signal, the kernel arbitrarily selects one thread to deliver it to and that thread triggers sig_handler, but the other threads continue to execute and crash again. That's not the expected behavior.

wendajiang avatar Jul 11 '23 02:07 wendajiang
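The replacement handler described in the comment above was not posted, so the following is only a sketch of what a handler built on glibc's `backtrace` might look like. The names `dump_stack_to_fd`, `simple_crash_handler`, and `install_simple_handler` are hypothetical. It uses `backtrace_symbols_fd` rather than `backtrace_symbols`, because the latter allocates its result with `malloc`. Note that `backtrace` itself is not guaranteed to be async-signal-safe.

```cpp
#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

// Writes the calling thread's stack frames to `fd`; returns the frame count.
int dump_stack_to_fd(int fd) {
  void* frames[64];
  const int count = backtrace(frames, 64);
  // Writes one line per frame directly to the descriptor, without malloc.
  backtrace_symbols_fd(frames, count, fd);
  return count;
}

// Hypothetical handler: print the stack, then die with the original signal.
extern "C" void simple_crash_handler(int signo) {
  dump_stack_to_fd(STDERR_FILENO);
  // Restore the default action and re-raise so the exit status (and any
  // core dump) reflects the original signal.
  signal(signo, SIG_DFL);
  raise(signo);
}

// Installs the handler for `signo`; returns true on success.
bool install_simple_handler(int signo) {
  struct sigaction action{};
  action.sa_handler = simple_crash_handler;
  sigemptyset(&action.sa_mask);
  action.sa_flags = 0;
  return sigaction(signo, &action, nullptr) == 0;
}
```

A program would call `install_simple_handler(SIGSEGV)` early in `main`. Symbol names in the output are only as good as the binary's dynamic symbol table (linking with `-rdynamic` usually helps).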

Finally, I found that deleting the SA_RESETHAND flag and adding a recursive mutex in the sig_handler function fixes the problem. Please review the code and let me know if there is a better solution to this problem (multiple threads of one program crashing at nearly the same time).

wendajiang avatar Jul 11 '23 06:07 wendajiang
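The patch itself is not reproduced here. As a hedged illustration of the same idea (serialize the handler and drop SA_RESETHAND), the sketch below swaps the mutex for a lock-free atomic flag, since locking a mutex is not on the POSIX list of async-signal-safe operations. This is a different technique from the one in the comment above, and `claim_crash_report`, `serialized_crash_handler`, and `install_serialized_handler` are hypothetical names.

```cpp
#include <atomic>
#include <signal.h>
#include <unistd.h>

namespace {
// Set by the first thread that enters the handler.
std::atomic<bool> g_reporting{false};
}  // namespace

// Returns true for the first caller only; every later caller gets false.
bool claim_crash_report() {
  return !g_reporting.exchange(true);
}

extern "C" void serialized_crash_handler(int signo) {
  if (!claim_crash_report()) {
    // Another thread is already reporting. Park this thread until the
    // reporting thread terminates the process.
    for (;;) pause();
  }
  // A real handler would print the stack trace here.
  static const char msg[] = "crash: reporting from the first faulting thread\n";
  const ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
  (void)ignored;
  // Fall back to the default action so the process dies with this signal.
  signal(signo, SIG_DFL);
  raise(signo);
}

// Installs the handler without SA_RESETHAND, so a second faulting thread
// also reaches the handler instead of the default action.
bool install_serialized_handler(int signo) {
  struct sigaction action{};
  action.sa_handler = serialized_crash_handler;
  sigemptyset(&action.sa_mask);
  action.sa_flags = 0;
  return sigaction(signo, &action, nullptr) == 0;
}
```

The design choice is that later-faulting threads never run the reporting code at all, which avoids both concurrent and re-entrant execution of the trace printer. Whether this matches the behavior of the actual patch is an assumption.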

Thanks for the code. If I understand correctly, it serializes the execution of the signal handler. In other words, the signal handler can never be executed concurrently on multiple threads, but instead, one by one. In your case, since it aborts after a SIGSEGV, only one will ever execute.

> So I think the problem is what I described in my comment: when a multi-threaded program receives a signal, the kernel arbitrarily selects one thread to deliver it to and that thread triggers sig_handler, but the other threads continue to execute and crash again. That's not the expected behavior.

For hardware exceptions like SIGSEGV, the documentation states that they are thread-directed, which means that the signal handler will only execute on the thread that triggered the fault. The kernel doesn't randomly pick a thread here.

You mentioned that multiple threads are segfaulting at the same time, and you say that it works with your code that serializes all invocations of the signal handler. But it also works fine if you call backtrace directly. I wonder if the issue is concurrent execution of backward-cpp and the various libraries that it calls.

bombela avatar Jul 11 '23 09:07 bombela
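The thread-directed behavior described above can be observed without crashing anything. The sketch below (not from the thread) directs SIGUSR1 at one specific thread with `pthread_kill` and records, inside the handler, whether it ran on that thread. The names `record_thread` and `handler_runs_on_targeted_thread` are hypothetical, and SIGUSR1 stands in for a hardware-generated SIGSEGV. `pthread_equal` is not on the POSIX async-signal-safe list, so its use in the handler is an assumption that holds for a plain handle comparison.

```cpp
#include <atomic>
#include <pthread.h>
#include <signal.h>
#include <thread>

namespace {
std::atomic<bool> g_handled{false};        // handler has run
std::atomic<bool> g_ran_on_target{false};  // handler ran on the worker
pthread_t g_target;                        // the worker's thread handle
}  // namespace

// Records whether the handler executes on the targeted thread.
extern "C" void record_thread(int) {
  g_ran_on_target.store(pthread_equal(pthread_self(), g_target) != 0);
  g_handled.store(true);
}

// Sends SIGUSR1 to one worker thread and reports where the handler ran.
bool handler_runs_on_targeted_thread() {
  struct sigaction action{};
  action.sa_handler = record_thread;
  sigemptyset(&action.sa_mask);
  if (sigaction(SIGUSR1, &action, nullptr) != 0) return false;

  std::atomic<bool> ready{false};
  std::thread worker([&ready] {
    g_target = pthread_self();
    ready.store(true);
    // Spin until the handler has run on some thread.
    while (!g_handled.load()) std::this_thread::yield();
  });
  while (!ready.load()) std::this_thread::yield();

  // Thread-directed delivery: only `worker` may run the handler.
  pthread_kill(worker.native_handle(), SIGUSR1);
  worker.join();
  return g_ran_on_target.load();
}
```

On Linux, `std::thread::native_handle()` returns the underlying `pthread_t`, which is what makes the `pthread_kill` call possible here.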

> For hardware exceptions like SIGSEGV, the documentation states that they are thread-directed, which means that the signal handler will only execute on the thread that triggered the fault. The kernel doesn't randomly pick a thread here.

But the strange result is that if I use std::mutex, a deadlock happens.

wendajiang avatar Jul 11 '23 09:07 wendajiang

> For hardware exceptions like SIGSEGV, the documentation states that they are thread-directed, which means that the signal handler will only execute on the thread that triggered the fault. The kernel doesn't randomly pick a thread here.
>
> But the strange result is that if I use std::mutex, a deadlock happens.

Sorry, that was a logic error in my test code. With only std::mutex added, it works correctly.

wendajiang avatar Jul 11 '23 09:07 wendajiang

> Thanks for the code. If I understand correctly, it serializes the execution of the signal handler. In other words, the signal handler can never be executed concurrently on multiple threads, but instead, one by one. In your case, since it aborts after a SIGSEGV, only one will ever execute.

PS. Deleting the SA_RESETHAND flag is also important, so that when multiple threads crash, each crash triggers sig_handler rather than the default core dump.

And raise(sigNo) inside the sig_handler function should be deleted, to avoid handling the signal in an infinite loop.

wendajiang avatar Jul 11 '23 09:07 wendajiang

Got it, thank you for investigating. I will have to spend some time on this.

bombela avatar Jul 11 '23 09:07 bombela

image

Maybe deleting SA_RESETHAND is a bad idea after all: with recursive_mutex it causes the signal handler to be re-entered, and when jemalloc is also in use, the crash report deadlocks.

wendajiang avatar Aug 11 '23 05:08 wendajiang

Also not working for me with OpenMP, might be related.

xgdgsc avatar Sep 11 '23 07:09 xgdgsc