ng-monitoring icon indicating copy to clipboard operation
ng-monitoring copied to clipboard

Continuous Profiling should be disabled by default until tiflash can be safely profiled

Open mornyx opened this issue 3 years ago • 3 comments

See: https://github.com/pingcap/tiflash/issues/5687

Continuous Profiling has been enabled by default since v6.1.0, so TiDB/TiKV/TiFlash will continue to trigger CPU profiling by default since v6.1.0. The CPU profiler of TiKV/TiFlash is pprof-rs. pprof-rs registers a signal handler during CPU profiling and periodically triggers the SIGPROF signal. After the SIGPROF signal is triggered, the signal handler is dispatched to the business thread for execution to sample the call stack of the current thread. Signal handler also makes system calls through glibc during execution, so it has a chance to affect errno. After the signal handler is executed, the business thread may get the wrong errno. However, errno has been protected in the current version of pprof-rs, so there is still no key evidence that the errno was modified by pprof-rs. The current judgment is based on two facts:

  1. The problem only occurs while profiling is on
  2. pprof-rs has a very similar issue

So, Continuous Profiling should be disabled by default until tiflash can be safely profiled.

mornyx avatar Aug 25 '22 09:08 mornyx

/assign

mornyx avatar Aug 25 '22 09:08 mornyx

Is it possible to avoid "Signal handler also makes system calls"?

breezewish avatar Aug 26 '22 03:08 breezewish

Is it possible to avoid "Signal handler also makes system calls"?

There are currently two choices for stack backtracking: libunwind-based and frame-pointer-based.

In the past we were based on libunwind by default. In this case, it is unavoidable to affect errno. Here is an example:

fn main() {
    unsafe {
        println!("before backtrace: {}", *libc::__errno_location());
    }
    let _ = backtrace::Backtrace::new();
    unsafe {
        println!("after backtrace: {}", *libc::__errno_location());
    }
}

The output is:

before backtrace: 0
after backtrace: 2

Now we are based on frame-pointer by default, which ideally does not need to depend on system calls. Unfortunately, we cannot guarantee that all dynamic link libraries have frame pointers turned on, so a protection mechanism needs to be introduced. This protection mechanism needs to depend on system calls.

However, since the root cause of the problem on profiling on TiFlash is still unclear, we cannot determine whether the signal handler affects errno for the same reason.

And, I may try to introduce eBPF in the near future to improve the performance and stability of profiling, refer to https://github.com/tikv/tikv/issues/13336.

mornyx avatar Aug 29 '22 04:08 mornyx