ng-monitoring
ng-monitoring copied to clipboard
Continuous Profiling should be disabled by default until tiflash can be safely profiled
See: https://github.com/pingcap/tiflash/issues/5687
Continuous Profiling has been enabled by default since v6.1.0, so TiDB/TiKV/TiFlash will continue to trigger CPU profiling by default since v6.1.0. The CPU profiler of TiKV/TiFlash is pprof-rs. pprof-rs registers a signal handler during CPU profiling and periodically triggers the SIGPROF signal. After the SIGPROF signal is triggered, the signal handler is dispatched to the business thread for execution to sample the call stack of the current thread. Signal handler also makes system calls through glibc during execution, so it has a chance to affect errno. After the signal handler is executed, the business thread may get the wrong errno. However, errno has been protected in the current version of pprof-rs, so there is still no key evidence that the errno was modified by pprof-rs. The current judgment is based on two facts:
- The problem only occurs while profiling is on
- pprof-rs has a very similar issue
So, Continuous Profiling should be disabled by default until tiflash can be safely profiled.
/assign
Is it possible to avoid "Signal handler also makes system calls"?
Is it possible to avoid "Signal handler also makes system calls"?
There are currently two choices for stack backtracking: libunwind-based and frame-pointer-based.
In the past we were based on libunwind by default. In this case, it is unavoidable to affect errno. Here is an example:
fn main() {
unsafe {
println!("before backtrace: {}", *libc::__errno_location());
}
let _ = backtrace::Backtrace::new();
unsafe {
println!("after backtrace: {}", *libc::__errno_location());
}
}
The output is:
before backtrace: 0
after backtrace: 2
Now we are based on frame-pointer by default, which ideally does not need to depend on system calls. Unfortunately, we cannot guarantee that all dynamic link libraries have frame pointers turned on, so a protection mechanism needs to be introduced. This protection mechanism needs to depend on system calls.
However, since the root cause of the problem on profiling on TiFlash is still unclear, we cannot determine whether the signal handler affects errno for the same reason.
And, I may try to introduce eBPF in the near future to improve the performance and stability of profiling, refer to https://github.com/tikv/tikv/issues/13336.