cpp_client_telemetry
cpp_client_telemetry copied to clipboard
Make zombie loggers logic more robust
Describe your environment.
This issue is reproducible in one popular app on older models of iOS devices with slower processor.
Steps to reproduce.
Steps:
- application exiting.
- main thread is calling
FlushAndTeardown
. - at about the same time another thread is scheduled to perform logging on
ILogger
. - both clash with a deadlock in zombie logger protection code in
Logger::RecordShutdown()
method.
What is the expected behavior?
Well, it is expected that applications do not abuse the logging API that way.. At the same time we have some protection mechanism in place, to allow the safe use-after-free
. Just that protection mechanism is failing at extremely low rate, unique to the concurrent-use-during-free
.
What did you expect to see?
I expect:
- the app should avoid doing what it is doing.
- the zombie logger logic MAY be improved to handle this race condition / deadlock in zombie-logger protection code in a better way.
What is the actual behavior?
Deadlock and hang on app termination, hang in the fool-proof code that is supposed to prevent a crash due to use-after-free
. As of note, the code very reliably preventing the crash ... by hanging instead. Unfortunately that hang is eventually reported as a crash.
Additional context.
The crash rate right now is extremely low. It does not seem to affect newer devices.
I think we need to add the following stress test:
-
Initialize
/FlushTeardown
in a tight loop onLogManager
instance. - rogue thread(s) attempting to obtain loggers via
GetLogger
and log massive volumes of data Basic expectation here that the app should not crash after a 100,000 iterations like this. I am not sure if we can use some other fuzzy testing tools to artificially cause the deadlock.
Solution could be to perform timed-wait on mutex here: https://github.com/microsoft/cpp_client_telemetry/blob/a924650883ecfd44f12dba131ca117f502f372b9/lib/api/Logger.cpp#L948
And when we see that the timeout happened, we return status back, and we avoid doing anything on that ILogger
instance - discarding events that are timing out on that path.
I think that the Reinitialize
test in CI cross-plat is also accidentally tripping / hanging on that.. NOT very often, but does happen occasionally.
The issue is that the loggers map itself is a global static map, and its destructor may be called at the time when atexit
handlers are running. The force-termination scenario on iOS then may not rely on that zombie loggers map anymore.