cpp_client_telemetry Make zombie loggers logic more robust

Describe your environment.

This issue is reproducible in one popular app on older models of iOS devices with slower processor.

Steps to reproduce.

Steps:

application exiting.
main thread is calling FlushAndTeardown.
at about the same time another thread is scheduled to perform logging on ILogger.
both clash with a deadlock in zombie logger protection code in Logger::RecordShutdown() method.

What is the expected behavior?

Well, it is expected that applications do not abuse the logging API that way.. At the same time we have some protection mechanism in place, to allow the safe use-after-free. Just that protection mechanism is failing at extremely low rate, unique to the concurrent-use-during-free.

What did you expect to see?

I expect:

the app should avoid doing what it is doing.
the zombie logger logic MAY be improved to handle this race condition / deadlock in zombie-logger protection code in a better way.

What is the actual behavior?

Deadlock and hang on app termination, hang in the fool-proof code that is supposed to prevent a crash due to use-after-free. As of note, the code very reliably preventing the crash ... by hanging instead. Unfortunately that hang is eventually reported as a crash.

Additional context.

The crash rate right now is extremely low. It does not seem to affect newer devices.

I think we need to add the following stress test:

Initialize / FlushTeardown in a tight loop on LogManager instance.
rogue thread(s) attempting to obtain loggers via GetLogger and log massive volumes of data Basic expectation here that the app should not crash after a 100,000 iterations like this. I am not sure if we can use some other fuzzy testing tools to artificially cause the deadlock.

Solution could be to perform timed-wait on mutex here: https://github.com/microsoft/cpp_client_telemetry/blob/a924650883ecfd44f12dba131ca117f502f372b9/lib/api/Logger.cpp#L948

And when we see that the timeout happened, we return status back, and we avoid doing anything on that ILogger instance - discarding events that are timing out on that path.

Apr 29 '21 22:04 maxgolov

I think that the Reinitialize test in CI cross-plat is also accidentally tripping / hanging on that.. NOT very often, but does happen occasionally.

May 06 '21 19:05 maxgolov

The issue is that the loggers map itself is a global static map, and its destructor may be called at the time when atexit handlers are running. The force-termination scenario on iOS then may not rely on that zombie loggers map anymore.

Jul 02 '21 20:07 maxgolov

cpp_client_telemetry cpp_client_telemetry copied to clipboard

Make zombie loggers logic more robust

cpp_client_telemetry
cpp_client_telemetry copied to clipboard