cpp_client_telemetry icon indicating copy to clipboard operation
cpp_client_telemetry copied to clipboard

Modern lifetime + ownership in the SDK

Open larvacea opened this issue 5 years ago • 2 comments

Use after free: motivation for this proposal

Clients continue to encounter use-after-free crashes in the SDK. The most common culprit is the log manager's configuration store, but other structures sometimes cause these crashes. I would like to reduce or eliminate these crashes by moving to a more modern RAII + ownership model based on the STL smart pointers std::unique_ptr and std::shared_ptr.

Proposed object lifetime changes

  • Deprecate shared global state (the LogManager instance that is not really an instance). Possibly make the static global LogManager a facade over a singleton dynamic log manager.
  • Return std::shared_ptr smart pointers rather than naked pointers or references in the API.
  • Use std::shared_ptr internally, so for instance a logger takes a shared pointer to its log manager, and thus guarantees that the log manager will not tear down while the logger is live. Pending background tasks have a shared pointer to the structures they need, so those do not tear down until the task runs.

This is a modern paradigm for asynchronous or concurrent execution and object lifetime. In my experience, without an explicit contract between these components use-after-free is inevitable.

Pause and Teardown Contracts

As we have seen on iOS, process lifetime and state is no longer the classic desktop Linux/Windows model, and we need to accommodate these demands. In particular, we need to support the scenarios in #753 and #754. To that end, it would be delightful to give client code a reliable pause and flush mechanism, including asynchronous notification when all is idle and ready for the OS to suspend or terminate the application.

SDK state ownership

The log manager should own its configuration map and any other state pertaining to it. It should not hand out mutable access to its configuration map: many (perhaps all) configuration values should not be changed after initialization. If client code passes a configuration map in initialization, the log manager should deep-copy the supplied map to its own configuration map. Since initialization is an uncommon event we can live with the costs of deep copy (memory allocations and CPU time) and we don't need to deal with COW or similar strategies to optimize this process.

larvacea avatar Apr 07 '21 19:04 larvacea

asynchronous notification when all is idle and ready for the OS to suspend or terminate the application.

This appears to be Mobile Platform -specific callback. That code should live in the app, not in SDK? For Windows Universal - this example here illustrated how existing PauseTransmission + Flush can be used:

https://github.com/microsoft/cpp_client_telemetry/blob/7e26c3964ad19186eff2be1026eabf7e65090fff/examples/cpp/SampleCppUWP/App.xaml.cpp#L115

OnSuspending - calls Flush in this case. Ideally should be also suspending the SDK threads, such as pending HTTP uploads. Pause - could be augmented with an optional parameter, to make the Pause call synchronous, i.e. ensure that all outstanding requests have been aborted. Basically, Pause - having logic similar to FlushAndTeardown (commit or timeout, using configurable interval), bringing LogManager instance into "safely suspended" state without destroying the instance.

Conceptually what you are proposing is similar to this ask: #271 - that is settling the SDK in a state where it does not know if it's gonna resume or not, but suspends whatever activities, HTTP stack, running threads, ensuring that all pending data have been committed to storage. This, however, also implies that the customers should not try to continue using the ILogger instances when the SDK is in suspended state. Because emitting data may be unsafe, i.e. we are done flushing and we don't really know if that data they are still emitting is gonna be reliably persisted. Most frequent issue is that the customers continue using ILogger when the app is suspending. It's the app itself that needs to orchestrate / coordinate that usage of loggers. Since usually the app layer (not SDK) firsthand knows that it is being suspended, in the app OS-specific callback code, i.e. - app gets suspending or terminating notification. And the app should consolidate / join whatever threads that it is running.

The most common culprit is the log manager's configuration store..

I don't think it's a culprit per se. Current design assumes that every LogManagerX singleton instance operates on ILogConfiguration configuration object that is (usually) - global static. And the configuration object - including all references to const char * that it carries within the map, always outlives the LogManagerX. Also the LogManager has Configure method (reserved for future use) that should allow to safely reconfigure the instance in a thread-safe manner, should the config object contents change: https://github.com/microsoft/cpp_client_telemetry/blob/a924650883ecfd44f12dba131ca117f502f372b9/lib/include/public/LogManagerBase.hpp#L670

The log manager should deep-copy the supplied map to its own configuration map

I agree with this. This is probably the only immediate change needed to alleviate the issues. That way the customers won't have to care much about managing the lifecycle of the original configuration object that they are passing to SDK by-reference. This improvement would require a transform-copy of const char * to std::string here: https://github.com/microsoft/cpp_client_telemetry/blob/a924650883ecfd44f12dba131ca117f502f372b9/lib/include/public/LogManagerBase.hpp#L236

Then once copied, the customer may continue doing frivolous things on original config object, such as destroying it, or destroying individual string values of it. C-string values are probably the most problematic ones, since SDK assumes they are there for the entire duration of the process run. But in reality - these C-string values may be just pointers to temporary Obj-C string objects, that may go out-of-scope and subsequently cause some grief to SDK, that presently assumes that no one is gonna ever pull the config object values from under it prior to calling FlushAndTeardown.

maxgolov avatar Apr 12 '21 20:04 maxgolov

I'd suggest we use this issue to work on transition of API return values from raw pointers to smart pointers.

Top candidate methods for immediate refactor:

  • virtual ILogger* GetLogger(...) - should use shared_ptr to represent partial ownership semantics. Different components may obtain the same logger.

  • LogSessionData* GetLogSessionData() - should return a reference, with methods on the object providing thread-safe update mechanisms.

  • ILogController* GetLogController(void) - should return a reference.

  • IAuthTokensController* GetAuthTokensController() - should return a reference.

Maybe we can start with an approach that takes the current LogManagerBase.hpp , moves it MAT_NS_v4 namespace. And implements the necessary smart pointer API. At the same time making the LogManager an instance object instead of a singleton. Implementing API changes, to make API look good - before further refactoring the rest of core impl under the hood.

maxgolov avatar Jun 29 '21 18:06 maxgolov