deltachat-core-rust icon indicating copy to clipboard operation
deltachat-core-rust copied to clipboard

Event loop stopping is not thread-safe

Open link2xt opened this issue 4 years ago • 8 comments

Common usage of deltachat-ffi API, as described in the doxygen documentation, is to start an OS thread with event loop, pass a dc_context_t pointer to it and let it process events until dc_get_next_event returns NULL.

dc_get_next_event only returns NULL when the context is unreferenced, so when main thread wants to stop the event loop, e.g. because it received a SIGTERM or user closed the window, it has to call dc_context_unref. However, if event loop thread is in the middle of reacting to an event, it may use the context passed to it, which may be unreferenced and deallocated by the main thread at the same time.

One way to fix it on the FFI user side is to put the dc_context_t pointer behind mutex and lock it whenever an event arrives for the whole duration of event processing. Main thread wanting to unreference the context will have to lock the mutex, call dc_context_unref and set the pointer to NULL then, so event loop thread will discover that context was unreferenced after locking the mutex and avoid trying to use it. This is still a problematic solution, because remaining events will not be processed.

So a new API should be designed, such as dc_context_terminate function that stops I/O, ongoing processes and lets the event queue drain. After calling it, context should remain usable, so event loop thread should be able to queue messages in response to events remaining in the queue etc., but will exit eventually because dc_get_next_event will return NULL when all connections are closed and no more events are expected. After that, main thread should notice that event loop thread has terminated and unreference the context or let the event loop thread unreference it right before exiting.

With the current API and it's usage there is always a chance event loop thread will use unreferenced context and crash the app during the attempt to exit cleanly.

link2xt avatar Mar 04 '21 12:03 link2xt

Another way to fix this in backward-compatible way is to guarantee that context is never actually deallocated until you drain the queue and finish processing the last event. So Event, whether staying in the queue or already handed over to the user but not deallocated via dc_event_unref should hold a "weak" reference to context, which will prevent it from deallocating but not from terminating.

This will still break if event processing results in some UI events, like Qt events, being scheduled and processed asynchronously in the main thread. They will try to use deallocated context then unless dc_event_t ownership is passed to them.

For this reason I still prefer API-breaking solution of requiring explicit termination so UI can keep it's own dc_context_t pointer without preventing termination of the event loop.

link2xt avatar Mar 21 '21 11:03 link2xt

Doesn't the Rust API have an Arc<InnerContext> for this reason? The event processor can hang on to the inner context as long as it needs and when it gets a NULL event drop that. Now this isn't exposed to the FFI, but I think it's better to allow this mechanism rather than have to introduce a new mutex. I've not properly thought this through but maybe a dc_context_clone which creates a new dc_context_t which contains an arc of the same inner context?

flub avatar Mar 27 '21 12:03 flub

A simple way to fix this: add thread-safe dc_event_emitter_close(dc_event_emitter_t* emitter). This will make event emitter to start returning NULL afterwards.

To terminate event loop from any thread, you close the event emitter. Event loop stops and at the end of event loop dc_event_emitter_unref() is called. Then main thread joins the event loop thread and calls dc_context_unref().

Users will have to replace all existing dc_context_unref() calls with dc_event_emitter_close() and add a dc_context_unref() somewhere before exit to free memory.

link2xt avatar Dec 06 '21 11:12 link2xt

A simple way to fix this: add thread-safe dc_event_emitter_close(dc_event_emitter_t* emitter). This will make event emitter to start returning NULL afterwards.

To terminate event loop from any thread, you close the event emitter. Event loop stops and at the end of event loop dc_event_emitter_unref() is called. Then main thread joins the event loop thread and calls dc_context_unref().

Users will have to replace all existing dc_context_unref() calls with dc_event_emitter_close() and add a dc_context_unref() somewhere before exit to free memory.

this sounds like a good way to try improving. Are you up for implementing? i'd do the python part.

hpk42 avatar May 03 '22 17:05 hpk42

I opened a draft PR at #3285

link2xt avatar May 03 '22 18:05 link2xt

In Python bindings a safe approach was implemented in #3289. Main thread manually signals to event loop that it should stop instead of relying on dc_get_next_event returning NULL, and then calls dc_stop_io() to interrupt blocking call to dc_get_next_event inside event loop.

link2xt avatar May 17 '22 08:05 link2xt

Python fix: #3289 Node fix: #3430

I think this can be closed once this approach is documented in Doxygen.

link2xt avatar Jun 13 '22 20:06 link2xt

we should probably then open an issue for android/ios, so that it is fixed/checked there as well.

EDIT link2xt: iOS https://github.com/deltachat/deltachat-ios/issues/1830 Android: https://github.com/deltachat/deltachat-android/issues/2515

r10s avatar Jun 14 '22 07:06 r10s

I think this can be closed once this approach is documented in Doxygen.

I have opened a PR #4233 adding documentation to Doxygen.

link2xt avatar Mar 26 '23 17:03 link2xt