deltachat-core-rust
deltachat-core-rust copied to clipboard
Event loop stopping is not thread-safe
Common usage of deltachat-ffi API, as described in the doxygen documentation, is to start an OS thread with event loop, pass a dc_context_t pointer to it and let it process events until dc_get_next_event returns NULL.
dc_get_next_event only returns NULL when the context is unreferenced, so when main thread wants to stop the event loop, e.g. because it received a SIGTERM or user closed the window, it has to call dc_context_unref. However, if event loop thread is in the middle of reacting to an event, it may use the context passed to it, which may be unreferenced and deallocated by the main thread at the same time.
One way to fix it on the FFI user side is to put the dc_context_t pointer behind mutex and lock it whenever an event arrives for the whole duration of event processing. Main thread wanting to unreference the context will have to lock the mutex, call dc_context_unref and set the pointer to NULL then, so event loop thread will discover that context was unreferenced after locking the mutex and avoid trying to use it. This is still a problematic solution, because remaining events will not be processed.
So a new API should be designed, such as dc_context_terminate function that stops I/O, ongoing processes and lets the event queue drain. After calling it, context should remain usable, so event loop thread should be able to queue messages in response to events remaining in the queue etc., but will exit eventually because dc_get_next_event will return NULL when all connections are closed and no more events are expected. After that, main thread should notice that event loop thread has terminated and unreference the context or let the event loop thread unreference it right before exiting.
With the current API and it's usage there is always a chance event loop thread will use unreferenced context and crash the app during the attempt to exit cleanly.
Another way to fix this in backward-compatible way is to guarantee that context is never actually deallocated until you drain the queue and finish processing the last event. So Event, whether staying in the queue or already handed over to the user but not deallocated via dc_event_unref should hold a "weak" reference to context, which will prevent it from deallocating but not from terminating.
This will still break if event processing results in some UI events, like Qt events, being scheduled and processed asynchronously in the main thread. They will try to use deallocated context then unless dc_event_t ownership is passed to them.
For this reason I still prefer API-breaking solution of requiring explicit termination so UI can keep it's own dc_context_t pointer without preventing termination of the event loop.
Doesn't the Rust API have an Arc<InnerContext> for this reason? The event processor can hang on to the inner context as long as it needs and when it gets a NULL event drop that. Now this isn't exposed to the FFI, but I think it's better to allow this mechanism rather than have to introduce a new mutex. I've not properly thought this through but maybe a dc_context_clone which creates a new dc_context_t which contains an arc of the same inner context?
A simple way to fix this: add thread-safe dc_event_emitter_close(dc_event_emitter_t* emitter). This will make event emitter to start returning NULL afterwards.
To terminate event loop from any thread, you close the event emitter. Event loop stops and at the end of event loop dc_event_emitter_unref() is called. Then main thread joins the event loop thread and calls dc_context_unref().
Users will have to replace all existing dc_context_unref() calls with dc_event_emitter_close() and add a dc_context_unref() somewhere before exit to free memory.
A simple way to fix this: add thread-safe
dc_event_emitter_close(dc_event_emitter_t* emitter). This will make event emitter to start returningNULLafterwards.To terminate event loop from any thread, you close the event emitter. Event loop stops and at the end of event loop
dc_event_emitter_unref()is called. Then main thread joins the event loop thread and callsdc_context_unref().Users will have to replace all existing
dc_context_unref()calls withdc_event_emitter_close()and add adc_context_unref()somewhere before exit to free memory.
this sounds like a good way to try improving. Are you up for implementing? i'd do the python part.
I opened a draft PR at #3285
In Python bindings a safe approach was implemented in #3289. Main thread manually signals to event loop that it should stop instead of relying on dc_get_next_event returning NULL, and then calls dc_stop_io() to interrupt blocking call to dc_get_next_event inside event loop.
Python fix: #3289 Node fix: #3430
I think this can be closed once this approach is documented in Doxygen.
we should probably then open an issue for android/ios, so that it is fixed/checked there as well.
EDIT link2xt: iOS https://github.com/deltachat/deltachat-ios/issues/1830 Android: https://github.com/deltachat/deltachat-android/issues/2515
I think this can be closed once this approach is documented in Doxygen.
I have opened a PR #4233 adding documentation to Doxygen.