erts halt may crash the emulator when dirty NIFs are still running
NIFs (especially those written in C++) make heavy use of `atexit` calls to destroy statically created singletons. ERTS does not wait for any dirty scheduler to stop executing the NIF (for a reason: NIFs may potentially block for a long time or even forever).
When `erlang:halt()` is called, dirty NIFs may still be running and accessing singletons that have already been deallocated by an `atexit` call. This leads to the emulator dumping core.
To Reproduce
Test case in this commit: https://github.com/max-au/otp/commit/438188cf5bd6f9ae6c3eca622790a6e39d64d2ee
Expected behavior
Dirty schedulers should be stopped (maybe not gracefully) before calling `dlclose` or `exit`. Similar code exists for async threads, but it's not implemented for dirty schedulers.
I'm a bit ambivalent regarding this, but currently I quite heavily lean towards this being an issue that the NIF should solve.
Deallocation of memory in an atexit handler seems like a waste of time to me since it will be gone anyway as soon as the runtime system has terminated, but I can imagine that there might be other more useful scenarios. However, if such code were to execute in any other multi-threaded environment it would have to synchronize accesses to such resources before termination, so I don't see why this code should not be responsible for that in this case as well. You could register an atexit handler in order to trigger this synchronization. Perhaps the runtime system could provide some functionality to make this easier though.
If we were to wait for all dirty NIFs to terminate before `halt()` completes, I suspect we would get bug reports of `halt()` not working as expected, since the runtime system won't terminate as users expect. If we could safely suspend or abruptly terminate the dirty scheduler threads it would be one thing, but I don't see how that would be possible. Dirty scheduler threads could very well have acquired locks which the atexit handlers will also acquire (for example a malloc mutex as in your example) and we would get a deadlock.
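The deadlock concern above can be illustrated with a small simulation. This is only a sketch: `heap_lock` stands in for something like a malloc mutex, the "dirty" thread is an ordinary `std::thread` parked while holding the lock, and `cleanup_would_block` is a hypothetical name. It uses `try_lock` so that the example reports the problem instead of actually hanging.

```cpp
#include <cassert>   // for the usage line shown below the block
#include <future>
#include <mutex>
#include <thread>

// Returns true if a cleanup handler that needs `heap_lock` would block
// while a parked "dirty" thread still owns it.
bool cleanup_would_block() {
    std::mutex heap_lock;  // stand-in for e.g. a malloc mutex (assumption)
    std::promise<void> owned, resume;
    std::future<void> owned_f = owned.get_future();
    std::future<void> resume_f = resume.get_future();

    // Simulated dirty scheduler thread: takes the lock, then is "suspended".
    std::thread dirty([&] {
        std::lock_guard<std::mutex> guard(heap_lock);
        owned.set_value();
        resume_f.wait();  // parked while still holding the lock
    });

    owned_f.wait();  // the dirty thread now owns the lock

    // Simulated atexit handler: a plain lock() here would never return
    // if the owning thread had been suspended or killed for good.
    bool acquired = heap_lock.try_lock();
    if (acquired) heap_lock.unlock();

    resume.set_value();  // let the simulated thread finish so we can join it
    dirty.join();
    return !acquired;
}
```

Calling `assert(cleanup_would_block());` should pass: as long as the owning thread never resumes, a handler that needs the same lock cannot make progress.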
The handling of async threads came about in order to solve port IO where data had been queued internally in the port, but not yet left the emulator. I don't think that was a good solution, and I am really not fond of `halt()` having to wait for all dirty NIFs to return before terminating the runtime system.
`dlclose` should not be an issue. We don't call it until all threads executing in the code have returned.
@rickard-green What are your thoughts on these potential solutions?
- Having a `stop` or `halt` callback for NIFs that would be analogous to Port Driver's `stop` function when `erlang:halt()` with `flush` is enabled. Additionally, if NIFs had the ability to block until all actively running dirty calls have completed, this would help to solve the issue without making it the default behavior for all NIFs.
- Change `erts_exit` to call `_exit` instead of `exit`, so that `atexit` callbacks are skipped entirely.
- Provide an additional option to `halt()` (maybe `flush_dirty`) that could be off by default and only enabled when NIF authors know they need it.
Port Drivers, on the other hand, have the `stop` callback when a `halt()` is triggered. As a temporary workaround, I wrote a Port Driver that calls `atexit()` in the `stop` callback. The callback for `atexit()` then calls `_exit(erts_halt_code)` in order to bypass running the remaining `atexit()` callbacks.
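The effect this workaround relies on can be shown without a port driver. The sketch below is POSIX-only and not the actual driver code: it forks a child that registers a cleanup handler, optionally registers a second handler that calls `_exit()`, and then calls `exit(0)`. The names `run_child`, `library_cleanup`, `bypass_remaining_handlers` and the status `kHaltCode` are all hypothetical stand-ins (the real workaround uses `erts_halt_code`).

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <utility>
#include <sys/wait.h>
#include <unistd.h>

namespace {
int g_report_fd = -1;     // write end of the pipe, set only in the child
const int kHaltCode = 7;  // hypothetical stand-in for the halt status

// Stands in for a library's cleanup handler (e.g. static teardown).
void library_cleanup() {
    assert(g_report_fd >= 0);
    const char msg[] = "cleanup;";
    ssize_t ignored = write(g_report_fd, msg, sizeof msg - 1);
    (void)ignored;
}

// Registered last, so it runs first. _exit() terminates immediately,
// so handlers registered earlier never run.
void bypass_remaining_handlers() { _exit(kHaltCode); }
}  // namespace

// Forks a child that registers the handlers and calls exit(0).
// Returns what the child's handlers wrote and the child's exit status.
std::pair<std::string, int> run_child(bool install_bypass) {
    int fds[2];
    if (pipe(fds) != 0) return {"", -1};
    std::fflush(nullptr);  // do not duplicate buffered output in the child
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return {"", -1};
    }
    if (pid == 0) {  // child
        close(fds[0]);
        g_report_fd = fds[1];
        std::atexit(library_cleanup);
        if (install_bypass) std::atexit(bypass_remaining_handlers);
        std::exit(0);  // runs atexit handlers in reverse registration order
    }
    close(fds[1]);  // parent: read until the child closes its end
    std::string out;
    char buf[64];
    ssize_t n;
    while ((n = read(fds[0], buf, sizeof buf)) > 0)
        out.append(buf, static_cast<size_t>(n));
    close(fds[0]);
    int status = 0;
    waitpid(pid, &status, 0);
    return {out, WIFEXITED(status) ? WEXITSTATUS(status) : -1};
}
```

With `install_bypass` set to `false` the cleanup handler runs and the child exits with status 0. With `true`, the late-registered handler ends the process first, so the cleanup output is empty and the exit status is the one passed to `_exit()`. The cost is the same as the one raised in this thread: every handler registered earlier, including legitimate cleanup, is skipped.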
The issue we've been running into is specifically with `atexit()` running due to C++ itself having this as defined behavior for static object destructors whenever `exit()` is called.
> @rickard-green What are your thoughts on these potential solutions?
> - Having a `stop` or `halt` callback for NIFs that would be analogous to Port Driver's `stop` function when `erlang:halt()` with `flush` is enabled. Additionally, if NIFs had the ability to block until all actively running dirty calls have completed, this would help to solve the issue without making it the default behavior for all NIFs.
Possibly. I'm not sure how the API for the blocking should look though. A `halt` callback could perhaps be useful for termination of any NIF threads as well, which might end up in the same situation as the ongoing NIF calls.
BTW, this should be possible to implement in the NIF library itself (with current functionality) using an additional atexit handler callback (working as a `halt` callback) together with a prolog code snippet and an epilog code snippet in the actual NIF functions which synchronize with the atexit callback. This is what I referred to when I wrote "You could register an atexit handler in order to trigger this synchronization". The atexit handler callbacks should be called in reverse installation order, at least on POSIX systems, so you need to make sure to install it after installation of the cleanup callbacks. This is more or less the same as what the runtime system would need to do in order to provide this functionality.
> - Change `erts_exit` to call `_exit` instead of `exit`, so that `atexit` callbacks are skipped entirely.
If someone is using atexit callbacks for cleanup, those NIFs would now lose their cleanup functionality.
> - Provide an additional option to `halt()` (maybe `flush_dirty`) that could be off by default and only enabled when NIF authors know they need it.
No, I think we should keep these details to the NIF libraries if we can.
Another alternative could be to declare, with a flag for each specific NIF function in its `ErlNifFunc` entry, that the runtime system is not allowed to halt while it is executing. I'm leaning towards that, in combination with a `halt` callback, being the best option.
I'm not sure whether it's possible to "not allow" halting. It could be that ERTS is writing a crash dump, or is halting with an error level set, because it's no longer possible to continue.
One other approach I can think of is forceful termination of dirty schedulers. I understand that it may potentially introduce some issues under some OSes (e.g. I can imagine some exotic OS not releasing the mutex owned by a cancelled thread), but it still feels like a safer alternative.
I've made a PR #6370 with functionality that can be used to solve problems like this.