erts halt may crash the emulator when dirty NIFs are still running

NIFs (especially those written in C++) make heavy use of atexit handlers to destroy statically created singletons. ERTS does not wait for dirty schedulers to finish executing NIFs (for a reason: NIFs may block for a long time or even forever). When erlang:halt() is called, dirty NIFs may still be running and accessing singletons that have already been deallocated by an atexit handler. This leads to the emulator dumping core.

To Reproduce: see the test case in this commit: https://github.com/max-au/otp/commit/438188cf5bd6f9ae6c3eca622790a6e39d64d2ee

Expected behavior: Dirty schedulers should be stopped (possibly not gracefully) before dlclose or exit is called. Similar code exists for async threads, but it is not implemented for dirty schedulers.

Oct 27 '21 05:10 max-au

I'm a bit ambivalent regarding this, but currently I quite heavily lean towards this being an issue that the NIF should solve.

Deallocation of memory in an atexit handler seems like a waste of time to me since it will be gone anyway as soon as the runtime system has terminated, but I can imagine that there might be other more useful scenarios. However, if such code were to execute in any other multi-threaded environment it would have to synchronize accesses to such resources before termination, so I don't see why this code should not be responsible for that in this case as well. You could register an atexit handler in order to trigger this synchronization. Perhaps the runtime system could provide some functionality to make this easier though.

If we were to wait for all dirty NIFs to terminate before halt() completes, I suspect we would get bug reports of halt() not working as expected, since the runtime system won't terminate when users expect it to. If we could safely suspend or abruptly terminate the dirty scheduler threads it would be one thing, but I don't see how that would be possible. Dirty scheduler threads could very well have acquired locks which the atexit handlers will also acquire (for example a malloc mutex, as in your example), and we would get a deadlock.

The handling of async threads came about in order to solve port IO where data had been queued internally in the port, but not yet left the emulator. I don't think that was a good solution, and I am really not fond of halt() having to wait for all dirty NIFs to return before terminating the runtime system.

dlclose should not be an issue. We don't call it until all threads executing in the code have returned.

Nov 01 '21 20:11 rickard-green

@rickard-green What are your thoughts on these potential solutions?

Having a stop or halt callback for NIFs that would be analogous to Port Driver's stop function when erlang:halt() with flush is enabled. Additionally, if NIFs had the ability to block until all actively running dirty calls have completed, this would help to solve the issue without making it the default behavior for all NIFs.
Change erts_exit to call _exit instead of exit, so that atexit callbacks are skipped entirely.
Provide an additional option to halt() (maybe flush_dirty) that could be off by default and only enabled when NIF authors know they need it.

Port Drivers, on the other hand, have the stop callback when a halt() is triggered. As a temporary workaround, I wrote a Port Driver that calls atexit() in the stop callback. The callback for atexit() then calls _exit(erts_halt_code) in order to bypass running the remaining atexit() callbacks.
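The core of that workaround can be sketched outside of ERTS. This is a minimal POSIX-only illustration, not the actual port driver: run_bypass_demo, cleanup_handler, bypass_handler and kHaltCode are hypothetical names, and kHaltCode merely stands in for the halt status. A forked child registers a cleanup handler and then a bypass handler; since handlers run in reverse registration order, the bypass handler runs first and calls _exit(), so the cleanup handler never runs.

```cpp
#include <cassert>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

namespace {

int g_report_fd = -1;         // write end of a pipe, used only by this demo
constexpr int kHaltCode = 7;  // hypothetical stand-in for the halt status

// Stands in for a static destructor or a library cleanup handler.
void cleanup_handler() {
    const char mark = 'C';
    ssize_t ignored = write(g_report_fd, &mark, 1);
    (void)ignored;
}

// Registered last, so it runs first; _exit() skips the handlers registered
// before it.
void bypass_handler() {
    _exit(kHaltCode);
}

}  // namespace

struct DemoResult {
    int exit_status;   // exit status of the child, or -1 on failure
    bool cleanup_ran;  // true if cleanup_handler() wrote to the pipe
};

DemoResult run_bypass_demo() {
    int fds[2];
    if (pipe(fds) != 0) return {-1, false};

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return {-1, false};
    }
    if (pid == 0) {
        // Child: install the handlers, then exit the "normal" way.
        close(fds[0]);
        g_report_fd = fds[1];
        std::atexit(cleanup_handler);
        std::atexit(bypass_handler);
        std::exit(0);  // runs bypass_handler first, which never returns
    }

    // Parent: a read of 0 bytes (EOF) means the cleanup handler never ran.
    close(fds[1]);
    char buf = 0;
    ssize_t n = read(fds[0], &buf, 1);
    close(fds[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return {-1, n > 0};
    int code = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    return {code, n > 0};
}
```

Calling run_bypass_demo() should report the child's exit status as kHaltCode with cleanup_ran false. The cost is the one raised elsewhere in this thread: any cleanup that other code registered with atexit() is skipped as well.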

The issue we've been running into is specifically that C++ itself defines static object destructors to run, through the same mechanism as atexit() callbacks, whenever exit() is called.

Nov 01 '21 22:11 potatosalad

> @rickard-green What are your thoughts on these potential solutions?
>
> Having a stop or halt callback for NIFs that would be analogous to Port Driver's stop function when erlang:halt() with flush is enabled. Additionally, if NIFs had the ability to block until all actively running dirty calls have completed, this would help to solve the issue without making it the default behavior for all NIFs.

Possibly. I'm not sure how the API for the blocking should look though. A halt callback could perhaps be useful for termination of any NIF threads as well which might end up in the same situation as the ongoing NIF calls.

BTW, this should be possible to implement in the NIF library itself (with current functionality) using an additional atexit handler (working as a halt callback) together with a prolog code snippet and an epilog code snippet in the actual NIF functions, which synchronize with the atexit handler. This is what I referred to when I wrote "You could register an atexit handler in order to trigger this synchronization". The atexit handlers should be called in reverse installation order, at least on POSIX systems, so you need to make sure to install this handler after the cleanup callbacks have been installed. This is more or less the same as what the runtime system would need to do in order to provide this functionality.
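A rough sketch of that prolog/epilog idea, in plain standard C++ with no dependency on erl_nif.h. All names here (HaltBarrier, halt_barrier, halt_barrier_atexit, demo_halt_waits_for_nif) are hypothetical, and this is only one possible way to do the synchronization, not something ERTS provides: each NIF calls enter() first and leave() on every return path, and the atexit handler blocks until no call is in flight.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical helper; none of these names come from erl_nif.h.
class HaltBarrier {
public:
    // Prolog: call at the top of a NIF. Returns false once halting has begun,
    // in which case the NIF must not touch state that cleanup handlers destroy.
    bool enter() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (halting_) return false;
        ++in_flight_;
        return true;
    }

    // Epilog: call on every return path of a NIF whose enter() returned true.
    void leave() {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(in_flight_ > 0);
        if (--in_flight_ == 0) drained_.notify_all();
    }

    // Halt callback: refuse new calls, then block until running ones finish.
    void wait_for_drain() {
        std::unique_lock<std::mutex> lock(mutex_);
        halting_ = true;
        drained_.wait(lock, [this] { return in_flight_ == 0; });
    }

    int in_flight() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return in_flight_;
    }

private:
    mutable std::mutex mutex_;
    std::condition_variable drained_;
    int in_flight_ = 0;
    bool halting_ = false;
};

// Intentionally leaked so the barrier itself is never destroyed at exit.
HaltBarrier& halt_barrier() {
    static HaltBarrier* barrier = new HaltBarrier();
    return *barrier;
}

// atexit-compatible wrapper. Register it AFTER the cleanup it must precede,
// because handlers run in reverse registration order.
extern "C" void halt_barrier_atexit() {
    halt_barrier().wait_for_drain();
}

// Self-check: one simulated dirty NIF call is in flight when halting begins.
bool demo_halt_waits_for_nif() {
    HaltBarrier barrier;
    if (!barrier.enter()) return false;  // prolog
    std::thread nif([&barrier] {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        barrier.leave();                 // epilog
    });
    barrier.wait_for_drain();            // what the atexit handler would do
    const bool drained = barrier.in_flight() == 0;
    const bool refused = !barrier.enter();
    nif.join();
    return drained && refused;
}
```

In a real NIF library the registration would presumably be a call such as std::atexit(halt_barrier_atexit) made from the load callback, after the singletons it protects have been constructed. Note that this design has the trade-off discussed above: if a dirty NIF never returns, the handler blocks and the runtime system does not terminate.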

> Change erts_exit to call _exit instead of exit, so that atexit callbacks are skipped entirely.

If someone is using atexit callbacks for cleanup, those NIFs would now lose their cleanup functionality.

> Provide an additional option to halt() (maybe flush_dirty) that could be off by default and only enabled when NIF authors know they need it.

No, I think we should keep these details to the NIF libraries if we can.

Another alternative could be to declare, with a flag for each specific NIF function in its ErlNifFunc entry, that the runtime system is not allowed to halt while that function is executing. I'm leaning towards that, in combination with a halt callback, being the best option.

Nov 02 '21 01:11 rickard-green

I'm not sure whether it's possible to "not allow" halting. It could be that ERTS is writing a crash dump, or is halting with an error level set, because it's no longer possible to continue.

One other approach I can think of is forceful termination of dirty schedulers. I understand that it may introduce issues on some operating systems (e.g. I can imagine some exotic OS not releasing a mutex owned by a cancelled thread), but it still feels like a safer alternative.

Nov 19 '21 02:11 max-au

I've made a PR #6370 with functionality that can be used to solve problems like this.

Oct 14 '22 12:10 rickard-green
