Race condition in helper_thread_stopped_callback()
Please Tell Us About Your Operating System
- RHEL-10 / Fedora-41
- libreswan-5.1, 5.2 and 5.3
Please Describe Your IPsec Configuration
- No connections, default configuration
Please Tell Us What Happened?
In our CI we sometimes hit the following crash. Over the last year we saw ~150 occurrences during a ton of testing, but so far it has resisted all efforts to reproduce it reliably, so all we have are coredumps that all look the same:
> (gdb) thread apply all bt
Thread 2 (Thread 0x7f792939f940 (LWP 701914)):
#0 0x00007f792a004de8 in _fini () from /lib64/libgcc_s.so.1
#1 0x00007f792a5552f0 in _dl_fini () at dl-fini.c:120
#2 0x00007f7929e465b1 in __run_exit_handlers (status=0, listp=0x7f7929fd2680 <__exit_funcs>,
run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#3 0x00007f7929e46670 in __GI_exit (status=<optimized out>) at exit.c:138
#4 0x000055632f27517c in exit_epilogue () at whack_shutdown.c:229
#5 0x000055632f27538f in server_stopped_callback (r=<optimized out>) at whack_shutdown.c:310
#6 0x000055632f2d0b88 in run_server (conffile=0x7fffe9ab5ec6 "/etc/ipsec.conf", logger=0x55632f432a20
<global_logger>) at server.c:1077
#7 0x000055632f23ae00 in main (argc=<optimized out>, argv=<optimized out>) at plutomain.c:1554
Thread 1 (Thread 0x7f79281b36c0 (LWP 701927)):
#0 ___pthread_mutex_lock (mutex=mutex@entry=0x0) at pthread_mutex_lock.c:80
#1 0x00007f792a27999d in PR_Lock (lock=0x0) at pthreads/../../../../nspr/pr/src/pthreads/ptsynch.c:161
#2 0x00007f792a2838a4 in _pt_thread_death_internal (arg=0x7f7918000eb0, callDestructors=1) at
pthreads/../../../../nspr/pr/src/pthreads/ptthread.c:796
#3 0x00007f792a283a4b in _pt_thread_death (arg=0x7f7918000eb0) at
pthreads/../../../../nspr/pr/src/pthreads/ptthread.c:784
#4 0x00007f7929e956f1 in __GI___nptl_deallocate_tsd () at nptl_deallocate_tsd.c:73
#5 __GI___nptl_deallocate_tsd () at nptl_deallocate_tsd.c:22
#6 0x00007f7929e97ffb in start_thread (arg=<optimized out>) at pthread_create.c:456
#7 0x00007f7929f08afc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
What happens is that PR_Cleanup() cleans up thread resources while a helper thread is still alive, and once that thread attempts to touch a mutex the process SEGFAULTs because the cleanup has already happened.
It looks like when helper_thread_stopped_callback() runs it assumes the thread is gone, but it seems that this is not guaranteed. If I read it correctly, server_helpers_stopped_callback() is meant to be a barrier that makes sure all helper threads have terminated before proceeding to exit_epilogue() (where PR_Cleanup() happens). Since there is a possibility to pass context to helper_thread_stopped_callback(), can't it be used to pass the thread ID so that the callback can make sure the thread is really gone before incrementing `helper_threads_stopped`, and reschedule the callback otherwise?
Relevant log output
... where PR_Cleanup() and PR_Lock() are in NSS's NSPR.
> It looks like when helper_thread_stopped_callback() runs it assumes the thread is gone but it seems that this is not guaranteed.
No. Up until now Libreswan's concern has been limited to the helper thread being done with the helper queue and NSS. It wasn't too fussed about when, after that, the thread exited. And according to GLIBC's pthread_create(3) this is ok:

> The new thread terminates in one of the following ways: ... Any of the threads in the process calls exit(3), or the main thread performs a return from main(). This causes the termination of all threads in the process.
But it appears that NSPR has other ideas, and while it feels strongly about them, it isn't clear to me what they are.
My reading of PR_Cleanup()'s description:
/*
* Perform a graceful shutdown of NSPR. PR_Cleanup() may be called by
* the primordial thread near the end of the main() function.
*
* PR_Cleanup() attempts to synchronize the natural termination of
* process. It does that by blocking the caller, if and only if it is
* the primordial thread, until the number of user threads has dropped
* to zero. When the primordial thread returns from main(), the process
* will immediately and silently exit. That is, it will (if necessary)
* forcibly terminate any existing threads and exit without significant
* blocking and there will be no error messages or core files.
*
* PR_Cleanup() returns PR_SUCCESS if NSPR is successfully shutdown,
* or PR_FAILURE if the calling thread of this function is not the
* primordial thread.
*/
NSPR_API(PRStatus) PR_Cleanup(void);
is that the main thread calling:
NSS_Shutdown()
PR_Cleanup()
exit()
is correct. Only once the helper threads have all exited should PR_Cleanup() return.
Yet the backtrace suggests that, instead:
- the main thread's PR_Cleanup() call returned while a helper thread was still running
- the helper thread called PR_Cleanup() via a thread destructor (would you know how it was added? I don't think it was libreswan); that call should not have attempted a cleanup, and should instead have returned PR_FAILURE
@The-Mule if you agree with my analysis, can you file an upstream bug? Depending on that feedback we can look at a work-around - either a join or detach may work. (another data point is that this isn't turning up in libreswan's testing)
Hm, it looks like NSPR only cares for termination of its own "user" threads while it does not wait for termination of native "system" threads:
Enumerators
PR_USER_THREAD
PR_Cleanup blocks until the last thread of type PR_USER_THREAD terminates.
PR_SYSTEM_THREAD
NSPR ignores threads of type PR_SYSTEM_THREAD when determining when
a call to PR_Cleanup should return.
Description
Threads can be either user threads or system threads. NSPR allows the client to synchronize
the termination of all user threads and ignores those created as system threads. This
arrangement implies that a system thread should not have volatile data that needs to be safely
stored away. The applicability of system threads is somewhat dubious; therefore, they should
be used with caution.
https://www-archive.mozilla.org/projects/nspr/reference/html/prthrd.html#14937
At the same time:
Note
As of NSPR release v3.0, PR_AttachThread and PR_DetachThread are obsolete. A native
thread not created by NSPR is automatically attached the first time it calls an NSPR function,
and automatically detached when it exits.
https://www-archive.mozilla.org/projects/nspr/reference/html/prthrd.html#15218
All in all, I agree with your analysis - a helper thread is running _pt_thread_death_internal() after PR_Cleanup() returned. But based on the aforementioned documentation I don't think it is an NSS/NSPR bug; the documentation suggests that the application should take care of its own threads before calling PR_Cleanup().
At the same time I fully admit that this is a very (if not extremely) rare race condition. We hit it a few times a month during NetworkManager-libreswan stress testing, and even though I still have no clue what the contributing factors are, I don't think normal users will hit this.
Perhaps pthread_join() of helper threads in server_helpers_stopped_callback() wouldn't be a bad idea but I stumbled upon the following:
/*
* Repeatedly nudge the helper threads until they all exit.
*
* Note that pthread_join() doesn't work here: an any-thread join may
* end up joining an unrelated thread (for instance the CRL helper);
* and a specific thread join may block waiting for the wrong thread.
*/
static void (*server_helpers_stopped_callback)(void);
According to PR_GetThreadType() these helpers are PR_USER_THREADs (i.e., the PT_THREAD_SYSTEM bit isn't set):
| NSPR thinks I'm a PR_USER_THREAD thread
| NSPR thinks I'm a PR_GLOBAL_THREAD thread
| NSPR thinks I'm a PR_JOINABLE_THREAD thread
| detaching NSPR
Looking at pt_AttachThread(), it sets the hidden PT_THREAD_FOREIGN bit, and that does stop things:
- PR_JoinThread() rejects calls
- PT_DetachThread() calls _pt_thread_death_internal(), and that removes the thread from NSPR's internal structures
I've added a call to private/PT_DetachThread() (inside if-debug for now), to see what happens.
/*
* Repeatedly nudge the helper threads until they all exit.
*
* Note that pthread_join() doesn't work here: an any-thread join may
* end up joining an unrelated thread (for instance the CRL helper);
* and a specific thread join may block waiting for the wrong thread.
*/
I think my comment about any-thread join still holds: while the CRL helper thread is gone, there could be other threads (for instance, a thread to manage EAP or DNS). My comment about joining a specific thread was about code simply iterating through known threads.
Instead, if main only tries to join the thread that triggered the helper-stopped callback, then it can't block on the wrong thread.
More thoughts.
I suspect the most robust way to handle this is for libreswan to adopt the dogma that all threads are explicitly joined before shutting down.
While we can add PR_DetachThread() to the thread's exit path, there's still the chance that code inadvertently calls into NSS/NSPR, causing the thread to be re-attached.
Interesting, thanks for actually checking thread flags, I should have done that instead of deciphering missing bits in NSS documentation.
I am afraid that the already-obsolete PR_DetachThread() might go away in the future anyway. Perhaps we can add a flag to mark "nss threads" and only join these.
BTW, the CRL thread was eliminated (the PAM thread is long gone), so currently "nss threads" are all we have :-)
does changing:
/* all done; cleanup */
for (unsigned h = 0; h < helper_threads_started; h++) {
struct helper_thread *w = &helper_threads[h];
free_logger(&w->logger, HERE);
}
to
/* all done; cleanup */
for (unsigned h = 0; h < helper_threads_started; h++) {
struct helper_thread *w = &helper_threads[h];
pthread_join(w->pid, NULL);
free_logger(&w->logger, HERE);
}
help?
It's after all the threads have indicated that they are exiting
Yes, this is what I had in mind, let me check.
Perfect, we did the testing and I can confirm that v5.3 failed after 40 test iterations while v5.3+pthread_join() did not fail even after 500 iterations (and once we downgraded back to v5.3 it eventually crashed again in less than 100 iterations). Thanks, @cagney! It's a pity we did not get any help from nspr, but I suppose this is an even cleaner solution now.
I've pushed it, drum fingers ....