flux-core
broker stuck at exit in `zmq_ctx_term()`
@jameshcorbett reported that a flux-coral2 CI test is failing with the most recent flux-core and flux-sched, with the following log:
ok 1 - job-manager: load dws-jobtap plugin
PASS: t1000-dws-dependencies.t 1 - job-manager: load dws-jobtap plugin
ok 2 - job-manager: dws jobtap plugin works when coral2_dws is absent
PASS: t1000-dws-dependencies.t 2 - job-manager: dws jobtap plugin works when coral2_dws is absent
ok 3 - job-manager: dws jobtap plugin works when RPCs succeed
PASS: t1000-dws-dependencies.t 3 - job-manager: dws jobtap plugin works when RPCs succeed
ok 4 - job-manager: dws jobtap plugin works when creation RPC fails
PASS: t1000-dws-dependencies.t 4 - job-manager: dws jobtap plugin works when creation RPC fails
ok 5 - job-manager: dws jobtap plugin works when setup RPC fails
PASS: t1000-dws-dependencies.t 5 - job-manager: dws jobtap plugin works when setup RPC fails
ok 6 - job-manager: dws jobtap plugin works when post_run RPC fails
PASS: t1000-dws-dependencies.t 6 - job-manager: dws jobtap plugin works when post_run RPC fails
ok 7 - job-manager: dws jobtap plugin works when job hits exception during prolog
PASS: t1000-dws-dependencies.t 7 - job-manager: dws jobtap plugin works when job hits exception during prolog
# passed all 7 test(s)
flux-start: 0 (pid 8494) Killed
1..7
Note that all tests pass, but the broker ends up force-terminated with SIGKILL.
We ran the individual test in a docker container and attached with gdb. This is where the broker is stuck:
(gdb) where
#0 0x00007f9ef95390e1 in poll () from /lib64/libc.so.6
#1 0x00007f9efa92861e in zmq::signaler_t::wait(int) const ()
from /lib64/libzmq.so.5
#2 0x00007f9efa904cc4 in zmq::mailbox_t::recv(zmq::command_t*, int) ()
from /lib64/libzmq.so.5
#3 0x00007f9efa8f09c4 in zmq::ctx_t::terminate() () from /lib64/libzmq.so.5
#4 0x00007f9efa94982e in zmq_ctx_term () from /lib64/libzmq.so.5
#5 0x000000000040caa1 in main (argc=<optimized out>, argv=<optimized out>)
at broker.c:548
Not sure what in this test is triggering this condition, but I'm wondering if anyone has hints on where to look. This didn't seem to be a problem until recently.
The zmq_ctx_term documentation states:
After interrupting all blocking calls, zmq_ctx_term() shall block until the following conditions are satisfied:
- All sockets open within context have been closed with zmq_close().
- For each socket within context, all messages sent by the application with zmq_send() have either been physically transferred to a network peer, or the socket's linger period set with the ZMQ_LINGER socket option has expired.
So perhaps there is an outstanding socket still open or a message stuck?
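For reference, here is a minimal standalone sketch of that contract (plain libzmq, not flux-core code): zmq_ctx_term() returns only after every socket created in the context has been closed and any queued messages are covered by the linger setting, and skipping the zmq_close() below reproduces exactly the kind of indefinite block shown in the backtrace.
#include <zmq.h>

int main (void)
{
    void *ctx = zmq_ctx_new ();
    void *sock = zmq_socket (ctx, ZMQ_DEALER);
    int linger = 0;

    /* Bound the linger period so queued-but-undelivered messages
     * cannot stall context termination indefinitely. */
    zmq_setsockopt (sock, ZMQ_LINGER, &linger, sizeof (linger));
    zmq_bind (sock, "inproc://demo");

    zmq_close (sock);    /* comment this out and zmq_ctx_term() blocks forever */
    zmq_ctx_term (ctx);  /* returns once every socket is closed or has lingered */
    return 0;
}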
If I "comment out" the last test in t1000-dws-dependencies.t, then the hang does not occur. In fact, the following test alone reproduces the issue by itself:
#!/bin/sh
test_description='Test dws-jobtap plugin with fake coral2_dws.py'
. $(dirname $0)/sharness.sh
test_under_flux 2 job
flux setattr log-stderr-level 1
PLUGINPATH=${FLUX_BUILD_DIR}/src/job-manager/plugins/.libs
DWS_SCRIPT=${SHARNESS_TEST_SRCDIR}/dws-dependencies/coral2_dws.py
DEPENDENCY_NAME="dws-create"
PROLOG_NAME="dws-setup"
EPILOG_NAME="dws-epilog"
test_expect_success 'job-manager: load dws-jobtap plugin' '
flux jobtap load ${PLUGINPATH}/dws-jobtap.so
'
test_expect_success 'job-manager: dws jobtap plugin works when job hits exception during prolog' '
create_jobid=$(flux mini submit -t 8 --output=dws4.out --error=dws4.out \
flux python ${DWS_SCRIPT} --setup-hang) &&
flux job wait-event -vt 15 -p guest.exec.eventlog ${create_jobid} shell.start &&
jobid=$(flux mini submit --setattr=system.dw="foo" hostname) &&
flux job wait-event -vt 5 -m description=${PROLOG_NAME} \
${jobid} prolog-start &&
flux job cancel $jobid
flux job wait-event -vt 1 ${jobid} exception &&
flux job wait-event -vt 5 -m description=${PROLOG_NAME} -m status=1 \
${jobid} prolog-finish &&
flux job wait-event -vt 5 -m description=${EPILOG_NAME} \
${jobid} epilog-start &&
flux job wait-event -vt 5 -m description=${EPILOG_NAME} \
${jobid} epilog-finish &&
flux job wait-event -vt 5 ${jobid} clean &&
flux job wait-event -vt 5 ${create_jobid} clean
'
test_done
Note that this test seems to be simulating a "hang" in the dws server if I'm reading it right. That is, the jobtap plugin sends an RPC to the dws service, but in this test the dws service doesn't respond. I wonder if that has something to do with the later hang in zmq_ctx_term().
The cause here seems to be a leaked future in the dws-jobtap plugin. When the service fails to respond, the callback wherein the future is destroyed is not invoked, so a future is leaked.
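To spell out the idiom (a simplified sketch, not the plugin's actual code; the topic string and function names here are just illustrative): the RPC future is normally destroyed inside its continuation, so if the service never answers, the continuation never runs and the future is never destroyed.
#include <flux/core.h>

/* Typical RPC-with-continuation pattern: the future is destroyed in the
 * callback, which only fires when a response (or error) arrives. */
static void setup_cb (flux_future_t *f, void *arg)
{
    flux_t *h = flux_future_get_flux (f);

    if (flux_rpc_get (f, NULL) < 0)
        flux_log_error (h, "dws.setup RPC failed");
    flux_future_destroy (f);   /* never reached if the service never responds */
}

int send_setup (flux_t *h)
{
    flux_future_t *f;

    if (!(f = flux_rpc (h, "dws.setup", NULL, FLUX_NODEID_ANY, 0))
        || flux_future_then (f, -1., setup_cb, NULL) < 0) {
        flux_future_destroy (f);
        return -1;
    }
    return 0;
}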
If the future is set to be destroyed along with the job, then this hang goes away:
diff --git a/src/job-manager/plugins/dws-jobtap.c b/src/job-manager/plugins/dws-jobtap.c
index fe186a8..2837a52 100644
--- a/src/job-manager/plugins/dws-jobtap.c
+++ b/src/job-manager/plugins/dws-jobtap.c
@@ -290,13 +290,13 @@ static void setup_rpc_cb (flux_future_t *f, void *arg)
                                  prolog_active);
         }
     }
-
 done:
-    flux_future_destroy (f);
+    return;
 }
 
 static void fetch_R_callback (flux_future_t *f, void *arg)
 {
+    flux_plugin_t *p = arg;
     json_t *R;
     flux_t *h = flux_future_get_flux (f);
     struct create_arg_t *args = flux_future_aux_get (f, "flux::fetch_R");
@@ -355,7 +355,13 @@ static void fetch_R_callback (flux_future_t *f, void *arg)
                                  0,
                                  "Failed to send dws.setup RPC",
                                  prolog_active);
+        goto done;
     }
+    flux_jobtap_job_aux_set (p,
+                             args->id,
+                             NULL,
+                             setup_rpc_fut,
+                             (flux_free_f) flux_future_destroy);
 done:
     flux_future_destroy (f);
@@ -412,7 +418,7 @@ static int run_cb (flux_plugin_t *p,
                               prolog_active,
                               NULL)
          < 0
-        || flux_future_then (fetch_R_future, -1., fetch_R_callback, NULL) < 0) {
+        || flux_future_then (fetch_R_future, -1., fetch_R_callback, p) < 0) {
         flux_future_destroy (fetch_R_future);
         flux_log_error (h,
                         "dws-jobtap: "
I'll open an issue in flux-coral2 on the leaks. However, this hang behavior does seem to be new...
A leaked future causes a hang? I never would have been able to figure that out... thanks for digging into this @grondo !
Great job running this down. Hmm, I believe the future takes a reference on the flux_t handle, which for a broker module contains one end of a zeromq socket for the shmem:// connection. So when the module unloads, the zeromq socket might not be getting destroyed due to the future's reference. The broker just checks whether all the modules were cleanly shut down, on the presumption that if they went through their teardown, the flux_t refcounts would be zero.
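Here's a minimal sketch of that relationship as I understand it (this assumes the refcounting described above and is not flux-core test code; run it under an existing instance, e.g. with flux start, and the "dws.setup" topic is only illustrative):
#include <flux/core.h>

int main (void)
{
    flux_t *h;
    flux_future_t *f;

    if (!(h = flux_open (NULL, 0)))     /* connect to the enclosing instance */
        return 1;

    /* Unanswered RPC: its future holds a reference on the handle. */
    if (!(f = flux_rpc (h, "dws.setup", NULL, FLUX_NODEID_ANY, 0)))
        return 1;

    flux_close (h);            /* drops only the caller's reference; the pending
                                  future keeps the handle (and the connector's
                                  zeromq socket) alive */
    flux_future_destroy (f);   /* the last reference goes away here */
    return 0;
}
That would be consistent with the fix above: once the leaked future is destroyed along with the job, the module's handle refcount reaches zero and the socket gets closed before zmq_ctx_term() runs.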
I'll open an issue in flux-core.
Edit: oops! This is flux core and this is the issue. Got it!
Well I guess a test failed and that uncovered a bug in the module, so that's a good outcome compared to before. But ideally we would log something, avoid the hang, and exit with a nonzero exit code. Maybe we could manage:
job-manager: module leaked a socket reference - was a future leaked?
broker: skipping 0MQ shutdown due to presumed module socket leak
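Roughly something like this, purely as a sketch of the idea (hypothetical, not existing broker code; leaked_module_sockets stands in for whatever bookkeeping would be needed to detect an outstanding module socket reference):
#include <stdio.h>
#include <zmq.h>

/* Hypothetical shutdown helper: if a module appears to have leaked a socket
 * reference, log it and skip zmq_ctx_term() rather than hang, and report a
 * nonzero status so the failure is visible. */
static int shutdown_zmq (void *zctx, int leaked_module_sockets)
{
    if (leaked_module_sockets > 0) {
        fprintf (stderr, "job-manager: module leaked a socket reference"
                         " - was a future leaked?\n");
        fprintf (stderr, "broker: skipping 0MQ shutdown due to presumed"
                         " module socket leak\n");
        return 1;           /* deliberately leak the context instead of blocking */
    }
    zmq_ctx_term (zctx);    /* safe: no sockets remain open */
    return 0;
}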
Just happened to notice the commit that added the reference on the handle: 01b383ba5ad3ea
Thought I'd x-ref it here in case someone (like me) is tempted to muck around with this refcounting.