job-exec: possible scaling issue managing many job shells
I was running a test instance of size=1536 across 32 real nodes, each with 96 cores. When launching 128N/128p jobs, there appeared to be a performance issue in job-exec, and jobs were not making progress. A perf report shows a lot of time spent in the libsubprocess watchers:
```
-   96.26%  0.00%  job-exec  libc-2.28.so  [.] __GI___clone (inlined)
     __GI___clone (inlined)
     start_thread
     module_thread
     mod_main
     flux_reactor_run
     ev_run
   - ev_run
      - 84.93% ev_invoke_pending
         - 21.03% remote_out_check_cb
            - 14.38% fbuf_has_line
               - 13.48% cbuf_peek_line
                  - 12.15% cbuf_is_valid
                     - 1.04% cbuf_mutex_is_locked (inlined)
                          1.02% __GI___pthread_mutex_trylock (inlined)
                       0.76% __GI___pthread_mutex_lock (inlined)
              1.22% cbuf_used
              0.98% flux_watcher_stop
         - 17.73% remote_out_prep_cb
            - 14.68% fbuf_has_line
               - 13.93% cbuf_peek_line
                  + 12.53% cbuf_is_valid
                    0.91% __GI___pthread_mutex_lock (inlined)
         - 15.24% remote_in_check_cb
            + 5.03% cbuf_used
              0.73% flux_watcher_stop
         + 10.36% remote_in_prep_cb
           7.62% prepare_cb
           6.98% check_cb
      - 11.32% queue_events (inlined)
           ev_feed_event
```
A `sleep 1` prolog and epilog were configured, but the job-manager did not appear to have high utilization in perf, so maybe that is a red herring.
The other thing to note is that the nodes are being heavily oversubscribed in this test, so perhaps that's a contributing factor.
Hmm, it seems like the output ~~prep~~ check watcher is always active, so while the output channel is open, every event loop iteration will check the buffer for data. The check watcher should be stopped when the buffer is empty and started only when it contains consumable data.
We might want to see what the profile looks like with the following change (all subprocess unit tests pass with this):
```diff
diff --git a/src/common/libsubprocess/remote.c b/src/common/libsubprocess/remote.c
index 3dd9a3ca1..2e009c994 100644
--- a/src/common/libsubprocess/remote.c
+++ b/src/common/libsubprocess/remote.c
@@ -56,7 +56,6 @@ static void start_channel_watchers (flux_subprocess_t *p)
         flux_watcher_start (c->in_prep_w);
         flux_watcher_start (c->in_check_w);
         flux_watcher_start (c->out_prep_w);
-        flux_watcher_start (c->out_check_w);
         c = zhash_next (p->channels);
     }
 }
@@ -303,8 +302,10 @@ static void remote_out_prep_cb (flux_reactor_t *r,
      * reactors are closed */
     if ((c->line_buffered && fbuf_has_line (c->read_buffer))
         || (!c->line_buffered && fbuf_bytes (c->read_buffer) > 0)
-        || (c->read_eof_received && !c->eof_sent_to_caller))
+        || (c->read_eof_received && !c->eof_sent_to_caller)) {
         flux_watcher_start (c->out_idle_w);
+        flux_watcher_start (c->out_check_w);
+    }
 }
 
 static void remote_out_check_cb (flux_reactor_t *r,
@@ -315,6 +316,7 @@ static void remote_out_check_cb (flux_reactor_t *r,
     struct subprocess_channel *c = arg;
 
     flux_watcher_stop (c->out_idle_w);
+    flux_watcher_stop (c->out_check_w);
 
     if ((c->line_buffered
          && (fbuf_has_line (c->read_buffer)
@@ -336,7 +338,6 @@ static void remote_out_check_cb (flux_reactor_t *r,
      * reactors are closed */
     if (c->eof_sent_to_caller) {
         flux_watcher_stop (c->out_prep_w);
-        flux_watcher_stop (c->out_check_w);
 
         /* close input side as well */
         flux_watcher_stop (c->in_prep_w);
```
@chu11 I think you are the most familiar with this code. Does this analysis sound right?
Edit: I am referring to this code
@garlick yeah, that analysis looks right to me. I think we could also do the "in" check watcher as well.
I remember the `cbuf_is_valid()` from a while back eating up a nice chunk of CPU:
https://github.com/flux-framework/flux-core/issues/2527
it's a big chunk of checks https://github.com/flux-framework/flux-core/blob/bff958bfbc4277d0e94204104b28fee77bee8560/src/common/liblsd/cbuf.c#L1760 and it seems to take up the majority of the "has_line" check.
that's another thing we could tweak if we wanted to.
> remember the `cbuf_is_valid()` from a while back eating up a nice chunk of CPU
Looks like cbuf compiles that out if NDEBUG is set, along with other checks. Maybe we should be building it with that set.
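For reference, that elision is just the standard `assert(3)`/`NDEBUG` mechanism; a minimal standalone illustration (not cbuf's actual macro):

```c
#include <assert.h>
#include <stdio.h>

static int expensive_is_valid (void)
{
    puts ("validity check ran");
    return 1;
}

int main (void)
{
    /* With -DNDEBUG, assert() expands to nothing, so
     * expensive_is_valid() is never even called -- the same way
     * cbuf's assert (cbuf_is_valid (cb)) guards would disappear. */
    assert (expensive_is_valid ());
    return 0;
}
```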
Did you want to propose a PR for all this?
sure lemme give it a shot
another thought: since the remote subprocess already buffers data before sending it, what are we gaining by doing it again on the local side? Could we just have the message callback directly invoke the output callback(s)?
I retried with @garlick's patch above, which didn't seem to improve things. The broker is still spending a majority of its time in these three `remote_out` and `remote_in` prep/check watchers:
```
-   97.01%  0.00%  job-exec  libc-2.28.so  [.] __GI___clone
     __GI___clone (inlined)
     start_thread
     module_thread
     mod_main
     flux_reactor_run
     ev_run
   - ev_run
      - 87.27% ev_invoke_pending
         + 33.25% remote_out_prep_cb
         + 19.65% remote_in_check_cb
         + 16.81% remote_in_prep_cb
           11.28% prepare_cb
            1.51% check_cb
      + 9.72% queue_events (inlined)
```
And I'm fairly certain I recompiled, but let me check again after a `make clean`.
Ah, this seems to have gotten `remote_out_check_cb` out of the picture, which is progress.
I guess `remote_out_prep_cb()` is continually making the same checks even when nothing has been put into the buffer. Possibly instead of doing the heavyweight checks there, we should be doing them at the point where the buffer is modified, so we are only checking after conditions have changed. Then at those points we could start the check watcher.
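A rough sketch of that inversion (the hook point is assumed for illustration; the field names follow the patch above):

```c
/* Rough sketch, hook point assumed: run the buffer tests once, at the
 * point where data is actually appended to the read buffer, instead of
 * re-testing in remote_out_prep_cb() on every reactor iteration.  Only
 * when consumable data appears does the check watcher get started. */
static void on_read_buffer_modified (struct subprocess_channel *c)
{
    if ((c->line_buffered && fbuf_has_line (c->read_buffer))
        || (!c->line_buffered && fbuf_bytes (c->read_buffer) > 0))
        flux_watcher_start (c->out_check_w);
}
```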
> another thought: since the remote subprocess already buffers data before sending it, what are we gaining by doing it again on the local side? Could we just have the message callback directly invoke the output callback(s)?
Hmmmm. It seems that we don't pass `LINE_BUFFERED` to the "server side" subprocess, so the raw data coming back still needs to be buffered for handling line buffering. But perhaps there is an optimization somewhere in here.
> It seems that we don't pass `LINE_BUFFERED` to the "server side" subprocess,
It's part of the `flux_cmd_t` object (an option) so it should be getting passed through to the server side.
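If it helps to visualize, the option rides along inside the command object; a minimal client-side sketch (the exact option key here is an assumption):

```c
/* The line-buffering setting travels inside the flux_cmd_t, so the
 * remote ("server side") subprocess sees the same option.
 * Option key "STDOUT_LINE_BUFFER" is approximate. */
flux_cmd_t *cmd = flux_cmd_create (0, NULL, NULL);
flux_cmd_argv_append (cmd, "hostname");
flux_cmd_setopt (cmd, "STDOUT_LINE_BUFFER", "false");
```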
> I guess `remote_out_prep_cb()` is continually making the same checks even when nothing has been put into the buffer. Possibly instead of doing the heavyweight checks there, we should be doing them at the point where the buffer is modified, so we are only checking after conditions have changed. Then at those points we could start the check watcher.
I was just thinking something similar. `fbuf_has_line()` calls `cbuf_peek_line()`, and that's heavyweight b/c of A) the `cbuf_is_valid()`, which is a bunch of asserts including some `memcmp()` calls, and B) "is a line available" is a pretty slow iteration looking for a `\n`.
> The broker is still spending a majority of its time in these three `remote_out` and `remote_in` prep/check watchers:
I'm really curious whether things would be a lot better if `cbuf_is_valid()` were commented out. Looking at your perf output:
```
- 17.73% remote_out_prep_cb
   - 14.68% fbuf_has_line
      - 13.93% cbuf_peek_line
         + 12.53% cbuf_is_valid
           0.91% __GI___pthread_mutex_lock (inlined)
```
it certainly looks like `cbuf_is_valid()` is taking up the vast majority of the time for `remote_out_prep_cb()`. Looking internally, `cbuf_peek_line()` calls `cbuf_is_valid()` twice.
```
- 15.24% remote_in_check_cb
   + 5.03% cbuf_used
     0.73% flux_watcher_stop
+ 10.36% remote_in_prep_cb
  7.62% prepare_cb
  6.98% check_cb
```
I looked and `cbuf_used()` also calls `cbuf_is_valid()`. So I suspect that's a big chunk of the 5.03 percent in `remote_in_check_cb()` and probably a decent chunk of `remote_in_prep_cb()` too.
It certainly would be part of the problem.
Edit: in `remote_in_check_cb()` I guess I wasn't doing myself any favors calling `fbuf_bytes()` twice on the most likely common path, thus calling `cbuf_is_valid()` twice a lot. Could whittle that down to one call.
```c
/* first fbuf_bytes(): flush any data pending in the write buffer */
if (fbuf_bytes (c->write_buffer) > 0) {
    if (remote_write (c) < 0)
        goto error;
}

/* second fbuf_bytes(): remote_write() above may have drained the
 * buffer, so it must be re-checked (see Edit2 below) */
if (!fbuf_bytes (c->write_buffer) && c->closed && !c->write_eof_sent) {
    if (remote_close (c) < 0)
        goto error;
    c->write_eof_sent = true;
}
```
Edit2: I'm stupid .... `remote_write()` could change the bytes in the buffer, so it has to be called twice.
I can quickly try this since I have the test environment all set up already.
Compiling cbuf with `-DNDEBUG` may be the way to go.
That causes compilation errors:
```
$ make V=1
depbase=`echo cbuf.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`;\
/bin/sh ../../../libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I../../../config -DWITH_PTHREADS -DNDEBUG -Wall -Werror -Wno-strict-aliasing -Wno-error=deprecated-declarations -Werror=missing-field-initializers -Wno-parentheses -Wno-error=parentheses -g -O2 -MT cbuf.lo -MD -MP -MF $depbase.Tpo -c -o cbuf.lo cbuf.c &&\
mv -f $depbase.Tpo $depbase.Plo
libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I../../../config -DWITH_PTHREADS -DNDEBUG -Wall -Werror -Wno-strict-aliasing -Wno-error=deprecated-declarations -Werror=missing-field-initializers -Wno-parentheses -Wno-error=parentheses -g -O2 -MT cbuf.lo -MD -MP -MF .deps/cbuf.Tpo -c cbuf.c -fPIC -DPIC -o .libs/cbuf.o
cbuf.c: In function ‘cbuf_peek_line’:
cbuf.c:665:15: error: variable ‘l’ set but not used [-Werror=unused-but-set-variable]
  665 |     int n, m, l;
      |               ^
cbuf.c: In function ‘cbuf_read_line’:
cbuf.c:701:15: error: variable ‘l’ set but not used [-Werror=unused-but-set-variable]
  701 |     int n, m, l;
      |               ^
cbuf.c: In function ‘cbuf_replay_line’:
cbuf.c:738:15: error: variable ‘l’ set but not used [-Werror=unused-but-set-variable]
  738 |     int n, m, l;
      |               ^
cbuf.c: In function ‘cbuf_write_line’:
cbuf.c:812:23: error: variable ‘n’ set but not used [-Werror=unused-but-set-variable]
  812 |     int nfree, ncopy, n;
      |                       ^
cc1: all warnings being treated as errors
make: *** [Makefile:537: cbuf.lo] Error 1
```
But I'll see if I can fix that as well (I actually think we have this fixed in our patches for conda)
Eh, I just ended up disabling those errors/warnings:
```diff
diff --git a/src/common/liblsd/Makefile.am b/src/common/liblsd/Makefile.am
index 3670604ae..009bf7aa6 100644
--- a/src/common/liblsd/Makefile.am
+++ b/src/common/liblsd/Makefile.am
@@ -1,13 +1,16 @@
 AM_CFLAGS = \
 	$(WARNING_CFLAGS) \
 	$(CODE_COVERAGE_CFLAGS) \
-	-Wno-parentheses -Wno-error=parentheses
+	-Wno-parentheses -Wno-error=parentheses \
+	-Wno-error=unused-but-set-variable \
+	-Wno-unused-but-set-variable
 
 AM_LDFLAGS = \
 	$(CODE_COVERAGE_LIBS)
 
 AM_CPPFLAGS = \
-	-DWITH_PTHREADS
+	-DWITH_PTHREADS \
+	-DNDEBUG
 
 noinst_LTLIBRARIES = liblsd.la
```
Ok, that removed `cbuf_is_valid()` from the perf report as expected, but I think this just allows the event loop to spin faster -- the job-exec thread is still taking 100% CPU:
```
-   95.76%  0.00%  job-exec  libc-2.28.so  [.] __GI___clone (inlined)
     __GI___clone (inlined)
     start_thread
     module_thread
     mod_main
     flux_reactor_run
     ev_run
   - ev_run
      - 79.70% ev_invoke_pending
         - 22.65% remote_in_prep_cb
            + 4.32% cbuf_used
              1.88% fbuf_bytes
         - 22.62% remote_in_check_cb
            + 5.30% cbuf_used
              0.97% fbuf_bytes
              0.78% flux_watcher_stop
         - 15.09% remote_out_prep_cb
            - 6.37% fbuf_has_line
               - 4.89% cbuf_peek_line
                    3.10% __GI___pthread_mutex_lock (inlined)
                    1.13% __pthread_mutex_unlock_usercnt
           10.87% prepare_cb
            2.60% check_cb
      - 16.04% queue_events (inlined)
           15.84% ev_feed_event
```
> Hmm, it seems like the output ~~prep~~ check watcher is always active, so while the output channel is open, every event loop iteration will check the buffer for data. The check watcher should be stopped when the buffer is empty and started only when it contains consumable data.
disabling output prep/check was pretty easy and passed all tests, although I wonder if it works as a "fallout" and not due to correct implementation (more below).
the "input" side ("input" being writing to the subprocess) was a different story.
it's obviously been a long time since I wrote the flux buffers, but I think the main issue is that the write buffer is filled "outside of the reactor" in `flux_subprocess_write()`. So when data is in the buffer and `ev_io_start()` is called underneath the covers, if watchers haven't been started yet on the buffer, I don't know if `ev_io_start()` does anything. And that can lead to issues.
I think the output side happens to work b/c the read buffer is filled within a check callback, so the reactor is already humming along and the next "prep" cycle can detect the buffer not being empty.
> Ok, that removed `cbuf_is_valid()` from the perf report as expected, but I think this just allows the event loop to spin faster -- the job-exec thread is still taking 100% CPU:
Hmmmm. Interesting. Nothing strange in the logs? I'm not sure what the job is that you're running, but it makes me think that the buffers are not emptying, thus spinning.
The workload is multiple 128N/128p jobs on a size=1536 instance with a flat topology. When getting up to about 64 simultaneous jobs I started seeing this behavior. (That's 64*128 = 8192 active job shells all launched from rank 0)
per offline chat with @garlick, the general belief is that something other than "libsubprocess is slow" is causing the primary performance issue (a buffer not emptying, something causing the reactor to constantly exit the idle phase, etc.).
that said, we certainly want to speed up things in general
- deal with `cbuf_is_valid()` slowness, we'll leave that to issue #2527
- speed up libsubprocess in general
  - general ability to limit the number of calls to all of the busy cbuf functions
  - perhaps no buffering on remote output, call the output callback directly?
- other ideas
  - I dislike the fact `cbuf_peek_line()` does a loop and checks for a newline on every line of output; could this be sped up in some way? Hypothetically an `fbuf_write_line()` function could just flag that a line is in the buffer (see the sketch after this list).
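A self-contained sketch of that `fbuf_write_line()` idea (the function is hypothetical; the real fbuf sits on top of cbuf, so details would differ):

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical line-counting buffer: the writer bumps a counter each
 * time a complete line goes in, so "is a line available" becomes an
 * O(1) flag test instead of cbuf_peek_line() scanning for '\n' (and
 * re-validating the whole buffer) on every prep/check iteration. */
struct line_counted_buf {
    char data[4096];
    size_t used;
    int lines;              /* complete lines currently buffered */
};

static bool buf_write_line (struct line_counted_buf *b, const char *s)
{
    size_t len = strlen (s);

    if (b->used + len + 1 > sizeof (b->data))
        return false;       /* no room */
    memcpy (b->data + b->used, s, len);
    b->data[b->used + len] = '\n';
    b->used += len + 1;
    b->lines++;             /* writer flags line availability */
    return true;
}

static bool buf_has_line (const struct line_counted_buf *b)
{
    return b->lines > 0;    /* no scan, no validation walk */
}
```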
> deal with `cbuf_is_valid()` slowness, we'll leave that to issue https://github.com/flux-framework/flux-core/issues/2527
FYI - the solution proposed by @garlick (patch in this comment) worked well.
@chu11, FYI I added this issue to our milestone for next week's release. LMK if that's not feasible. Seems like at least we could get the performance improvements in.
I think it should be doable, my analysis a few days ago suggested we can remove the additional buffering and call the output callback directly.
> another thought: since the remote subprocess already buffers data before sending it, what are we gaining by doing it again on the local side? Could we just have the message callback directly invoke the output callback(s)?
Doh! Once I started looking at making actual changes it became obvious. We buffer b/c it's part of the API. The output callback is not passed the output data & length. The user calls `flux_subprocess_read()` and its variants in the callback and can read as much or as little as they care for. So it needs to be buffered somewhere.
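For example, a typical consumer looks something like this (signature approximated from the current API):

```c
#include <stdio.h>

/* The callback receives no data pointer or length -- the caller pulls
 * bytes/lines out of the local read buffer on demand, which is why the
 * remote data must be buffered again locally. */
static void output_cb (flux_subprocess_t *p, const char *stream)
{
    const char *line;
    int len;

    if ((line = flux_subprocess_read_line (p, stream, &len)) && len > 0)
        printf ("%s: %.*s", stream, len, line);
}
```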
So I don't think this is going to be an optimization option.
> I don't think this is going to be an optimization option.
Darn, I didn't notice that either.
Well there are still the other optimizations and the mystery of the reactor loop that wouldn't stop.
Could we add a new function to disable the existing callbacks and register a new one that is passed output as soon as it is ready, including the data and length? It is kind of ugly but could be used to work around this performance issue in the job-exec module.
> Could we add a new function to disable the existing callbacks and register a new one that is passed output as soon as it is ready, including the data and length?
Another way to do it would be to add a flag that could be passed to `flux_rexec()` like `FLUX_SUBPROCESS_FLAGS_LOCAL_UNBUF` that selects this behavior at the onset, so those buffers don't have to be created and then destroyed.
That's a good idea too, though we'd still need a different callback prototype:
```c
typedef void (*flux_subprocess_output_f) (flux_subprocess_t *p,
                                          const char *stream);
```
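An unbuffered variant would presumably need to hand over the data directly; something like this (purely illustrative, names invented):

```c
/* hypothetical prototype for the FLUX_SUBPROCESS_FLAGS_LOCAL_UNBUF
 * idea: data and length arrive in the callback itself, so no local
 * read buffer (and no flux_subprocess_read()) is involved */
typedef void (*flux_subprocess_output_data_f) (flux_subprocess_t *p,
                                               const char *stream,
                                               const char *data,
                                               int len);
```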
The original subprocess API was modeled after some existing event-driven interfaces that just pass back an event and the callback handles the event. A bit of an unfortunate choice in hindsight.
That reminds me - I have a subprocess cleanup branch in which I consolidated the stdout, stderr, and channel output callback ops into one (since the stream name is passed into the callback). Maybe that would be a good thing to do before adding another output type, since otherwise it would require adding three new operations.
ok, I see three paths (some comments were in #5932)
1. simple "don't start prep/check unless necessary"
   - output side - not too bad
   - input side - tricky stuff, will take some refactoring work
   - likelihood of meeting the May 7th release: ok chance
2. use the fbuf watcher instead, which can handle some prep/check start/stop trickiness when it is empty/non-empty
   - output side - assumed this can be done, will probably have to handle some tricky stuff
   - input side - dunno how hard it will be to handle the tricky parts, could just skip this half
   - likelihood of meeting the May 7th release: I'm not confident, but possible
3. remote_output - call the remote output callback directly instead of using buffers, will require re-work of the output callback function and all callers
   - likelihood of meeting the May 7th release: can't imagine this happening
personally, I think option 3 is the best path (the devil is in the details until I start).