job-exec: possible scaling issue managing many job shells
I was running a test instance of size=1536 across 32 real nodes, each with 96 cores. When launching 128N/128p jobs, there appeared to be a performance issue in job-exec, and jobs were not making progress. A perf report shows a lot of time spent in the libsubprocess watchers:
```
-   96.26%  0.00%  job-exec  libc-2.28.so  [.] __GI___clone (inlined)
     __GI___clone (inlined)
     start_thread
     module_thread
     mod_main
     flux_reactor_run
     ev_run
   - ev_run
      - 84.93% ev_invoke_pending
         - 21.03% remote_out_check_cb
            - 14.38% fbuf_has_line
               - 13.48% cbuf_peek_line
                  - 12.15% cbuf_is_valid
                     - 1.04% cbuf_mutex_is_locked (inlined)
                          1.02% __GI___pthread_mutex_trylock (inlined)
                       0.76% __GI___pthread_mutex_lock (inlined)
              1.22% cbuf_used
              0.98% flux_watcher_stop
         - 17.73% remote_out_prep_cb
            - 14.68% fbuf_has_line
               - 13.93% cbuf_peek_line
                  + 12.53% cbuf_is_valid
                    0.91% __GI___pthread_mutex_lock (inlined)
         - 15.24% remote_in_check_cb
            + 5.03% cbuf_used
              0.73% flux_watcher_stop
         + 10.36% remote_in_prep_cb
           7.62% prepare_cb
           6.98% check_cb
      - 11.32% queue_events (inlined)
           ev_feed_event
```
A `sleep 1` prolog and epilog were configured, but the job-manager did not appear to have high utilization in perf, so maybe that is a red herring.
The other thing to note is that the nodes are being heavily oversubscribed in this test, so perhaps that's a contributing factor.
Hmm, it seems like the output ~~prep~~ check watcher is always active, so while the output channel is open, every event loop iteration will check the buffer for data. The check watcher should be stopped when the buffer is empty and started only when it contains consumable data.
We might want to see what the profile looks like with the following change (all subprocess unit tests pass with this):
```diff
diff --git a/src/common/libsubprocess/remote.c b/src/common/libsubprocess/remote.c
index 3dd9a3ca1..2e009c994 100644
--- a/src/common/libsubprocess/remote.c
+++ b/src/common/libsubprocess/remote.c
@@ -56,7 +56,6 @@ static void start_channel_watchers (flux_subprocess_t *p)
         flux_watcher_start (c->in_prep_w);
         flux_watcher_start (c->in_check_w);
         flux_watcher_start (c->out_prep_w);
-        flux_watcher_start (c->out_check_w);
         c = zhash_next (p->channels);
     }
 }
@@ -303,8 +302,10 @@ static void remote_out_prep_cb (flux_reactor_t *r,
      * reactors are closed */
     if ((c->line_buffered && fbuf_has_line (c->read_buffer))
         || (!c->line_buffered && fbuf_bytes (c->read_buffer) > 0)
-        || (c->read_eof_received && !c->eof_sent_to_caller))
+        || (c->read_eof_received && !c->eof_sent_to_caller)) {
         flux_watcher_start (c->out_idle_w);
+        flux_watcher_start (c->out_check_w);
+    }
 }
 
 static void remote_out_check_cb (flux_reactor_t *r,
@@ -315,6 +316,7 @@ static void remote_out_check_cb (flux_reactor_t *r,
     struct subprocess_channel *c = arg;
 
     flux_watcher_stop (c->out_idle_w);
+    flux_watcher_stop (c->out_check_w);
 
     if ((c->line_buffered
          && (fbuf_has_line (c->read_buffer)
@@ -336,7 +338,6 @@ static void remote_out_check_cb (flux_reactor_t *r,
      * reactors are closed */
     if (c->eof_sent_to_caller) {
         flux_watcher_stop (c->out_prep_w);
-        flux_watcher_stop (c->out_check_w);
 
         /* close input side as well */
         flux_watcher_stop (c->in_prep_w);
```
@chu11 I think you are the most familiar with this code. Does this analysis sound right?
Edit: I am referring to this code
@garlick yeah, that analysis looks right to me. I think we could also do the "in" check watcher as well.
I remember the `cbuf_is_valid()` from a while back eating up a nice chunk of CPU:
https://github.com/flux-framework/flux-core/issues/2527
it's a big chunk of checks https://github.com/flux-framework/flux-core/blob/bff958bfbc4277d0e94204104b28fee77bee8560/src/common/liblsd/cbuf.c#L1760 and it seems to take up the majority of the "has_line" check.
that's another thing we could tweak if we wanted to.
> remember the `cbuf_is_valid()` from a while back eating up a nice chunk of CPU
Looks like cbuf compiles that out if NDEBUG is set, along with other checks. Maybe we should be building it with that set.
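For reference, that elision is just the standard `assert(3)`/`NDEBUG` mechanism; a minimal standalone illustration (not cbuf's actual macro):

```c
#include <assert.h>
#include <stdio.h>

static int expensive_is_valid (void)
{
    puts ("validity check ran");
    return 1;
}

int main (void)
{
    /* With -DNDEBUG, assert() expands to nothing, so
     * expensive_is_valid() is never even called -- the same way
     * cbuf's assert (cbuf_is_valid (cb)) guards would disappear. */
    assert (expensive_is_valid ());
    return 0;
}
```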
Did you want to propose a PR for all this?
sure lemme give it a shot
another thought: since the remote subprocess already buffers data before sending it, what are we gaining by doing it again on the local side? Could we just have the message callback directly invoke the output callback(s)?
I retried with @garlick's patch above, which didn't seem to improve things. The broker is still spending a majority of its time in these three `remote_out` and `remote_in` prep/check watchers:
```
-   97.01%  0.00%  job-exec  libc-2.28.so  [.] __GI___clone
     __GI___clone (inlined)
     start_thread
     module_thread
     mod_main
     flux_reactor_run
     ev_run
   - ev_run
      - 87.27% ev_invoke_pending
         + 33.25% remote_out_prep_cb
         + 19.65% remote_in_check_cb
         + 16.81% remote_in_prep_cb
           11.28% prepare_cb
            1.51% check_cb
      + 9.72% queue_events (inlined)
```
And I'm fairly certain I recompiled, but let me check again after a `make clean`.
Ah, this seems to have gotten `remote_out_check_cb` out of the picture, which is progress.
I guess `remote_out_prep_cb()` is continually making the same checks even when nothing has been put into the buffer. Possibly instead of doing the heavyweight checks there, we should be doing them at the point where the buffer is modified, so we are only checking after conditions have changed. Then at those points we could start the check watcher.
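A rough sketch of that inversion (the hook point is assumed for illustration; the field names follow the patch above):

```c
/* Rough sketch, hook point assumed: run the buffer tests once, at the
 * point where data is actually appended to the read buffer, instead of
 * re-testing in remote_out_prep_cb() on every reactor iteration.  Only
 * when consumable data appears does the check watcher get started. */
static void on_read_buffer_modified (struct subprocess_channel *c)
{
    if ((c->line_buffered && fbuf_has_line (c->read_buffer))
        || (!c->line_buffered && fbuf_bytes (c->read_buffer) > 0))
        flux_watcher_start (c->out_check_w);
}
```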
> another thought: since the remote subprocess already buffers data before sending it, what are we gaining by doing it again on the local side? Could we just have the message callback directly invoke the output callback(s)?
Hmmmm. It seems that we don't pass `LINE_BUFFERED` to the "server side" subprocess, so the raw data coming back still needs to be buffered for handling line buffering. But perhaps there is an optimization somewhere in here.
> It seems that we don't pass `LINE_BUFFERED` to the "server side" subprocess,
It's part of the `flux_cmd_t` object (an option) so it should be getting passed through to the server side.
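If it helps to visualize, the option rides along inside the command object; a minimal client-side sketch (the exact option key here is an assumption):

```c
/* The line-buffering setting travels inside the flux_cmd_t, so the
 * remote ("server side") subprocess sees the same option.
 * Option key "STDOUT_LINE_BUFFER" is approximate. */
flux_cmd_t *cmd = flux_cmd_create (0, NULL, NULL);
flux_cmd_argv_append (cmd, "hostname");
flux_cmd_setopt (cmd, "STDOUT_LINE_BUFFER", "false");
```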
> I guess `remote_out_prep_cb()` is continually making the same checks even when nothing has been put into the buffer. Possibly instead of doing the heavyweight checks there, we should be doing them at the point where the buffer is modified, so we are only checking after conditions have changed. Then at those points we could start the check watcher.
I was just thinking something similar. `fbuf_has_line()` calls `cbuf_peek_line()`, and that's heavyweight b/c of A) the `cbuf_is_valid()`, which is a bunch of asserts including some `memcmp()` calls, and B) "is a line available" is a pretty slow iteration looking for a `\n`.
> The broker is still spending a majority of its time in these three `remote_out` and `remote_in` prep/check watchers:
I'm really curious whether things would be a lot better if `cbuf_is_valid()` were commented out. Looking at your perf output:
```
- 17.73% remote_out_prep_cb
   - 14.68% fbuf_has_line
      - 13.93% cbuf_peek_line
         + 12.53% cbuf_is_valid
           0.91% __GI___pthread_mutex_lock (inlined)
```
it certainly looks like `cbuf_is_valid()` is taking up the vast majority of the time for `remote_out_prep_cb()`. Looking internally, `cbuf_peek_line()` calls `cbuf_is_valid()` twice.
```
- 15.24% remote_in_check_cb
   + 5.03% cbuf_used
     0.73% flux_watcher_stop
+ 10.36% remote_in_prep_cb
  7.62% prepare_cb
  6.98% check_cb
```
I looked and `cbuf_used()` also calls `cbuf_is_valid()`. So I suspect that's a big chunk of the 5.03 percent in `remote_in_check_cb()` and probably a decent chunk of `remote_in_prep_cb()` too.
It certainly would be part of the problem.
Edit: in `remote_in_check_cb()` I guess I wasn't doing myself any favors calling `fbuf_bytes()` twice on the most likely common path, thus calling `cbuf_is_valid()` twice a lot. Could whittle that down to one call.
```c
/* first fbuf_bytes(): flush any data pending in the write buffer */
if (fbuf_bytes (c->write_buffer) > 0) {
    if (remote_write (c) < 0)
        goto error;
}

/* second fbuf_bytes(): remote_write() above may have drained the
 * buffer, so it must be re-checked (see Edit2 below) */
if (!fbuf_bytes (c->write_buffer) && c->closed && !c->write_eof_sent) {
    if (remote_close (c) < 0)
        goto error;
    c->write_eof_sent = true;
}
```
Edit2: I'm stupid .... `remote_write()` could change the bytes in the buffer, so it has to be called twice.
I can quickly try this since I have the test environment all set up already.
Compiling cbuf with `-DNDEBUG` may be the way to go.
That causes compilation errors:
```
$ make V=1
depbase=`echo cbuf.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`;\
/bin/sh ../../../libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I../../../config -DWITH_PTHREADS -DNDEBUG -Wall -Werror -Wno-strict-aliasing -Wno-error=deprecated-declarations -Werror=missing-field-initializers -Wno-parentheses -Wno-error=parentheses -g -O2 -MT cbuf.lo -MD -MP -MF $depbase.Tpo -c -o cbuf.lo cbuf.c &&\
mv -f $depbase.Tpo $depbase.Plo
libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I../../../config -DWITH_PTHREADS -DNDEBUG -Wall -Werror -Wno-strict-aliasing -Wno-error=deprecated-declarations -Werror=missing-field-initializers -Wno-parentheses -Wno-error=parentheses -g -O2 -MT cbuf.lo -MD -MP -MF .deps/cbuf.Tpo -c cbuf.c -fPIC -DPIC -o .libs/cbuf.o
cbuf.c: In function ‘cbuf_peek_line’:
cbuf.c:665:15: error: variable ‘l’ set but not used [-Werror=unused-but-set-variable]
  665 |     int n, m, l;
      |               ^
cbuf.c: In function ‘cbuf_read_line’:
cbuf.c:701:15: error: variable ‘l’ set but not used [-Werror=unused-but-set-variable]
  701 |     int n, m, l;
      |               ^
cbuf.c: In function ‘cbuf_replay_line’:
cbuf.c:738:15: error: variable ‘l’ set but not used [-Werror=unused-but-set-variable]
  738 |     int n, m, l;
      |               ^
cbuf.c: In function ‘cbuf_write_line’:
cbuf.c:812:23: error: variable ‘n’ set but not used [-Werror=unused-but-set-variable]
  812 |     int nfree, ncopy, n;
      |                       ^
cc1: all warnings being treated as errors
make: *** [Makefile:537: cbuf.lo] Error 1
```
But I'll see if I can fix that as well (I actually think we have this fixed in our patches for conda)
Eh, I just ended up disabling those errors/warnings:
```diff
diff --git a/src/common/liblsd/Makefile.am b/src/common/liblsd/Makefile.am
index 3670604ae..009bf7aa6 100644
--- a/src/common/liblsd/Makefile.am
+++ b/src/common/liblsd/Makefile.am
@@ -1,13 +1,16 @@
 AM_CFLAGS = \
 	$(WARNING_CFLAGS) \
 	$(CODE_COVERAGE_CFLAGS) \
-	-Wno-parentheses -Wno-error=parentheses
+	-Wno-parentheses -Wno-error=parentheses \
+	-Wno-error=unused-but-set-variable \
+	-Wno-unused-but-set-variable
 
 AM_LDFLAGS = \
 	$(CODE_COVERAGE_LIBS)
 
 AM_CPPFLAGS = \
-	-DWITH_PTHREADS
+	-DWITH_PTHREADS \
+	-DNDEBUG
 
 noinst_LTLIBRARIES = liblsd.la
```
Ok, that removed `cbuf_is_valid()` from the perf report as expected, but I think this just allows the event loop to spin faster -- the job-exec thread is still taking 100% CPU:
```
-   95.76%  0.00%  job-exec  libc-2.28.so  [.] __GI___clone (inlined)
     __GI___clone (inlined)
     start_thread
     module_thread
     mod_main
     flux_reactor_run
     ev_run
   - ev_run
      - 79.70% ev_invoke_pending
         - 22.65% remote_in_prep_cb
            + 4.32% cbuf_used
              1.88% fbuf_bytes
         - 22.62% remote_in_check_cb
            + 5.30% cbuf_used
              0.97% fbuf_bytes
              0.78% flux_watcher_stop
         - 15.09% remote_out_prep_cb
            - 6.37% fbuf_has_line
               - 4.89% cbuf_peek_line
                    3.10% __GI___pthread_mutex_lock (inlined)
                    1.13% __pthread_mutex_unlock_usercnt
           10.87% prepare_cb
            2.60% check_cb
      - 16.04% queue_events (inlined)
           15.84% ev_feed_event
```
> Hmm, it seems like the output ~~prep~~ check watcher is always active, so while the output channel is open, every event loop iteration will check the buffer for data. The check watcher should be stopped when the buffer is empty and started only when it contains consumable data.
disabling output prep/check was pretty easy and passed all tests, although I wonder if it works as a "fallout" and not due to correct implementation (more below).
the "input" side ("input" being writing to the subprocess) was a different story.
it's obviously been a long time since I wrote the flux buffers, but I think the main issue is that the write buffer is filled "outside of the reactor" in `flux_subprocess_write()`. So when data is in the buffer and `ev_io_start()` is called underneath the covers, if watchers haven't been started yet on the buffer, I don't know if `ev_io_start()` does anything. And that can lead to issues.
I think the output side happens to work b/c the read buffer is filled within a check callback, so the reactor is already humming along and the next "prep" cycle can detect the buffer not being empty.
> Ok, that removed `cbuf_is_valid()` from the perf report as expected, but I think this just allows the event loop to spin faster -- the job-exec thread is still taking 100% CPU:
Hmmmm. Interesting. Nothing strange in the logs? I'm not sure what the job is that you're running, but it makes me think that the buffers are not emptying, thus spinning.
The workload is multiple 128N/128p jobs on a size=1536 instance with a flat topology. When getting up to about 64 simultaneous jobs I started seeing this behavior. (That's 64*128 = 8192 active job shells all launched from rank 0)
per offline chat with @garlick, the general belief is that something other than "libsubprocess is slow" is causing the primary performance issue (a buffer not emptying, something causing the reactor to constantly exit the idle phase, etc.).
that said, we certainly want to speed up things in general
- deal with `cbuf_is_valid()` slowness, we'll leave that to issue #2527
- speed up libsubprocess in general
  - general ability to limit the number of calls to all of the busy cbuf functions
  - perhaps no buffering on remote output, call the output callback directly?
- other ideas
  - I dislike the fact `cbuf_peek_line()` does a loop and checks for a newline on every line of output; could this be sped up in some way? Hypothetically an `fbuf_write_line()` function could just flag that a line is in the buffer (see the sketch after this list).
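A self-contained sketch of that `fbuf_write_line()` idea (the function is hypothetical; the real fbuf sits on top of cbuf, so details would differ):

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical line-counting buffer: the writer bumps a counter each
 * time a complete line goes in, so "is a line available" becomes an
 * O(1) flag test instead of cbuf_peek_line() scanning for '\n' (and
 * re-validating the whole buffer) on every prep/check iteration. */
struct line_counted_buf {
    char data[4096];
    size_t used;
    int lines;              /* complete lines currently buffered */
};

static bool buf_write_line (struct line_counted_buf *b, const char *s)
{
    size_t len = strlen (s);

    if (b->used + len + 1 > sizeof (b->data))
        return false;       /* no room */
    memcpy (b->data + b->used, s, len);
    b->data[b->used + len] = '\n';
    b->used += len + 1;
    b->lines++;             /* writer flags line availability */
    return true;
}

static bool buf_has_line (const struct line_counted_buf *b)
{
    return b->lines > 0;    /* no scan, no validation walk */
}
```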
> deal with `cbuf_is_valid()` slowness, we'll leave that to issue https://github.com/flux-framework/flux-core/issues/2527
FYI - the solution proposed by @garlick (patch in this comment) worked well.
@chu11, FYI I added this issue to our milestone for next week's release. LMK if that's not feasible. Seems like at least we could get the performance improvements in.
I think it should be doable, my analysis a few days ago suggested we can remove the additional buffering and call the output callback directly.
> another thought: since the remote subprocess already buffers data before sending it, what are we gaining by doing it again on the local side? Could we just have the message callback directly invoke the output callback(s)?
Doh! Once I started looking at making actual changes it became obvious. We buffer b/c it's part of the API. The output callback is not passed the output data & length. The user calls `flux_subprocess_read()` and its variants in the callback and can read as much or as little as they care for. So it needs to be buffered somewhere.
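For example, a typical consumer looks something like this (signature approximated from the current API):

```c
#include <stdio.h>

/* The callback receives no data pointer or length -- the caller pulls
 * bytes/lines out of the local read buffer on demand, which is why the
 * remote data must be buffered again locally. */
static void output_cb (flux_subprocess_t *p, const char *stream)
{
    const char *line;
    int len;

    if ((line = flux_subprocess_read_line (p, stream, &len)) && len > 0)
        printf ("%s: %.*s", stream, len, line);
}
```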
So I don't think this is going to be an optimization option.
> I don't think this is going to be an optimization option.
Darn, I didn't notice that either.
Well there are still the other optimizations and the mystery of the reactor loop that wouldn't stop.
Could we add a new function to disable the existing callbacks and register a new one that is passed output as soon as it is ready, including the data and length? It is kind of ugly but could be used to work around this performance issue in the job-exec module.
> Could we add a new function to disable the existing callbacks and register a new one that is passed output as soon as it is ready, including the data and length?
Another way to do it would be to add a flag that could be passed to `flux_rexec()` like `FLUX_SUBPROCESS_FLAGS_LOCAL_UNBUF` that selects this behavior at the onset, so those buffers don't have to be created and then destroyed.
That's a good idea too, though we'd still need a different callback prototype:
```c
typedef void (*flux_subprocess_output_f) (flux_subprocess_t *p,
                                          const char *stream);
```
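An unbuffered variant would presumably need to hand over the data directly; something like this (purely illustrative, names invented):

```c
/* hypothetical prototype for the FLUX_SUBPROCESS_FLAGS_LOCAL_UNBUF
 * idea: data and length arrive in the callback itself, so no local
 * read buffer (and no flux_subprocess_read()) is involved */
typedef void (*flux_subprocess_output_data_f) (flux_subprocess_t *p,
                                               const char *stream,
                                               const char *data,
                                               int len);
```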
The original subprocess API was modeled after some existing event-driven interfaces that just pass back an event and the callback handles the event. A bit of an unfortunate choice in hindsight.
That reminds me - I have a subprocess cleanup branch in which I consolidated the stdout, stderr, and channel output callback ops into one (since the stream name is passed into the callback). Maybe that would be a good thing to do before adding another output type, since otherwise it would require adding three new operations.
ok, I see three paths (some comments were in #5932)
1. simple "don't start prep/check unless necessary"
   - output side - not too bad
   - input side - tricky stuff, will take some refactoring work
   - likelihood of meeting the May 7th release: ok chance
2. use the fbuf watcher instead, which can handle some prep/check start/stop trickiness when it is empty/non-empty
   - output side - assumed this can be done, will probably have to handle some tricky stuff
   - input side - dunno how hard it will be to handle the tricky parts, could just skip this half
   - likelihood of meeting the May 7th release: I'm not confident, but possible
3. remote_output - call the remote output callback directly instead of using buffers, will require re-work of the output callback function and all callers
   - likelihood of meeting the May 7th release: can't imagine this happening
personally, I think option 3 is the best path (the devil is in the details until I start).