flux-core
libsubprocess: support FLUX_SUBPROCESS_FLAGS_LOCAL_UNBUF
Per discussion in #5919, some notes:
- built on top of my cleanups in #5974
- should the use of this flag in job-exec be a different PR?
- in job-exec, subprocess flags are passed in via the API to bulk-exec, so it felt wrong to just assume that flag is always set. That led to adding a new `flux_subprocess_get_flags()` function and calling an appropriate "read" function depending on the flags. This could be done differently of course; perhaps the flags shouldn't be passed in from the API and could instead be internalized within bulk-exec?
- haven't tested to ensure that the performance profile changes as expected; dunno if we have a simple reproducer lying around?
re-pushed given the comments above; did an additional profiling run and it showed some small incremental improvements.
re-pushed w/ tweaks per comments above. Only notable thing is I kept EPERM vs EINVAL for the "can't use these functions" error case.
Sounds good. I'll get this on my test system that is configured for sdexec and see how it goes.
One concern is the lack of output buffering in `sdexec`, which in combination with this flag means no buffering at all. I'll try it, and we should also audit `job-exec` with that in mind to see if we can spot any problem areas. We may need to add output buffering to `sdexec` before switching `job-exec` over if there are problems.
The shell barrier protocol is fine since I believe it's the only user of stdout, and the protocol consists of the shell sending a line of output and then job exec sending one line back. There is not much opportunity for buffering (or lack thereof) to screw that up.
@grondo suggested I print something from a userrc to check how `job-exec` handles shell output on stderr.
```lua
-- foo.lua
io.stderr:write("foo bar\n")
io.stderr:write("foo bar\n")
io.stderr:write("foo bar\n")
io.stderr:write("foo bar\n")
```
With master:

```
$ flux run -o userrc=foo.lua true
0.040s: flux-shell[0]: stderr: foo bar
0.040s: flux-shell[0]: stderr: foo bar
0.040s: flux-shell[0]: stderr: foo bar
0.040s: flux-shell[0]: stderr: foo bar
$ flux job eventlog -H -p exec $(flux job last) | grep log
[ +0.017012] log component="flux-shell" stream="stderr" rank="0" data="foo bar\n"
[ +0.017024] log component="flux-shell" stream="stderr" rank="0" data="foo bar\n"
[ +0.017031] log component="flux-shell" stream="stderr" rank="0" data="foo bar\n"
[ +0.017038] log component="flux-shell" stream="stderr" rank="0" data="foo bar\n"
```
With this PR there is evidence that lines are not being buffered. Surprisingly, this happens with and without sdexec. I thought we had remote line buffering with the normal subprocess server.
```
$ flux run -o userrc=foo true
0.064s: flux-shell[0]: stderr: foo bar
foo bar
foo bar
foo bar
$ flux job eventlog -H -p exec $(flux job last) | grep log
[ +0.033395] log component="flux-shell" stream="stderr" rank="0" data="foo bar\nfoo bar\nfoo bar\nfoo bar\n"
```
Oh! Even though the remote subprocess is line buffered (the default), we are using `flux_subprocess_read()` in the server to consume data from the buffer and put it into messages. We need to use the line-oriented read functions so that we are sending only one line per message, or else the benefits of line buffering are lost. I guess the server output callback needs to check the stream's LINE_BUFFER option setting and then use the appropriate read function.
This seems to fix that problem.
```diff
diff --git a/src/common/libsubprocess/server.c b/src/common/libsubprocess/server.c
index fca01444e..139923f23 100644
--- a/src/common/libsubprocess/server.c
+++ b/src/common/libsubprocess/server.c
@@ -246,7 +246,10 @@ static void proc_output_cb (flux_subprocess_t *p, const char *stream)
     const char *buf;
     int len;
 
-    if ((len = flux_subprocess_read (p, stream, &buf)) < 0) {
+    len = flux_subprocess_getline (p, stream, &buf);
+    if (len < 0 && errno == EPERM) // not line buffered
+        len = flux_subprocess_read (p, stream, &buf);
+    if (len < 0) {
         llog_error (s,
                     "error reading from subprocess stream %s: %s",
                     stream,
```
Hmmmm. There's a part of me wondering if I originally messed up and had intended to NOT line buffer on the server side, i.e. all line buffer handling occurs on the client side.
But now that we have LOCAL_UNBUF, perhaps that no longer matters and the approach you have is what needs to be done.
I guess it's a trade off. The client has turned out to be pretty heavy with the buffering on that end. OTOH if we buffer on the server end then we can end up with more message framing overhead when there are bursts of messages. But since it's a common idiom to have many clients in one place spread out across many servers, moving the buffering to the server side probably makes sense.
Should we add the server side change in this PR and also maybe a test to show that we get one callback per line in UNBUF mode with the libsubprocess server? Then we could follow up with a PR for sdexec to add some buffering to it. I can look at that.
re-pushed, adding a commit to update the server side to send each line-buffered chunk separately, and added a multiline test.
hmmm, the alpine builder failed:

```
Error: not ok 13 - job-exec: can specify default-shell on cmdline
Error: not ok 16 - job-exec: update default shell via config reload
Error: not ok 17 - job-exec: cmdline default shell takes priority
Error: not ok 18 - job-exec: can specify imp path on cmdline
Error: not ok 21 - job-exec: update imp path via config reload
Error: not ok 22 - job-exec: cmdline imp path takes priority
Error: not ok 23 - job-exec: can specify exec service on cmdline
Error: not ok 25 - job-exec: update exec service via config reload
Error: not ok 26 - job-exec: cmdline exec service takes priority
Error: not ok 28 - job-exec: update exec service override via config reload
Error: not ok 32 - job-exec: update sdexec properties via config reload
Error: ERROR: t2403-job-exec-conf.t - exited with status 1
```
These are all job-exec module reload tests. Wondering if the recent additions (kill-timeout, etc.) have issues when the module is reloaded; on alpine, the globals may not be reset.
Here's one troubling failure:

```
flux-module: job-exec.stats-get: Out of memory
not ok 13 - job-exec: can specify default-shell on cmdline
```
I'll restart the builder; this obviously passed on other runs, allowing this to be merged.

Edit: it failed again the same way.
Without any better ideas, I re-pushed with an extra commit that re-initializes the new signaling globals. I'm guessing my changes to job-exec tickled things enough to make things fail.
Ah, yeah, that makes sense. Probably we were only lucky to get away with it previously.
Oh whoops, I should have removed merge-when-passing until that new commit was reviewed... hopefully it's ok :-) Here it is:
https://github.com/flux-framework/flux-core/pull/5975/commits/a35f615a06693c23a838adf804e93f993e32faa1
Well that change overwrites the previous setting on any configuration reload with the defaults. However, probably not a big deal since these are mostly used only for testing at this point...
Gah ... that's right, should have passed the saved argc/argv into it like we do with the exec_config ... I'll do a fix.
```c
if (job_exec_set_config_globals (h, conf, 0, NULL, &err) < 0)
    goto error;

while ((impl = implementations[i]) && impl->name) {
    if (impl->config) {
        if ((*impl->config) (h, conf, ctx->argc, ctx->argv, &err) < 0)
            goto error;
    }
    i++;
}
```
## Codecov Report

Attention: Patch coverage is 76.11940% with 16 lines in your changes missing coverage. Please review.

Project coverage is 83.26%. Comparing base (9e84f0f) to head (a35f615). Report is 500 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/common/libsubprocess/remote.c | 62.50% | 15 Missing :warning: |
| src/modules/job-exec/bulk-exec.c | 66.66% | 1 Missing :warning: |

Additional details and impacted files:

```
@@           Coverage Diff           @@
##           master    #5975   +/-   ##
=======================================
  Coverage   83.26%   83.26%
=======================================
  Files         518      518
  Lines       83468    83513   +45
=======================================
+ Hits        69502    69541   +39
- Misses      13966    13972    +6
```

| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/common/libsubprocess/server.c | 78.86% <100.00%> (+0.19%) | :arrow_up: |
| src/common/libsubprocess/subprocess.c | 89.56% <100.00%> (+0.63%) | :arrow_up: |
| src/modules/job-exec/job-exec.c | 76.99% <100.00%> (+0.09%) | :arrow_up: |
| src/modules/job-exec/bulk-exec.c | 77.86% <66.66%> (ø) | |
| src/common/libsubprocess/remote.c | 76.62% <62.50%> (-0.13%) | :arrow_down: |