flux-core
prevent or mitigate jobs writing large files to kvs stdio
A production Flux instance recently became unresponsive because the KVS was overwhelmed trying to return the output eventlog for a job. The job had apparently gone haywire and had been writing the same line of output continuously for some time (we estimated about 7 million lines). The KVS was mostly unresponsive (listing keys seemed to work, but not fetching anything, though I could be mistaken about that).
This is the first time we've seen something like this, but we should definitely have some mitigation for jobs inadvertently writing tons of output to the kvs.
Many ideas were already discussed, including:
- never write any output to the kvs
- prevent any `flux submit` or `flux run` jobs at the system instance level using the require-instance validator. If users do this in their own jobs then they only shoot their own two feet.
- handle this in the job shell output plugin (and possibly the `flux job attach` input handler) to set a maximum number of lines or bytes for stdio eventlogs
- do something similar in the KVS, perhaps an io quota per job or guest namespace
Another item for the list: I think right now a KVS get always returns the entire value in one message, even if it is stored internally in chunks. It would be fairer to other messages to make KVS get a streaming RPC that returns sequential chunks, since other messages could then be transferred while the big one is streaming.
The KVS design predates streaming RPCs, I think.
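To illustrate the difference, here is a minimal sketch in Python (hypothetical names throughout; this is not the real KVS API or wire protocol): a single-message get assembles the whole value before replying, while a streaming get yields one response per stored chunk, letting the broker interleave other messages between responses.

```python
# Sketch: single-message vs. chunked responses for a large KVS value.
# Hypothetical helpers, not flux-core code.

def get_single_message(chunks):
    """Current behavior: assemble the entire value into one response."""
    return b"".join(chunks)

def get_streaming(chunks):
    """Proposed behavior: one response per chunk, then a terminator."""
    for seq, chunk in enumerate(chunks):
        yield {"seq": seq, "data": chunk}
    yield {"seq": len(chunks), "data": None}  # ENODATA-style end of stream

# A value stored internally as three chunks:
chunks = [b"aaaa", b"bbbb", b"cc"]
whole = get_single_message(chunks)
parts = list(get_streaming(chunks))
```

The caller reassembles `parts` incrementally, and the event loop can service other requests between each chunk response.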
> handle this in the job shell output plugin (and possibly `flux job attach` input handler) to set a maximum number of lines or bytes for stdio eventlogs
I think it's worth noting that the trace of the KVS showed it was busy returning stdout to a caller (I assume `flux job attach`). That may simply have been the random trace at the time, but I think we can consider there to be two halves to this issue: writing the large amount of stdio and reading it back. It's possible the writing of stdout to the KVS was fine but the reading back was not. ~~Especially in the case that the dumping of large stdout was in a guest namespace at first, then later was read back from the system instance's rank 0 broker after it "reclaimed" the guest namespace.~~ (Edit: oops, I was confusing myself; a "guest namespace" does not run on a subinstance broker, it could be on system broker rank 0. So maybe this thought doesn't have much value now.) On the reading side, hypothetically `flux job attach` could just stop outputting after N lines and tell the user "too many lines, stopping output" or similar.
Unfortunately we have limited ability for users to `head -n` or `tail -n` their stdio afterwards, which would be nice. (Mildly related: #5035, on the `tail -n` side.)
> hypothetically `flux job attach` just stops outputting after N lines and outputs to the user "too many lines, stopping output" or whatever.
But due to what @garlick said above:
> I think right now a kvs get always returns the entire value in one message,
I think we're assuming the key is already huge, and the problem is that the KVS assembles it all at once in order to return the first result to the job-info watch-eventlog handler, so I'm not sure `flux job attach` can influence this. By the time it could decide "too many lines", the entire object has already been fetched from the KVS.
In fact, this makes me wonder if there isn't also a problem in job-info. It appears that `watch_continuation` goes through every line of an eventlog synchronously after it is fetched from the KVS, returning each entry to the caller as a separate response. In this case maybe job-info was stuck as well as the KVS (or maybe I assumed the backtrace was the KVS).
In any case, it feels like limiting what is written to the KVS is the "right" fix here. It doesn't feel right to let users write 1M lines but then only be able to see 10K (for example).
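The synchronous-loop concern above can be sketched as follows (hypothetical Python, not job-info's actual `watch_continuation`): if a 7M-entry eventlog is turned into responses in one uninterrupted pass, nothing else runs until the loop finishes, whereas a bounded-batch variant returns control to the event loop between batches.

```python
# Sketch: responding per eventlog entry all at once vs. in bounded batches.
# Hypothetical code, not flux-core's job-info module.

def entries(eventlog):
    """An eventlog is newline-delimited entries; split, dropping blanks."""
    return [line for line in eventlog.split("\n") if line]

def respond_all_at_once(eventlog):
    # Current behavior: one synchronous pass over every entry.
    return list(entries(eventlog))

def respond_in_batches(eventlog, batch_size):
    # Batched behavior: yield after each batch so the reactor can service
    # other messages between batches.
    batch = []
    for entry in entries(eventlog):
        batch.append(entry)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

log = "\n".join('{"name":"stdout"}' for _ in range(10)) + "\n"
sent = []
for batch in respond_in_batches(log, 4):
    sent.extend(batch)  # in a real service, resumed via a check/idle watcher
```

The same ten responses go out either way; the batched form just bounds how long any single pass holds the reactor.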
I like @garlick's idea of limiting output data in the shell output plugin. RFC 24 even has a Redirect Event, so the shell could potentially post a redirect event and start directing further output to a file after hitting the limit.
We should still open a long term issue to add a limit to the kvs though, since users can add data to the guest namespace in arbitrary ways.
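The shell-side limit plus redirect could look something like this sketch (hypothetical Python with in-memory stand-ins; the real plugin is C and the event shapes here are only loosely modeled on RFC 24): data events are posted to the eventlog until a byte limit is hit, then a single redirect event is posted and further output goes to a file.

```python
# Sketch: a byte limit in the shell output plugin, with an RFC 24-style
# redirect event posted once the limit is reached. Hypothetical logic,
# not flux-core's actual output plugin.

class OutputLimiter:
    def __init__(self, limit_bytes, redirect_path):
        self.limit = limit_bytes
        self.path = redirect_path
        self.written = 0
        self.eventlog = []  # stands in for the KVS output eventlog
        self.file = []      # stands in for the redirect file

    def write(self, stream, data):
        if self.written + len(data) <= self.limit:
            self.eventlog.append({"name": "data", "stream": stream, "data": data})
            self.written += len(data)
        else:
            # Post the redirect event exactly once, then divert to the file.
            if not any(e["name"] == "redirect" for e in self.eventlog):
                self.eventlog.append(
                    {"name": "redirect", "stream": stream, "path": self.path}
                )
            self.file.append(data)

out = OutputLimiter(limit_bytes=10, redirect_path="/tmp/job-output.txt")
for _ in range(5):
    out.write("stdout", "spam\n")  # 5 bytes each; limit hit after 2 lines
```

A watcher replaying the eventlog sees two data events, then the redirect telling it where the rest of the output went.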
> I like @garlick's idea of limiting output data in the shell output plugin. RFC 24 even has a Redirect Event, so the shell could potentially post a redirect event and start directing further output to a file after hitting the limit.

Oh, I like this idea!
> We should still open a long term issue to add a limit to the kvs though, since users can add data in arbitrary methods to the guest namespace.
Perhaps the long-ago #1202 needs to be considered? It would allow job-info to load data piecemeal and "rate limit" the data's return. It could also be used to support a crude `head`- or `tail`-ish mechanism.
Just opened flux-framework/rfc#378 on adding a repeat key to the output eventlogs so we could deduplicate output if desired.
I had one other idea here that might be simpler than some of the lower level KVS changes.
What if we allowed eventlog rotation? E.g., for standard io, once an eventlog is over some size, an event is posted that tells a watcher that events continue in another location. Programs with lots of output could have `guest.output`, `guest.output-1`, `guest.output-2`, etc. If the header were copied into each new eventlog, a rudimentary "log rotation" could even be supported, where older eventlogs are deleted in favor of newer ones (since each eventlog references the next eventlog in the chain, this could be done by moving the next referenced eventlog over `guest.output`).
I guess I'm not sure if this is actually any more beneficial than other ideas presented here, but thought I'd throw it out there before I forgot about it.
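A toy sketch of the rotation scheme (hypothetical; a plain dict stands in for the KVS namespace, and the `continue` event name and `state` bookkeeping are made up for illustration): once the active eventlog reaches a size limit, a chaining event points at the next key and the header is copied into the new log.

```python
# Sketch: eventlog rotation across guest.output, guest.output-1, ...
# Hypothetical scheme, not an existing flux-core feature.

def append_with_rotation(kvs, key, header, event, max_entries, state):
    """Append an event to the current eventlog, rotating when it fills.

    `state["index"]` tracks which rotation we are on; `kvs` is a dict
    standing in for the KVS namespace.
    """
    cur = key if state["index"] == 0 else f"{key}-{state['index']}"
    log = kvs.setdefault(cur, [header])
    if len(log) >= max_entries:
        state["index"] += 1
        nxt = f"{key}-{state['index']}"
        log.append({"name": "continue", "key": nxt})  # chain to next eventlog
        kvs[nxt] = [header, event]                    # header copied forward
    else:
        log.append(event)

kvs, state = {}, {"index": 0}
header = {"name": "header", "version": 1}
for i in range(7):
    append_with_rotation(kvs, "guest.output", header,
                         {"name": "data", "seq": i},
                         max_entries=4, state=state)
```

Because each full eventlog ends with an event naming its successor, a watcher can follow the chain, and deleting the oldest link only requires repointing the head key.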
I'm not sure how much it buys us, but I like it as a potential "user level" solution to #1202: a way to store large values over multiple blobs.
Another idea brought up in #6256 was implementing a maximum KVS value size. Over there, I suggested:

> hate to suggest modifying RFC 11 but we could maybe consider adding an optional size to val and valref? (optional in the sense that it is allowed to be missing from older metadata)
If we have a running total size for a kvs value, then we have the means to reject an append that would cause it to exceed some maximum.
Possibly also relevant to a future design discussion: RFC 37 blobvec encoding