flux-core
prevent or mitigate jobs writing large files to kvs stdio
A production Flux instance recently became unresponsive because the KVS was incapacitated trying to return the output eventlog for a job. The job had apparently gone haywire and had been writing the same line of output continuously for some time (I think we estimated about 7 million lines). The KVS was unresponsive for the most part (listing keys seemed to work, but fetching anything did not, though I could be mistaken about that).
This is the first time we've seen something like this, but we should definitely have some mitigation for jobs inadvertently writing tons of output to the kvs.
Many ideas were already discussed, including:
- never write any output to the kvs
- prevent any `flux submit` or `flux run` jobs at the system instance level using the require-instance validator. If users do this in their own jobs then they only shoot their own two feet.
- handle this in the job shell output plugin (and possibly `flux job attach` input handler) to set a maximum number of lines or bytes for stdio eventlogs
- do something similar in the KVS, perhaps an io quota per job or guest namespace
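
To make the "maximum number of lines or bytes" idea concrete, here is a rough sketch of a per-stream limiter. It is only an illustration in Python, not flux-core code: the `OutputLimiter` class, its fields, and the truncation message are all hypothetical, and a real version would live in the shell output plugin and take its limits from configuration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OutputLimiter:
    """Hypothetical per-stream cap on bytes and lines (not a flux-core API)."""

    max_bytes: int
    max_lines: int
    bytes_seen: int = 0
    lines_seen: int = 0
    truncated: bool = False

    def filter(self, data: str) -> Optional[str]:
        """Return the part of `data` that may be stored, or None once capped."""
        if self.truncated:
            # Limit already hit: drop everything so nothing more is written.
            return None
        kept: List[str] = []
        for line in data.splitlines(keepends=True):
            size = len(line.encode("utf-8"))
            over_lines = self.lines_seen + 1 > self.max_lines
            over_bytes = self.bytes_seen + size > self.max_bytes
            if over_lines or over_bytes:
                # Record a single notice so readers can tell output was cut.
                self.truncated = True
                kept.append("[output truncated: limit exceeded]\n")
                break
            self.lines_seen += 1
            self.bytes_seen += size
            kept.append(line)
        return "".join(kept)


# A runaway job repeating one line is cut off after three lines.
limiter = OutputLimiter(max_bytes=1024, max_lines=3)
stored = [limiter.filter("same line\n") for _ in range(5)]
```

With a cap like this, the eventlog size is bounded by the limits plus one notice entry, no matter how long the job keeps writing. Whether the job should also be killed or just silenced once the cap is hit is a separate policy question.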