prevent or mitigate jobs writing large files to kvs stdio

Open grondo opened this issue 1 year ago • 9 comments

A production Flux instance recently became unresponsive because the KVS was incapacitated while trying to return the output eventlog for a job. The job had apparently gone haywire and had been writing the same line of output continuously for some time (I think we estimated about 7 million lines). The KVS was unresponsive for the most part: listing keys seemed to work, but fetching values did not, though I could be mistaken about that.

This is the first time we've seen something like this, but we should definitely have some mitigation for jobs inadvertently writing tons of output to the kvs.

Many ideas were already discussed, including:

  • never write any output to the kvs
  • prevent any flux submit or flux run jobs at the system instance level using the require-instance validator. If users then produce runaway output from jobs inside their own instances, they only shoot themselves in the foot.
  • handle this in the job shell output plugin (and possibly flux job attach input handler) to set a maximum number of lines or bytes for stdio eventlogs
  • do something similar in the KVS, perhaps an io quota per job or guest namespace
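
To make the "maximum number of lines or bytes" idea concrete, here is a minimal sketch of the accounting such a limit would need. It is plain Python, not Flux code: `OutputLimiter`, `max_bytes`, `max_lines` and the truncation message are all hypothetical names chosen for illustration, and the real shell output plugin or KVS may need a very different shape.

```python
from dataclasses import dataclass, field


@dataclass
class OutputLimiter:
    """Hypothetical per-job cap on stdio data; not part of any Flux API."""

    max_bytes: int
    max_lines: int
    bytes_seen: int = 0
    lines_seen: int = 0
    truncated: bool = False
    accepted: list = field(default_factory=list)

    def write(self, data: bytes) -> bool:
        """Store data if it fits under both limits; return False if dropped."""
        if self.truncated:
            # Limit already hit: drop silently so the log stops growing.
            return False
        new_lines = data.count(b"\n")
        over_bytes = self.bytes_seen + len(data) > self.max_bytes
        over_lines = self.lines_seen + new_lines > self.max_lines
        if over_bytes or over_lines:
            # Record a single notice so readers can tell output was cut off.
            self.truncated = True
            self.accepted.append(b"[output truncated: limit exceeded]\n")
            return False
        self.bytes_seen += len(data)
        self.lines_seen += new_lines
        self.accepted.append(data)
        return True


# A runaway job repeating one line is cut off after max_lines entries.
limiter = OutputLimiter(max_bytes=1000, max_lines=3)
results = [limiter.write(b"same line\n") for _ in range(5)]
```

The same counters could in principle be kept per job or per guest namespace if the limit lived in the KVS instead. Writing one truncation notice and then dropping everything else keeps the stored eventlog bounded no matter how long the job keeps writing.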

grondo, May 04 '23 02:05