OpenStack: Watch host resources to identify lockups

Open keiranmraine opened this issue 7 years ago • 2 comments

May apply elsewhere, but OpenStack is where I've seen issues.

I've seen a few instances where the underlying job can get stuck. Could WR watch the 15 minute load avg and kill a host if it falls below a user-defined threshold (deploy setting)?

For my use case, watching the actual job isn't useful, as docker usage isn't directly linked to the wr job (I'm looking into whether there is a workaround).

keiranmraine avatar Mar 23 '17 19:03 keiranmraine

Can you offer some standard unix commands for determining the 15 minute load avg? Won't this trigger during normal heavy I/O parts of your pipelines, like staging in data, or during the upload of results?

sb10 avatar Mar 24 '17 09:03 sb10

$ top -b -n 1  | head -n 1
top - 10:54:07 up 35 min,  1 user,  load average: 0.62, 0.58, 0.59
$ cat /proc/loadavg 
0.65 0.59 0.59 1/442 2256
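
For anything that needs the numbers programmatically, the same file can be parsed directly. A minimal sketch in Python, not anything WR provides: the `LoadAvg`, `parse_loadavg` and `read_loadavg` names are made up for illustration. The first three fields of `/proc/loadavg` are the 1, 5 and 15 minute averages.

```python
from typing import NamedTuple


class LoadAvg(NamedTuple):
    """The three load averages reported by the kernel (hypothetical helper type)."""
    one: float
    five: float
    fifteen: float


def parse_loadavg(text: str) -> LoadAvg:
    """Parse the contents of /proc/loadavg.

    Only the first three whitespace-separated fields are used; the
    remaining fields (running/total tasks, last PID) are ignored.
    """
    fields = text.split()
    if len(fields) < 3:
        raise ValueError(f"unexpected /proc/loadavg content: {text!r}")
    return LoadAvg(float(fields[0]), float(fields[1]), float(fields[2]))


def read_loadavg(path: str = "/proc/loadavg") -> LoadAvg:
    """Read and parse the load averages from a Linux host."""
    with open(path) as handle:
        return parse_loadavg(handle.read())
```

With the output above, `parse_loadavg("0.65 0.59 0.59 1/442 2256").fifteen` gives `0.59`. On Unix, Python's standard `os.getloadavg()` returns the same three values without reading the file.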

So staging and upload from/to S3 certainly use a fair amount of 1 CPU. I'm thinking along the lines of 60 minutes of a load average below 0.02. The WR runner would need to monitor it over the longer range: if the 15 minute load increases at any point in the last 4 data points, it should disregard the host. If it only falls (all points must be below the default/user-defined value), then kill the host or flag it through the interface.
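
Roughly, the check I have in mind could look like the sketch below. It is Python purely for illustration and is not WR code: `looks_idle` and its parameters are hypothetical names, the samples are assumed to be 15 minute load averages taken 15 minutes apart (so 4 samples cover 60 minutes), and "only falls" is read as "never increases", so a flat run of 0.00 still counts.

```python
from typing import Sequence


def looks_idle(samples: Sequence[float],
               threshold: float = 0.02,
               window: int = 4) -> bool:
    """Return True if the last `window` samples suggest a locked-up host.

    Assumptions (not from WR): `samples` holds 15 minute load averages,
    oldest first, one per 15 minutes. The host is flagged only when every
    sample in the window is below `threshold` and no sample is higher
    than the one before it.
    """
    if len(samples) < window:
        # Not enough history yet to make a decision.
        return False
    recent = list(samples[-window:])
    all_below = all(value < threshold for value in recent)
    never_rises = all(later <= earlier
                      for earlier, later in zip(recent, recent[1:]))
    return all_below and never_rises
```

For example, `looks_idle([0.01, 0.01, 0.00, 0.00])` is `True`, while `looks_idle([0.01, 0.00, 0.01, 0.00])` is `False` because the load rose within the window.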

An ideal option would be to flag the jobs via the interface and give the user the following options through that route:

  • Kill the host & bury jobs
  • Kill the host and retry all jobs on it at that time (regardless of retry value)
  • (If no action is taken, leave the host alone)

keiranmraine avatar Mar 24 '17 10:03 keiranmraine