Wait 60 seconds before scaling down a worker
SUMMARY
In scale testing we found that pool.up() can take on the order of 2 seconds on extremely stressed systems. That should not be normal, but it is time spent in the main dispatcher process, and it directly increases the time taken to process and run a task.
This change introduces a new constraint: a worker is not scaled down until more than 60 seconds have passed since it completed its last task. This ensures that during temporary churn (e.g. scheduled tasks), some idle workers are kept around, so a scale-up is not needed before starting a task.
In practice a worker can linger considerably longer than 60 seconds because work allocation is random; that seems acceptable for now.
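To illustrate the constraint, here is a minimal sketch of the scale-down guard. The names (`IdleWorker`, `finished_at`, `IDLE_BEFORE_SCALE_DOWN`) are illustrative only and do not correspond to the actual AWX dispatcher attributes.

```python
import time
from dataclasses import dataclass
from typing import Optional

# Illustrative constant: how long a worker must sit idle before it is reaped.
IDLE_BEFORE_SCALE_DOWN = 60  # seconds


@dataclass
class IdleWorker:
    """Stand-in for a dispatcher pool worker (hypothetical attribute names)."""
    busy: bool
    finished_at: float  # time.monotonic() timestamp of the last task completion


def should_scale_down(worker: IdleWorker, now: Optional[float] = None) -> bool:
    """Only reap a worker that is idle and finished its last task over 60s ago."""
    now = time.monotonic() if now is None else now
    if worker.busy:
        return False
    return (now - worker.finished_at) > IDLE_BEFORE_SCALE_DOWN


# A worker that finished a task 10 seconds ago is kept around, so a burst of
# periodic tasks does not force a fresh (memory-hungry) fork; one idle for
# 90 seconds is eligible for scale-down.
print(should_scale_down(IdleWorker(busy=False, finished_at=time.monotonic() - 10)))  # False
print(should_scale_down(IdleWorker(busy=False, finished_at=time.monotonic() - 90)))  # True
```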
ISSUE TYPE
- New or Enhanced Feature
COMPONENT NAME
- API
AWX VERSION
ADDITIONAL INFORMATION
How to see this behavior:
- create a JT that sleeps for a few minutes and allows concurrent jobs
- increase the log level for the dispatcher's "scaling up" / "scaling down" messages so you can see them (see the sketch after this list)
- create a workflow with about 5 root nodes using that JT, corresponding to the min_workers setting
- launch it
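For the log-level step above, something along these lines should work. This assumes the scaling messages are emitted under the `awx.main.dispatch` logger, which may differ in your version; in a real install you would normally raise the level through the Django LOGGING settings rather than in code.

```python
import logging

# Assumption: the dispatcher pool's "scaling up"/"scaling down" messages are
# logged under 'awx.main.dispatch'; verify the name against your AWX source.
dispatch_logger = logging.getLogger('awx.main.dispatch')
dispatch_logger.setLevel(logging.DEBUG)
dispatch_logger.addHandler(logging.StreamHandler())  # emit to stderr for dev use
```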
In devel, this scenario produces a constant flurry of scale-up and scale-down events.
With this patch, workers are only scaled up at the start, and aside from very occasional noise there are no scale-up or scale-down events until the workflow finishes.
The remaining noise comes from the system periodic tasks, which come and go by design. With this change we effectively keep spare workers around, so we generally don't have to scale up a worker just to run those periodic tasks. The idea is to reduce the amount of unrelated heavy work the system does while running resource-intensive playbooks: scaling up a worker can consume several hundred additional MB of memory, and this lets us avoid that.