Wait 60 seconds before scaling down a worker
SUMMARY
In scale testing we found that pool.up() can take on the order of 2 seconds on extremely stressed systems. That should not be normal, but it is time spent in the main dispatcher process, and it directly increases the time taken to process and run a task.
This change introduces a new constraint: a worker is not scaled down until more than 60 seconds have passed since it completed its last task. This ensures that during temporary churn (e.g. scheduled tasks), some idle workers are kept around, so a scale-up is not needed before starting a task.
In practice a worker can linger considerably longer than 60 seconds because work allocation is random; that seems acceptable for now.
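To illustrate the constraint, here is a minimal sketch of the scale-down guard. The names (`IdleWorker`, `finished_at`, `IDLE_BEFORE_SCALE_DOWN`) are illustrative only and do not correspond to the actual AWX dispatcher attributes.

```python
import time
from dataclasses import dataclass
from typing import Optional

# Illustrative constant: how long a worker must sit idle before it is reaped.
IDLE_BEFORE_SCALE_DOWN = 60  # seconds


@dataclass
class IdleWorker:
    """Stand-in for a dispatcher pool worker (hypothetical attribute names)."""
    busy: bool
    finished_at: float  # time.monotonic() timestamp of the last task completion


def should_scale_down(worker: IdleWorker, now: Optional[float] = None) -> bool:
    """Only reap a worker that is idle and finished its last task over 60s ago."""
    now = time.monotonic() if now is None else now
    if worker.busy:
        return False
    return (now - worker.finished_at) > IDLE_BEFORE_SCALE_DOWN


# A worker that finished a task 10 seconds ago is kept around, so a burst of
# periodic tasks does not force a fresh (memory-hungry) fork; one idle for
# 90 seconds is eligible for scale-down.
print(should_scale_down(IdleWorker(busy=False, finished_at=time.monotonic() - 10)))  # False
print(should_scale_down(IdleWorker(busy=False, finished_at=time.monotonic() - 90)))  # True
```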
ISSUE TYPE
- New or Enhanced Feature
COMPONENT NAME
- API
AWX VERSION
ADDITIONAL INFORMATION
How to see this behavior:
- create a JT that sleeps for a few minutes and allows concurrent jobs
- increase the log level for the dispatcher's "scaling up" / "scaling down" messages so you can see them (see the sketch after this list)
- create a workflow with about 5 root nodes using that JT, corresponding to the min_workers setting
- launch it
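For the log-level step above, something along these lines should work. This assumes the scaling messages are emitted under the `awx.main.dispatch` logger, which may differ in your version; in a real install you would normally raise the level through the Django LOGGING settings rather than in code.

```python
import logging

# Assumption: the dispatcher pool's "scaling up"/"scaling down" messages are
# logged under 'awx.main.dispatch'; verify the name against your AWX source.
dispatch_logger = logging.getLogger('awx.main.dispatch')
dispatch_logger.setLevel(logging.DEBUG)
dispatch_logger.addHandler(logging.StreamHandler())  # emit to stderr for dev use
```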
In devel, this scenario produces a constant flurry of scale-up and scale-down events.
With this patch, workers are only scaled up at the start, and aside from very occasional noise there are no scale-up or scale-down events until the workflow finishes.
The remaining noise comes from the system periodic tasks, which come and go by design. With this change we effectively keep spare workers around, so we generally don't have to scale up a worker just to run those periodic tasks. The idea is to reduce the amount of unrelated heavy work the system does while running resource-intensive playbooks: scaling up a worker can consume several hundred additional MB of memory, and this lets us avoid that.