
[🧹 CHORE]: Autoscale. First look, bugs, proposals

Open trin4ik opened this issue 10 months ago • 11 comments

No duplicates 🥲.

  • [X] I have searched for a similar issue.

What should be improved or cleaned up?

Starting with 2024.3.0 we have autoscale workers 😍 It's a useful feature and, of course, I immediately went to test it out. After a little discussion in Discord (https://discord.com/channels/538114875570913290/1314816983090593803), we came to the conclusion that some things still need to be improved.

1. allocate_timeout is overloaded at the moment.

Before autoscale, allocate_timeout was responsible only for the worker startup timeout (https://docs.roadrunner.dev/docs/error-codes/allocate-timeout). Now allocate_timeout is also used as a debounce when spawning new workers in autoscale, i.e. the pool waits for allocate_timeout before EventNoFreeWorkers fires, and only then adds workers. The obvious problem is that these should be separate options in the configuration, since the timeout for creating a new worker and the delay before creating new workers in the pool are different values.

The default allocate_timeout is 60s. For worker startup that might be okay, but as a delay before allocating new dynamic workers in the pool it's too long. For example, if all workers are busy and a new lightweight request comes in, the user waits the full allocate_timeout (60 seconds) before the pool spawns new workers to serve that request.

It is suggested that allocate_timeout be split into two options.

  1. allocate_timeout, exactly what it was before.
  2. dynamic_allocator.debounce_timeout, the time to wait while all workers are busy before the EventNoFreeWorkers event fires. debounce_timeout is a working title and may change.

Questions for the community:

  1. Name of debounce_timeout?
  2. Any suggestions and comments.

2. Sometimes we need to spawn new workers before EventNoFreeWorkers

If our workers have a long warmup (e.g. they need to open a big SQLite database or load an AI model), we want to spawn new workers in advance. We're ready to accept the overhead, as long as it's delay-free for the user. In this case, we want to control spawning new workers before EventNoFreeWorkers fires, for example, when there are fewer than 2 free workers (status ready).

The suggestion is a new option, dynamic_allocator.min_ready_workers (working title). If we have min_ready_workers: 2 and the pool has fewer than 2 workers in ready status, the pool fires EventMinReadyWorkers and spawns new workers according to the configuration. Of course, the EventMinReadyWorkers event should also respect debounce_timeout.
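
The threshold check could be sketched roughly as follows. This is an assumption-laden illustration, not RoadRunner's implementation: `Event`, `NextEvent`, and the string values are hypothetical, and keeping EventNoFreeWorkers for the zero-ready case is just one possible answer to the question below about whether a new event is needed.

```go
package main

import "fmt"

// Event is a hypothetical type for illustration; the event names
// mirror this issue, the string values are arbitrary.
type Event string

const (
	EventNone            Event = ""
	EventNoFreeWorkers   Event = "EventNoFreeWorkers"
	EventMinReadyWorkers Event = "EventMinReadyWorkers"
)

// NextEvent decides which event (if any) the pool would fire, given
// the number of workers in ready status and the proposed
// min_ready_workers threshold. Debouncing is deliberately left out.
func NextEvent(ready, minReady int) Event {
	switch {
	case ready == 0:
		// No ready workers at all: the existing event still applies.
		return EventNoFreeWorkers
	case ready < minReady:
		// Some workers are ready, but fewer than the threshold:
		// spawn ahead of demand.
		return EventMinReadyWorkers
	default:
		return EventNone
	}
}

func main() {
	const minReady = 2 // min_ready_workers: 2, as in the example above
	for ready := 0; ready <= 3; ready++ {
		fmt.Printf("ready=%d event=%q\n", ready, NextEvent(ready, minReady))
	}
}
```

With min_ready_workers set to 0, this sketch falls back to today's behaviour: nothing fires until no worker is ready.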

Questions for the community:

  1. Name of min_ready_workers?
  2. Do we need a new event like EventMinReadyWorkers, or should we just fire EventNoFreeWorkers?
  3. Any suggestions and comments.

Bugs:

  1. https://github.com/roadrunner-server/roadrunner/issues/2092
  2. #2111

trin4ik, Dec 07 '24 18:12