
Low capacity jobs blocking high capacity jobs with higher prio

Open ultra-sonic opened this issue 2 years ago • 6 comments

Hi Timur,

Today I am reporting an issue that has been bugging us for a while and is now becoming increasingly important.

The scenario is simple: render nodes all have a total capacity of 1100. Job 1 has prio 50 and 1000 tasks that need a capacity of 500 each. Job 1 has already started and its tasks finish asynchronously, so at most 600 capacity is ever free, which prevents higher-capacity tasks from starting. Job 2 has prio 200 and 1 task that needs 1000 capacity, but it can never start because there is never enough capacity left.
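The scenario above can be reproduced with a tiny simulation. This is only a sketch of a greedy "start whatever fits, highest priority first" scheduler on a single node, not afserver's actual scheduling code; the function name, the tick-based timing and the two-tick task duration are assumptions made for illustration.

```python
def big_task_start_tick(node_capacity=1100, small_need=500, big_need=1000,
                        small_tasks=1000, small_duration=2):
    """Return the tick at which the high-priority 1000c task can finally start.

    Hypothetical model: one node, time advances in ticks, and job 1 is
    already running two 500c tasks that finish at different times.
    """
    running = [1, 2]                      # remaining ticks of running 500c tasks
    small_left = small_tasks - len(running)
    tick = 0
    while True:
        # Advance time by one tick and drop tasks that have finished.
        running = [t - 1 for t in running if t > 1]
        free = node_capacity - small_need * len(running)
        # The high-priority job is considered first, but it must fit.
        if free >= big_need:
            return tick
        # Otherwise the freed capacity is refilled by the low-priority job.
        while small_left > 0 and free >= small_need:
            running.append(small_duration)
            small_left -= 1
            free -= small_need
        tick += 1


print(big_task_start_tick(small_tasks=6))     # small example
print(big_task_start_tick(small_tasks=1000))  # the scenario from the report
```

In this model the free capacity never exceeds 600 while job 1 still has tasks to hand out, so the prio 200 task only starts once job 1 has drained completely.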

I think this is something that you know about and I can imagine that you already have a solution to this, do you?

Cheers Oli @sebastianelsner

ultra-sonic, Apr 20 '22 07:04

Hi Oli, (sorry for the delay)

Unfortunately, there is no general solution for such situations. If your render nodes have 1100 capacity, 500c tasks will never allow 1000c tasks to start. And for now I do not see a simple solution for when 500c tasks should "go on pause". But we do use low-capacity tasks at work. Our common render capacity is 1500. The common task capacity is 1000. Tasks that have less than 500 capacity should be very lightweight; such tasks will not take the entire farm, or they can, but only for a short period of time. Sometimes a user has "very heavy" tasks that can't run in parallel even with light tasks. In this case the user can set the capacity to 1500 to take all of the render capacity.

timurhai, Apr 25 '22 08:04

Hello again... This issue is becoming increasingly important at RISE because we are about to run more than 1 task per host by default. The scenario will look like this: host capacity is equal to the number of cores on the host. Our render farm consists of a wild mix of 8, 12, 32, 40, 64, 128 and 256 core machines, roughly 800 nodes in total.

I have the following render jobs in the farm:
easy - capacity 8
medium - capacity 64
heavy - capacity 256

As described earlier, the problem is that if the heavy job is submitted after the easy and medium jobs, it will not start because the 256 core render nodes will be busy working on, let's say, 4 tasks of the medium job. Since not all 4 tasks will finish at the same time, there will never be enough capacity until all easy and medium jobs are finished.

My temp fix for this would be to limit the max. tasks on all 256 core nodes (that match the job's hostmask) as long as there are heavy jobs with status RDY. This dynamic limiting could be done via a cron job that runs every minute.
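The decision such a cron job would have to make each minute could look roughly like this. It is a sketch over plain dictionaries: the field names (`name`, `capacity`, `state`, `hosts_mask`), the helper name `plan_max_tasks` and the limit of 1 task are all assumptions for illustration, and reading the farm state or applying the limits through the real Afanasy API is deliberately left out.

```python
import re


def plan_max_tasks(hosts, jobs, heavy_capacity=256, limit=1):
    """Return {host_name: max_tasks} for big hosts that should be throttled.

    A host is throttled only while at least one heavy job is in state RDY
    and that job's hosts mask matches the host name.
    """
    ready_heavy = [
        job for job in jobs
        if job["state"] == "RDY" and job["capacity"] >= heavy_capacity
    ]
    limits = {}
    for host in hosts:
        if host["capacity"] < heavy_capacity:
            continue  # too small to ever run a heavy task
        if any(re.fullmatch(job["hosts_mask"], host["name"]) for job in ready_heavy):
            limits[host["name"]] = limit
    return limits


# Hypothetical farm snapshot.
hosts = [
    {"name": "big01", "capacity": 256},
    {"name": "big02", "capacity": 256},
    {"name": "mid01", "capacity": 64},
]
jobs = [
    {"name": "medium", "capacity": 64, "state": "RUN", "hosts_mask": ".*"},
    {"name": "heavy", "capacity": 256, "state": "RDY", "hosts_mask": "big.*"},
]
print(plan_max_tasks(hosts, jobs))
```

Hosts missing from the returned mapping would get their normal limit back on the next run, so the throttling disappears by itself once no heavy job is waiting.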

I can imagine that the above temp fix could be integrated into afserver much more elegantly, but I realize that this takes some time, and maybe you can come up with a much smarter solution for this issue. Can you? 😉

One thing that afserver can do, which is not that easy to re-implement in a cron job, is limiting the max. tasks only on a specific number of hosts based on the "need" of the heavy job, because I do want medium jobs to be scheduled on the 256 core nodes if their priority is a lot higher. If we don't take the prio into account, then low prio heavy jobs would take away resources from high prio medium jobs. Does that make sense?
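One possible reading of this "need plus priority" rule, written as a sketch: reserve only as many big hosts as the waiting heavy jobs have ready tasks, and only for heavy jobs whose priority is not below that of the competing smaller jobs. The field names, the helper `reserve_hosts`, the simple priority comparison and the choice to drain the least busy hosts first are assumptions, not afserver behaviour.

```python
def reserve_hosts(big_hosts, heavy_jobs, other_jobs):
    """Return the names of big hosts to drain for waiting heavy tasks."""
    # Highest priority among the competing (easy/medium) jobs.
    top_other = max((job["priority"] for job in other_jobs), default=-1)
    # Count ready heavy tasks only for jobs that are not outranked.
    needed = sum(
        job["ready_tasks"] for job in heavy_jobs
        if job["priority"] >= top_other
    )
    # Prefer hosts with the fewest running tasks: they become free soonest.
    candidates = sorted(big_hosts, key=lambda host: host["running_tasks"])
    return [host["name"] for host in candidates[:needed]]


big_hosts = [
    {"name": "big01", "running_tasks": 4},
    {"name": "big02", "running_tasks": 1},
    {"name": "big03", "running_tasks": 2},
]
heavy_jobs = [{"name": "heavy", "priority": 100, "ready_tasks": 2}]

# Heavy job outranks the medium job: two hosts are reserved.
print(reserve_hosts(big_hosts, heavy_jobs, [{"name": "medium", "priority": 50}]))
# Medium job has a higher priority: nothing is reserved.
print(reserve_hosts(big_hosts, heavy_jobs, [{"name": "medium", "priority": 200}]))
```

A real implementation would probably need a softer comparison than `>=` to express "a lot higher", but the sketch shows why the decision needs both the job priorities and the number of waiting heavy tasks.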

cheers Oli

ultra-sonic, Mar 02 '23 08:03

Hi Timur, sorry to bother you again...could you think of a way to implement this? cheers Oli

ultra-sonic, Mar 16 '23 06:03

Hi Oliver! Sorry, I did not write any answer. But I am smoking this!

timurhai, Mar 16 '23 10:03

Hi Timur, by "smoking this" do you mean you are thinking of a solution, or is this impossible to implement? We already have a name for it: "The capacity dilemma" 😉

ultra-sonic, Mar 27 '23 15:03

I am thinking about the solution.

timurhai, Mar 29 '23 17:03