
Keda polling doesn't respect license count while queuing azure pipelines

Open jashan05 opened this issue 11 months ago • 15 comments

Report

We are a centralised team providing KEDA agents for the whole organisation within a single cluster, which means we are scaling a lot of KEDA jobs.

Issue: We have encountered an issue where, if an organisation has a parallel license count of 1 and its users queue 100 jobs, KEDA will already start reserving IPs for all of the queued jobs. This causes resource locks and also causes our subnets to run out of IP addresses.

Expected Behavior

While polling and queuing pods, KEDA should respect the parallel license count available at the organization level in Azure DevOps.

Actual Behavior

KEDA doesn't respect the license count and already assigns IPs to the pods it creates, even though Azure Pipelines cannot process the jobs because there aren't enough available licenses to handle all of them.

Steps to Reproduce the Problem

  1. Create an agent pool in Azure DevOps and a corresponding namespace in a cluster running KEDA
  2. Set the parallel license count for self-hosted agents in the Azure DevOps organisation to a small number, e.g. 1
  3. Queue a dummy pipeline 10 times with a sleep in it, e.g. for 10 minutes (see the sketch below)
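
For step 3, a minimal sketch of queuing the dummy pipeline repeatedly through the Azure DevOps Runs REST API; the org, project, pipeline id and token below are placeholders, and the request body is left empty assuming the pipeline's defaults are fine:

import requests

def queue_dummy_pipeline(org, project, pipeline_id, oauth_token, times=10):
    # Queue the same pipeline several times so that more jobs are waiting
    # than the organisation's parallel license count allows
    headers = {'Authorization': f'Bearer {oauth_token}',
               'Content-Type': 'application/json'}
    url = f'https://dev.azure.com/{org}/{project}/_apis/pipelines/{pipeline_id}/runs?api-version=7.0'
    for _ in range(times):
        requests.post(url, headers=headers, json={}).raise_for_status()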

Logs from KEDA operator

example

KEDA Version

2.12.0

Kubernetes Version

< 1.26

Platform

Amazon Web Services

Scaler Details

Azure Pipelines

Anything else?

No response

jashan05 avatar Mar 05 '24 09:03 jashan05

Hello, currently KEDA doesn't check the license count and I'm not sure whether it should. How should KEDA handle the overcommitting? I mean, imagine that you have 100 slots and you deploy 4 ScaledJobs with max 40 each, for example: which one takes preference if all of them need more than 25 replicas?

JorTurFer avatar Mar 09 '24 00:03 JorTurFer

@JorTurFer Yes, you are right if we have a single ScaledJob spec. But let's consider the following scenario:

No. of ScaledJob specs: 8, with different flavours of images and different demands
License count: 100

That means I have to set maxReplicaCount = 100 for each ScaledJob spec, since users can use any one of them and it is hard to predict which. But this means KEDA is still querying for and queuing up to 800 pods, and if the license count is 100 it is blocking up to 800 IPs.

Best Regards Jashan Sidhu

jashan05 avatar Mar 14 '24 08:03 jashan05

Yeah, I get your point, but I still don't see how to solve the overcommitting. Let's say that you have 5 ScaledJobs, each with max 100 because that's the license count, but all of them require 100 because you are in a peak. How should KEDA balance the requirements between them? I mean, you need 500 but you can have just 100, which means KEDA has to decide the priorities and weights of each ScaledJob. That's not just an autoscaling decision but a management decision.

Although we could measure the number of pods across all the ScaledJobs, now imagine that one of the ScaledJobs is locking all the licenses and then you have other jobs queued for other agents. What should KEDA do here? Kill some jobs to make space for the others? Lock them until the others finish? I mean, there are several decisions here unrelated to the autoscaling itself.

WDYT @tomkerkhove @zroubalik ?

JorTurFer avatar Mar 14 '24 10:03 JorTurFer

Hello @JorTurFer, I think it should lock them until the others are finished, i.e. until (available license count - used license count) > 0. Otherwise it is always going to commit to more resources than can actually run.

Best Regards Jashan Sidhu

jashan05 avatar Mar 20 '24 19:03 jashan05

I think it should lock them until others are finished

Even when all the licenses are locked by a single pool? It could be risky IMHO, but I'd like to see other folks' thoughts. @zroubalik @tomkerkhove @Eldarrin ?

JorTurFer avatar Mar 25 '24 22:03 JorTurFer

The only option I see is for KEDA to report the maximum allowed number of licenses to Kubernetes to prevent it from adding more jobs, if we can even do that.

tomkerkhove avatar Mar 26 '24 08:03 tomkerkhove

The only possible scenario I can see is that the scalers use a shared state model, but the problem here is that KEDA is just queuing what ADO Pipelines says to queue. ADO says 10 agents are required, KEDA queues 10 agents. If ADO has a license issue that means you can't run 10 agents, then why is ADO stating that 10 agents are required?

So KEDA is just doing what it's told, and any solution we provide is actually just fixing ADO.

Eldarrin avatar Mar 26 '24 08:03 Eldarrin

@JorTurFer No, the licenses are not locked by a single agent pool. With a single API call we can check the used and free license count at the org level.

@tomkerkhove Yes I agree.

@Eldarrin I think KEDA does not have the same behaviour as Azure DevOps at the moment. There can be jobs in the queue, but Azure DevOps always checks licenses before assigning jobs to an agent; if there aren't enough licenses, the jobs sit in the queue. IMHO, if KEDA also does that, it solves the problem. To achieve this, KEDA needs to check the license count along with the queue and then decide whether to add a job or not.

jashan05 avatar Mar 26 '24 14:03 jashan05

The problem is with state. KEDA scalers are stateless: they just check the length of the queue and create enough agents for it; it is not for KEDA to check whether items should be in the queue. Also, being stateless, even if we checked the licence count each ScaledJob would spin up to a maximum of the license count; that is the same behaviour as you get by just setting maxReplicaCount = licence count.

HTH

Eldarrin avatar Mar 26 '24 14:03 Eldarrin

The problem with setting maxReplicaCount = licence count is that if you have multiple (n) ScaledJobs with different demands, then KEDA is making n x maxReplicaCount calls to Azure DevOps to check the queue.

What I think KEDA should do is make 2 API calls every time: check the queued jobs and the license details, and start the pods accordingly.

E.g., to get the license details, below is the API call (in Python):

import requests

def get_license_count(org, oauth_token):
    # oauth_token: an Azure DevOps access token with permission to read resource usage
    headers = {'Authorization': f'Bearer {oauth_token}',
               'accept': 'application/json;api-version=7.0-preview',
               'Content-type': 'application/json'}
    org_license = requests.get(url=f'https://dev.azure.com/{org}/_apis/distributedtask/resourceusage?parallelismTag=Private&poolIsHosted=false&includeRunningRequests=true',
                               headers=headers).json()
    return {'used_count': org_license['usedCount'],
            'total_license_count': org_license['resourceLimit']['totalCount']}
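
As a rough sketch of combining the two checks (not actual KEDA code; the queued job count is assumed to come from the scaler's existing queue query), the number of jobs started could be capped at the free license count:

def get_target_job_count(org, oauth_token, queued_job_count):
    # Only start as many jobs as there are free parallel licenses, so pods
    # (and their IPs) are created only when Azure DevOps can actually run them
    license_info = get_license_count(org, oauth_token)
    free_licenses = max(0, license_info['total_license_count'] - license_info['used_count'])
    return min(queued_job_count, free_licenses)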

jashan05 avatar Mar 27 '24 09:03 jashan05

I think that makes sense though. Any concerns about adding this call?

tomkerkhove avatar Apr 02 '24 05:04 tomkerkhove

Won't we hit rate limiting? Making it an optional parameter could be a good idea.

JorTurFer avatar Apr 02 '24 06:04 JorTurFer

Optional would be good; it will double the API calls, so rate limits are a concern if you have many scaler variants running.

Eldarrin avatar Apr 02 '24 08:04 Eldarrin

Hello everyone,

Could you please let me know how we can proceed on this? Are there any plans to add this functionality?

Best Regards Jashan

jashan05 avatar May 29 '24 12:05 jashan05

Could you please let me know how we can proceed on this? Are there any plans to add this functionality?

We agreed on the approach, but for the implementation we probably need someone willing to contribute it :)

JorTurFer avatar Jun 24 '24 13:06 JorTurFer