
Resource requests for multiple jobs limited by first one submitted

Open psschwei opened this issue 1 year ago • 1 comments

Steps to reproduce the problem

Run two serverless jobs concurrently, the first one using 1 worker and the second one using 3.

Using the basic getting-started running_program.ipynb notebook, make the following updates:

To ensure jobs run concurrently, add a pause in source_files/pattern.py around L19:

import time
time.sleep(120)

Then, update the "running the pattern" section of the notebook to launch the jobs with different resource configurations:

from quantum_serverless import Configuration
job = serverless.run("my-first-pattern")
job2 = serverless.run("my-first-pattern", config=Configuration(workers=3))

(Note: I've also tried this with auto-scaling set to true on both jobs, with no change in behavior.)
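
For reference, the auto-scaling attempt would look roughly like the snippet below. This is a sketch only: it assumes the client's Configuration exposes an auto_scaling flag, and the worker counts are just the ones used above.

from quantum_serverless import Configuration

# assumption: Configuration accepts an auto_scaling flag alongside workers
job = serverless.run("my-first-pattern", config=Configuration(workers=1, auto_scaling=True))
job2 = serverless.run("my-first-pattern", config=Configuration(workers=3, auto_scaling=True))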

What is the current behavior?

A single Ray cluster with two pods (one head, one worker) is launched, and both workloads are run on that cluster.

$ k get po
NAME                                 READY   STATUS    RESTARTS   AGE
c-mockuser-a1a35d28-head-4pf5d       2/2     Running   0          95s
c-mockuser-a1a35d28-worker-g-m7qgm   1/1     Running   0          95s
gateway-796cfb4d5b-cbslp             1/1     Running   0          6m50s
kuberay-operator-654bf75dcb-4tvcw    1/1     Running   0          6m50s
postgresql-0                         1/1     Running   0          6m50s
prepuller-5v2xg                      1/1     Running   0          6m50s
scheduler-fbb99cb54-mlxrt            1/1     Running   0          6m50s

What is the expected behavior?

At a minimum, I would expect the cluster to be resized to add the additional requested workers.

I'm not sure whether the better behavior would be to start a new Ray cluster for the additional job, given the differing resource requests. I could see arguments in favor of both approaches...

How to Fix

When a job is submitted, we check to see if there's an existing compute resource:

https://github.com/Qiskit-Extensions/quantum-serverless/blob/7dfe5bbe644b15cca75691639fb898ce0f2da53e/gateway/api/schedule.py#L43-L45

and if so, we reuse it:

https://github.com/Qiskit-Extensions/quantum-serverless/blob/7dfe5bbe644b15cca75691639fb898ce0f2da53e/gateway/api/schedule.py#L49-L53
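
For context, the gist of that logic is something like the simplified sketch below. This is not the actual schedule.py code; get_idle_compute_resource, create_compute_resource, and execute_job_on are hypothetical helper names used only for illustration.

# simplified sketch of the scheduling decision described above
# (hypothetical helpers; the real logic lives in gateway/api/schedule.py)
def schedule(job):
    compute_resource = get_idle_compute_resource()  # is there already a Ray cluster?
    if compute_resource is None:
        # no existing cluster: create one sized from this job's Configuration
        compute_resource = create_compute_resource(job.config)
    # otherwise the existing cluster is reused as-is, even if the new job
    # requested more workers than it was created with
    return execute_job_on(compute_resource, job)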

In some use cases, this may not be the ideal behavior, so we may want to revisit this decision...

psschwei avatar Mar 01 '24 15:03 psschwei

Probably @IceKhan13 can give us better insight, but if I remember correctly, something that surprised me in the past is that Ray tries to allocate the workload wherever it can. So if you have two jobs that will consume 4 CPUs and you have a cluster with 4 CPUs available, it will try to run those two jobs on that cluster instead of creating two clusters or adding more workers. So maybe something similar is happening in this case.
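
That packing behavior can be seen with Ray alone, independent of the gateway. A minimal sketch, assuming a local Ray instance with 4 CPUs:

import time
import ray

ray.init(num_cpus=4)  # single local "cluster" with 4 CPUs

@ray.remote(num_cpus=2)
def workload(name):
    time.sleep(5)
    return name

# both tasks fit within the 4 available CPUs, so Ray schedules them on the
# same cluster concurrently rather than waiting for extra capacity
print(ray.get([workload.remote("job1"), workload.remote("job2")]))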

That said, I've never tried to run a workload with that configuration. I think it's a good idea to start testing different configurations for the workloads and monitoring the behavior.

Tansito avatar Mar 06 '24 21:03 Tansito

This will be fixed by #1337

psschwei avatar Jul 19 '24 15:07 psschwei