[Bug]: Many RunPod spot offers are unavailable
Steps to reproduce
> cat .dstack.yml
type: dev-environment
ide: vscode
spot_policy: spot
backends:
- runpod
> dstack apply
Actual behaviour
dstack tries only the first 15 offers, but often many or all of them are unavailable, so the run may end with FAILED_TO_START_DUE_TO_NO_CAPACITY.
> dstack apply
Configuration .dstack.yml
Project ilya
User admin
Pool default-pool
Min resources 2..xCPU, 8GB.., 100GB.. (disk)
Max price -
Max duration 6h
Spot policy spot
Retry policy no
Creation policy reuse-or-create
Termination policy destroy-after-idle
Termination idle time 5m
# BACKEND REGION INSTANCE RESOURCES SPOT PRICE
1 runpod EU-SE-1 NVIDIA RTX A4000 9xCPU, 50GB, 1xA4000 (16GB), 100.0GB (disk) yes $0.09
2 runpod CA-MTL-1 NVIDIA RTX A4000 9xCPU, 50GB, 1xA4000 (16GB), 100.0GB (disk) yes $0.09
3 runpod EUR-IS-1 NVIDIA RTX A4000 16xCPU, 31GB, 1xA4000 (16GB), 100.0GB (disk) yes $0.09
...
Shown 3 of 331 offers, $16 max
Submit a new run? [y/n]: y
curvy-parrot-1 provisioning completed (terminating)
All provisioning attempts failed. This is likely due to cloud providers not having enough capacity. Check CLI and server logs for more details
Expected behaviour
Unavailable offers are not displayed in the offers list, and provisioning does not fail if RunPod has spare capacity (which it probably does, just not among the 15 offers that are currently tried).
dstack version
master
Server logs
INFO dstack._internal.server.services.backends:404 Requesting instance offers from backends: ['runpod']
DEBUG dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX A4000 in
runpod/EU-SE-1 for $0.0900 per hour
[13:17:06] WARNING dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX A4000 launch in
runpod/EU-SE-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again later.')
DEBUG dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX A4000 in
runpod/CA-MTL-1 for $0.0900 per hour
WARNING dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX A4000 launch in
runpod/CA-MTL-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again
later.')
DEBUG dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX A4000 in
runpod/EUR-IS-1 for $0.0900 per hour
[13:17:07] DEBUG dstack._internal.server.background.tasks.process_runs:87 run(a1d6ed)aana-tests: processing run
WARNING dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX A4000 launch in
runpod/EUR-IS-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again
later.')
DEBUG dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX 4000 Ada Generation in
runpod/EUR-IS-1 for $0.1000 per hour
WARNING dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX 4000 Ada Generation launch in
runpod/EUR-IS-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again
later.')
DEBUG dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX 4000 Ada Generation in
runpod/EU-RO-1 for $0.1000 per hour
[13:17:08] WARNING dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX 4000 Ada Generation launch in
runpod/EU-RO-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again later.')
DEBUG dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX A4500 in
runpod/EU-RO-1 for $0.1000 per hour
[13:17:09] WARNING dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX A4500 launch in
runpod/EU-RO-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again later.')
DEBUG dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX A5000 in
runpod/EU-SE-1 for $0.1100 per hour
DEBUG dstack._internal.server.background.tasks.process_running_jobs:249 job(487481)aana-tests-0-0: process running job, age=0:19:50.423037
WARNING dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX A5000 launch in
runpod/EU-SE-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again later.')
DEBUG dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX A5000 in
runpod/CA-MTL-1 for $0.1100 per hour
[13:17:10] WARNING dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX A5000 launch in
runpod/CA-MTL-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again
later.')
DEBUG dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX 2000 Ada Generation in
runpod/EUR-IS-1 for $0.1400 per hour
DEBUG dstack._internal.server.app:213 Processed request POST http://0.0.0.0:3000/api/project/ilya/runs/get in 0.038164s
WARNING dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX 2000 Ada Generation launch in
runpod/EUR-IS-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again
later.')
DEBUG dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX 2000 Ada Generation in
runpod/EU-RO-1 for $0.1400 per hour
[13:17:11] WARNING dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX 2000 Ada Generation launch in
runpod/EU-RO-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again later.')
DEBUG dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX A4000 in
runpod/EU-SE-1 for $0.1600 per hour
[13:17:12] WARNING dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX A4000 launch in
runpod/EU-SE-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again later.')
DEBUG dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX A4000 in
runpod/CA-MTL-1 for $0.1600 per hour
WARNING dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX A4000 launch in
runpod/CA-MTL-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again
later.')
DEBUG dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX A4000 in
runpod/EUR-IS-1 for $0.1600 per hour
[13:17:13] WARNING dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX A4000 launch in
runpod/EUR-IS-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again
later.')
DEBUG dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX A4000 in
runpod/CA-MTL-1 for $0.1800 per hour
WARNING dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX A4000 launch in
runpod/CA-MTL-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again
later.')
DEBUG dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX A4000 in
runpod/EUR-IS-1 for $0.1800 per hour
DEBUG dstack._internal.server.background.tasks.process_running_jobs:249 job(487481)aana-tests-0-0: process running job, age=0:19:55.001004
[13:17:14] WARNING dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX A4000 launch in
runpod/EUR-IS-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again
later.')
DEBUG dstack._internal.server.background.tasks.process_submitted_jobs:233 job(c4919d)curvy-parrot-1-0-0: provisioning failed
[13:17:15] DEBUG dstack._internal.server.background.tasks.process_runs:87 run(398045)curvy-parrot-1: processing run
INFO dstack._internal.server.background.tasks.process_runs:330 run(398045)curvy-parrot-1: run status has changed SUBMITTED -> TERMINATING
Additional information
Reproducibility depends on current offer availability in RunPod. If you can't reproduce, try running multiple configurations in parallel to exhaust availability artificially. Even if your run does start, check the server logs to see how many offers were skipped. Skipping too many offers is not expected.
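To check how many offers were skipped, the failed launch attempts can be counted directly from the server output. The following is a minimal sketch, assuming the log format shown under "Server logs" above (WARNING entries of the form "<instance> launch in <backend>/<region> failed"). The function name count_failed_launches is made up for illustration and is not part of dstack.

```python
import re
from collections import Counter

# Matches entries like:
#   "...curvy-parrot-1-0-0: NVIDIA RTX A4000 launch in runpod/EU-SE-1 failed: ..."
# The pattern is an assumption based on the log excerpt in this issue.
FAILED_LAUNCH = re.compile(
    r": (?P<instance>[^:]+?) launch in (?P<backend>\w+)/(?P<region>[\w-]+) failed"
)


def count_failed_launches(log_text: str) -> Counter:
    """Count failed launch attempts per (instance, region) pair."""
    # The terminal wraps long log lines, so collapse all whitespace first.
    normalized = " ".join(log_text.split())
    counts: Counter = Counter()
    for match in FAILED_LAUNCH.finditer(normalized):
        counts[(match["instance"], match["region"])] += 1
    return counts
```

Summing the values of the returned counter gives the total number of offers that were tried and skipped for the run.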
I think an easy solution would be to try more offers. The limit of 15 used to make a lot of sense when the server could not process multiple jobs in parallel. Now we can allow a submitted job to be processed for a longer time. Still, to avoid trying thousands of offers, there should be a limit. I think something like 50 should work, but it needs some experimentation with RunPod to see if the issue gets fixed.
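The idea of trying more offers under a fixed cap can be sketched roughly as follows. This is not the actual dstack implementation: Offer, NoCapacityError, provision_first_available, and the launch callback are hypothetical names used only to illustrate the loop, and the default of 50 is the value suggested above, not a tested one.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


class NoCapacityError(Exception):
    """Hypothetical stand-in for a backend's "no instances available" error."""


@dataclass(frozen=True)
class Offer:
    instance: str
    region: str
    price: float


def provision_first_available(
    offers: Iterable[Offer],
    launch: Callable[[Offer], str],
    max_offers: int = 50,
) -> Optional[str]:
    """Try offers from cheapest to most expensive, up to max_offers attempts.

    Returns whatever `launch` returns for the first offer that succeeds,
    or None if every attempted offer had no capacity.
    """
    ordered = sorted(offers, key=lambda o: o.price)
    for offer in ordered[:max_offers]:
        try:
            return launch(offer)
        except NoCapacityError:
            continue  # offer is unavailable, move on to the next one
    return None
```

With a cap of 15, a run fails as soon as the 15 cheapest offers are all unavailable. Raising the cap trades a longer provisioning attempt for a better chance of reaching an offer that has capacity.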
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale. Please reopen the issue if it is still relevant.
@jvstme is this issue still relevant?
Even more relevant now with the introduction of the Community Cloud. It seems that many spot GPUs are not actually available at the price returned by the RunPod API and can only be deployed at a higher bid. I've reached out to the RunPod team to see if this is the case and if the accuracy of spot prices in the API can be improved.
RunPod has acknowledged this issue; we are waiting for a fix on their end.
Just adding an extra data point that I believe is relevant here, but from a different backend.
My company has drastically limited the types of instances I can run on Azure, blocking, I would guess, 90% of the available instance types.
As such, when I create a task with spot_policy set to 'auto' and the spot version of one of the few types I do have access to is unavailable, dstack begins an exhaustive search through tons of more expensive spot types which are unavailable to me, but still cheaper than the on-demand version of the type I do have access to.
Because the offer check limit (which now appears to be 25, judging by my logs) is insufficient to dig through all the cheaper instance types, my job never begins and I get stuck on FAILED_TO_START_DUE_TO_NO_CAPACITY.
I think any fix on the dstack side to this bug would fix mine as well, so I didn't want to create a duplicate bug.
Obviously, I have at least one potential workaround: poke my IT department to free up access to more spot types. But for the moment my only recourse is to abandon spot_policy: auto for spot_policy: on-demand to ensure my jobs are run.
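The failure mode described in the comment above can be illustrated with a small sketch: when the attempt limit is applied before inaccessible instance types are excluded, the accessible on-demand offer never makes it into the list of offers to try. Offer and offers_to_try are hypothetical names used only for this illustration and do not reflect how dstack selects offers internally.

```python
from typing import Iterable, List, NamedTuple, Set


class Offer(NamedTuple):
    instance_type: str
    spot: bool
    price: float


def offers_to_try(
    offers: Iterable[Offer], allowed_types: Set[str], limit: int = 25
) -> List[Offer]:
    """Return up to `limit` cheapest offers, keeping only permitted instance types.

    Filtering happens before the limit is applied, so offers for blocked
    types do not consume attempts.
    """
    permitted = [o for o in offers if o.instance_type in allowed_types]
    return sorted(permitted, key=lambda o: o.price)[:limit]
```

If the filter is skipped and the limit is applied to the full price-sorted list, 25 or more blocked spot offers priced below the permitted on-demand offer are enough to push it out of reach.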
@jvstme Also, we allow limiting instance_types directly in a run configuration for Azure too, right?
@ASmedberg-woolpert, this ticket is specific to the RunPod backend, and we expect it to be fixed on the RunPod side.
Regarding the issue you're experiencing with Azure, I've suggested some solutions on Discord.