
[Bug]: Many RunPod spot offers are unavailable

Open jvstme opened this issue 1 year ago • 11 comments

Steps to reproduce

> cat .dstack.yml 
type: dev-environment
ide: vscode
spot_policy: spot
backends:
  - runpod

> dstack apply

Actual behaviour

dstack tries the first 15 offers, but often many or all of them are unavailable, which can lead to FAILED_TO_START_DUE_TO_NO_CAPACITY.

> dstack apply
 Configuration          .dstack.yml                
 Project                ilya                           
 User                   admin                          
 Pool                   default-pool                   
 Min resources          2..xCPU, 8GB.., 100GB.. (disk) 
 Max price              -                              
 Max duration           6h                             
 Spot policy            spot                           
 Retry policy           no                             
 Creation policy        reuse-or-create                
 Termination policy     destroy-after-idle             
 Termination idle time  5m                             

 #  BACKEND  REGION    INSTANCE          RESOURCES                                     SPOT  PRICE   
 1  runpod   EU-SE-1   NVIDIA RTX A4000  9xCPU, 50GB, 1xA4000 (16GB), 100.0GB (disk)   yes   $0.09   
 2  runpod   CA-MTL-1  NVIDIA RTX A4000  9xCPU, 50GB, 1xA4000 (16GB), 100.0GB (disk)   yes   $0.09   
 3  runpod   EUR-IS-1  NVIDIA RTX A4000  16xCPU, 31GB, 1xA4000 (16GB), 100.0GB (disk)  yes   $0.09   
    ...                                                                                              
 Shown 3 of 331 offers, $16 max

Submit a new run? [y/n]: y
curvy-parrot-1 provisioning completed (terminating)
All provisioning attempts failed. This is likely due to cloud providers not having enough capacity. Check CLI and server logs for more details

Expected behaviour

Unavailable offers are not displayed in the offers list, and provisioning does not fail if RunPod has spare capacity (which it probably does, just not among the 15 offers that are currently tried).

dstack version

master

Server logs

INFO     dstack._internal.server.services.backends:404 Requesting instance offers from backends: ['runpod']                                         
           DEBUG    dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX A4000 in              
                    runpod/EU-SE-1 for $0.0900 per hour                                                                                                        
[13:17:06] WARNING  dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX A4000 launch in              
                    runpod/EU-SE-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again later.')
           DEBUG    dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX A4000 in              
                    runpod/CA-MTL-1 for $0.0900 per hour                                                                                                       
           WARNING  dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX A4000 launch in              
                    runpod/CA-MTL-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again        
                    later.')                                                                                                                                   
           DEBUG    dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX A4000 in              
                    runpod/EUR-IS-1 for $0.0900 per hour                                                                                                       
[13:17:07] DEBUG    dstack._internal.server.background.tasks.process_runs:87 run(a1d6ed)aana-tests: processing run                                             
           WARNING  dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX A4000 launch in              
                    runpod/EUR-IS-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again        
                    later.')                                                                                                                                   
           DEBUG    dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX 4000 Ada Generation in
                    runpod/EUR-IS-1 for $0.1000 per hour                                                                                                       
           WARNING  dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX 4000 Ada Generation launch in
                    runpod/EUR-IS-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again        
                    later.')                                                                                                                                   
           DEBUG    dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX 4000 Ada Generation in
                    runpod/EU-RO-1 for $0.1000 per hour                                                                                                        
[13:17:08] WARNING  dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX 4000 Ada Generation launch in
                    runpod/EU-RO-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again later.')
           DEBUG    dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX A4500 in              
                    runpod/EU-RO-1 for $0.1000 per hour                                                                                                        
[13:17:09] WARNING  dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX A4500 launch in              
                    runpod/EU-RO-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again later.')
           DEBUG    dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX A5000 in              
                    runpod/EU-SE-1 for $0.1100 per hour                                                                                                        
           DEBUG    dstack._internal.server.background.tasks.process_running_jobs:249 job(487481)aana-tests-0-0: process running job, age=0:19:50.423037       
           WARNING  dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX A5000 launch in              
                    runpod/EU-SE-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again later.')
           DEBUG    dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX A5000 in              
                    runpod/CA-MTL-1 for $0.1100 per hour                                                                                                       
[13:17:10] WARNING  dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX A5000 launch in              
                    runpod/CA-MTL-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again        
                    later.')                                                                                                                                   
           DEBUG    dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX 2000 Ada Generation in
                    runpod/EUR-IS-1 for $0.1400 per hour                                                                                                       
           DEBUG    dstack._internal.server.app:213 Processed request POST http://0.0.0.0:3000/api/project/ilya/runs/get in 0.038164s                          
           WARNING  dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX 2000 Ada Generation launch in
                    runpod/EUR-IS-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again        
                    later.')                                                                                                                                   
           DEBUG    dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX 2000 Ada Generation in
                    runpod/EU-RO-1 for $0.1400 per hour                                                                                                        
[13:17:11] WARNING  dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX 2000 Ada Generation launch in
                    runpod/EU-RO-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again later.')
           DEBUG    dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX A4000 in              
                    runpod/EU-SE-1 for $0.1600 per hour                                                                                                        
[13:17:12] WARNING  dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX A4000 launch in              
                    runpod/EU-SE-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again later.')
           DEBUG    dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX A4000 in              
                    runpod/CA-MTL-1 for $0.1600 per hour                                                                                                       
           WARNING  dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX A4000 launch in              
                    runpod/CA-MTL-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again        
                    later.')                                                                                                                                   
           DEBUG    dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX A4000 in              
                    runpod/EUR-IS-1 for $0.1600 per hour                                                                                                       
[13:17:13] WARNING  dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX A4000 launch in              
                    runpod/EUR-IS-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again        
                    later.')                                                                                                                                   
           DEBUG    dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX A4000 in              
                    runpod/CA-MTL-1 for $0.1800 per hour                                                                                                       
           WARNING  dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX A4000 launch in              
                    runpod/CA-MTL-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again        
                    later.')                                                                                                                                   
           DEBUG    dstack._internal.server.background.tasks.process_submitted_jobs:379 job(c4919d)curvy-parrot-1-0-0: trying NVIDIA RTX A4000 in              
                    runpod/EUR-IS-1 for $0.1800 per hour                                                                                                       
           DEBUG    dstack._internal.server.background.tasks.process_running_jobs:249 job(487481)aana-tests-0-0: process running job, age=0:19:55.001004       
[13:17:14] WARNING  dstack._internal.server.background.tasks.process_submitted_jobs:399 job(c4919d)curvy-parrot-1-0-0: NVIDIA RTX A4000 launch in              
                    runpod/EUR-IS-1 failed: BackendError('There are no longer any instances available with the request specifications. Please try again        
                    later.')                                                                                                                                   
           DEBUG    dstack._internal.server.background.tasks.process_submitted_jobs:233 job(c4919d)curvy-parrot-1-0-0: provisioning failed                     
[13:17:15] DEBUG    dstack._internal.server.background.tasks.process_runs:87 run(398045)curvy-parrot-1: processing run                                         
           INFO     dstack._internal.server.background.tasks.process_runs:330 run(398045)curvy-parrot-1: run status has changed SUBMITTED -> TERMINATING

Additional information

Reproducibility depends on current offer availability in RunPod. If you can't reproduce, try running multiple configurations in parallel to exhaust availability artificially. Even if your run does start, check the server logs to see how many offers were skipped; skipping too many offers is not expected.
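
To check how many offers were skipped, the failed launch attempts in the server logs can be counted. The following is a minimal sketch, not part of dstack: the function name `count_failed_launches` is hypothetical, and the pattern assumes the WARNING message format shown in the logs above ("launch in <backend>/<region> failed"), which may differ between versions.

```python
import re
from collections import Counter

# Assumed message format, taken from the WARNING lines in the logs above.
FAILED_LAUNCH = re.compile(r"launch in (\w+)/([\w-]+) failed")


def count_failed_launches(log_text: str) -> Counter:
    """Count failed launch attempts per (backend, region) pair."""
    # The console log wraps long messages across lines, so collapse all
    # whitespace into single spaces before matching.
    flat = " ".join(log_text.split())
    return Counter(FAILED_LAUNCH.findall(flat))


if __name__ == "__main__":
    import sys

    # Usage: python count_failed.py < server.log
    for (backend, region), n in count_failed_launches(sys.stdin.read()).most_common():
        print(f"{backend}/{region}: {n} failed launches")
```

A total close to the offer limit means that nearly every attempt was spent on unavailable offers.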

jvstme avatar Sep 30 '24 08:09 jvstme

I think an easy solution would be to try more offers. The limit of 15 used to make a lot of sense when the server could not process multiple jobs in parallel. Now we can allow a submitted job to be processed for a longer time. Still, to avoid trying thousands of offers, there should be a limit. I think something like 50 should work, but it needs some experimentation with RunPod to see if the issue gets fixed.
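
The idea of a bounded number of attempts can be sketched roughly as follows. This is an illustration only, not dstack's actual provisioning code: `Offer`, `NoCapacityError`, `try_offers`, and the `launch` callback are hypothetical names, and the default of 50 is just the value suggested above.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Offer:
    """Hypothetical stand-in for an instance offer."""
    backend: str
    region: str
    instance: str
    price: float


class NoCapacityError(Exception):
    """Raised by the (hypothetical) launcher when an offer is unavailable."""


def try_offers(
    offers: Iterable[Offer],
    launch: Callable[[Offer], None],
    max_attempts: int = 50,
) -> Optional[Offer]:
    """Try the cheapest offers first and stop after max_attempts failures."""
    cheapest_first = sorted(offers, key=lambda o: o.price)
    for offer in islice(cheapest_first, max_attempts):
        try:
            launch(offer)
        except NoCapacityError:
            continue  # Offer is unavailable, move on to the next one.
        return offer  # First offer that launched successfully.
    return None  # Every attempted offer failed.
```

With a limit of 15, a run fails as soon as the 15 cheapest offers are unavailable, even if the 16th would have worked; raising the limit trades longer processing of a submitted job for a better chance of finding capacity.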

r4victor avatar Sep 30 '24 10:09 r4victor

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] avatar Oct 31 '24 01:10 github-actions[bot]

This issue was closed because it has been inactive for 14 days since being marked as stale. Please reopen the issue if it is still relevant.

github-actions[bot] avatar Nov 15 '24 02:11 github-actions[bot]

@jvstme is this issue still relevant?

peterschmidt85 avatar Nov 15 '24 09:11 peterschmidt85

Even more relevant now with the introduction of the Community Cloud. It seems that many spot GPUs are not actually available at the price returned by the RunPod API and can only be deployed at a higher bid. I've reached out to the RunPod team to see if this is the case and if the accuracy of spot prices in the API can be improved.

jvstme avatar Mar 06 '25 15:03 jvstme

RunPod has acknowledged this issue; we are waiting for a fix on their end.

jvstme avatar Mar 11 '25 16:03 jvstme

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] avatar Apr 11 '25 02:04 github-actions[bot]

This issue was closed because it has been inactive for 14 days since being marked as stale. Please reopen the issue if it is still relevant.

github-actions[bot] avatar Apr 25 '25 02:04 github-actions[bot]

Just throwing in an extra data point that I believe is relevant to this issue, but from a different backend.

My company has drastically limited the types of instances I can run on Azure, blocking, I would guess, 90% of the available instance types.
As a result, when I create a task with spot_policy set to auto and the spot version of one of the few types I do have access to is unavailable, dstack begins an exhaustive search through many more expensive spot types that are unavailable to me but still cheaper than the on-demand version of the type I do have access to.
Because the offer check limit (which now appears to be 25, judging by my logs) is too low to get through all the cheaper instance types, my job never starts and fails with FAILED_TO_START_DUE_TO_NO_CAPACITY.

I think any fix on the dstack side to this bug would fix mine as well, so I didn't want to create a duplicate bug.

Obviously, I have at least one potential workaround: poke my IT department to free up access to more spot types. But for the moment, my only recourse is to abandon spot_policy: auto for spot_policy: on-demand to ensure my jobs are run.

ASmedberg-woolpert avatar May 14 '25 03:05 ASmedberg-woolpert

@jvstme Also, we allow limiting instance_types directly in a run configuration for Azure too, right?
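
To illustrate why limiting instance types helps in this scenario, here is a small simulation. It is a sketch under assumptions, not dstack's code: the `Offer` class, the `candidates` function, the type names, and the prices are all made up, and the limit of 25 is the value reported in the comment above.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Offer:
    """Hypothetical offer: an instance type, spot or on-demand, with a price."""
    instance_type: str
    spot: bool
    price: float


def candidates(offers: List[Offer], allowed_types: Set[str], limit: int) -> List[Offer]:
    """Return the cheapest `limit` offers; an empty `allowed_types` means no filter."""
    pool = [o for o in offers if not allowed_types or o.instance_type in allowed_types]
    return sorted(pool, key=lambda o: o.price)[:limit]


# One accessible type (spot and on-demand) plus 40 blocked spot types that are
# all cheaper than the accessible on-demand offer. All values are made up.
offers = [Offer("allowed-type", True, 0.10), Offer("allowed-type", False, 1.00)]
offers += [Offer(f"blocked-type-{i}", True, 0.20 + 0.01 * i) for i in range(40)]

unfiltered = candidates(offers, set(), limit=25)
filtered = candidates(offers, {"allowed-type"}, limit=25)

# Without the filter, the accessible on-demand offer never makes the cut.
print(any(not o.spot for o in unfiltered))
# With the filter, it is the second candidate.
print([(o.instance_type, o.spot) for o in filtered])
```

Restricting the candidate list before the limit is applied keeps blocked types from using up the attempts, so the on-demand fallback stays reachable.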

peterschmidt85 avatar May 14 '25 10:05 peterschmidt85

@ASmedberg-woolpert, this ticket is specific to the RunPod backend, and we expect it to be fixed on the RunPod side.

Regarding the issue you're experiencing with Azure, I've suggested some solutions on Discord.

jvstme avatar May 14 '25 10:05 jvstme