
[Feature]: Display offer deployment attempts in `dstack apply`

Open jvstme opened this issue 8 months ago • 4 comments

Problem

The UX is poor when the user applies a run or fleet configuration, some offers fail, and dstack tries other offers.

  • The user may not notice that dstack provisioned a different (e.g. much more expensive) instance than those shown at the top of the offers list.
  • It is not possible to troubleshoot why some offers failed.
  • There may be no visible progress for several minutes.
    ⠹ Launching yellow-goose-1...
     NAME            BACKEND  RESOURCES  PRICE  STATUS     SUBMITTED  
     yellow-goose-1                             submitted  5 mins ago
    

Solution

  1. Display the offer dstack is currently trying to provision. For example, it can be displayed in the provisioning table with status "submitted":
    Submit a new run? [y/n]: y
    ⠏ Launching cuddly-bullfrog-1...
     NAME               BACKEND      RESOURCES                                          PRICE  STATUS     SUBMITTED  
     cuddly-bullfrog-1  runpod (CZ)  4xCPU, 31GB, 1xA4000 (16GB), 100.0GB (disk), SPOT  $0.09  submitted  35 sec ago
    
  2. Display the offers that failed and their (possibly ugly) backend-specific provisioning errors. For example, they can be logged above the provisioning table.
    Submit a new run? [y/n]: y
    
    Failed launching cpu-d3 4vcpu-16gb in nebius (eu-north1):
        Request error RESOURCE_EXHAUSTED: failed to create disk: acquire quota: rpc error:
        code = ResourceExhausted desc = Quota limit exceeded. Exceeded limit for container tenant-e00vivw0vwq2e2myxc, quota
        compute.disk.size.network-ssd.; request_id: e65d1a3a-3ab1-4c03-b04c-cd880470907c; trace_id: 1497e9cc4d1515a060129d214fd76df0; Caused by
        error: 1. QuotaFailure in service iam quota failure, violations:  compute.disk.size.network-ssd 3221225472000 of 2199023255552: Exceeded
        limit for container tenant-e00vivw0vwq2e2myxc; (additional details not shown)
    Failed launching NVIDIA RTX A2000 (spot) in runpod (ES):
        There are no longer any instances available with the request specifications. Please try again later.
    Failed launching NVIDIA GeForce RTX 3070 (spot) in runpod (TT):
        There are no longer any instances available with the request specifications. Please try again later.
    Failed launching p4d.24xlarge (spot) in aws (us-east-2):
        An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 4): We currently do not
        have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-2b). Our system will be working on provisioning
        additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-2a
    
    ⠏ Launching yellow-goose-1...
     NAME            BACKEND          RESOURCES                                      PRICE   STATUS   SUBMITTED  
     yellow-goose-1  aws (us-east-1)  96xCPU, 1152GB, 8xA100 (40GB), 100.0GB (disk)  $32.77  pulling  7 mins ago
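
The "failed offers" log in item 2 could be rendered roughly as follows. This is a minimal sketch, not dstack code: `FailedAttempt`, its fields, and `format_failed_attempts` are hypothetical names chosen for illustration, and the wrap width of 100 is an assumption.

```python
import textwrap
from dataclasses import dataclass


@dataclass
class FailedAttempt:
    # Hypothetical record; the field names are assumptions, not dstack's data model.
    offer: str    # e.g. "p4d.24xlarge (spot)"
    backend: str  # e.g. "aws"
    region: str   # e.g. "us-east-2"
    error: str    # raw backend-specific error message


def format_failed_attempts(attempts, width=100):
    """Render each failed attempt as a header line plus an indented, wrapped error."""
    lines = []
    for a in attempts:
        lines.append(f"Failed launching {a.offer} in {a.backend} ({a.region}):")
        # Wrap the (possibly long and ugly) backend error and indent it under the header.
        lines.append(
            textwrap.fill(
                a.error, width=width, initial_indent="    ", subsequent_indent="    "
            )
        )
    return "\n".join(lines)


attempts = [
    FailedAttempt(
        offer="NVIDIA RTX A2000 (spot)",
        backend="runpod",
        region="ES",
        error=(
            "There are no longer any instances available with the request "
            "specifications. Please try again later."
        ),
    ),
]
print(format_failed_attempts(attempts))
```

Printing this text above the live provisioning table would keep the errors visible without disturbing the table itself.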
    

Workaround

Follow the dstack server logs, if you have access to them.

Implementation notes

Not trivial to implement: it might require storing the current offer and provisioning errors in the database.
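
To make the storage idea concrete, here is a rough sketch of what persisting attempts could look like. It uses Python's built-in `sqlite3` purely for illustration; the `provisioning_attempts` table, its columns, and the helper functions are all hypothetical and do not reflect dstack's actual schema.

```python
import sqlite3

# In-memory database for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE provisioning_attempts (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        run_name TEXT NOT NULL,
        offer TEXT NOT NULL,      -- human-readable offer description
        status TEXT NOT NULL,     -- 'provisioning' or 'failed' in this sketch
        error TEXT                -- backend-specific error, if any
    )
    """
)


def start_attempt(conn, run_name, offer):
    """Record the offer that is currently being provisioned."""
    cur = conn.execute(
        "INSERT INTO provisioning_attempts (run_name, offer, status) "
        "VALUES (?, ?, 'provisioning')",
        (run_name, offer),
    )
    conn.commit()
    return cur.lastrowid


def fail_attempt(conn, attempt_id, error):
    """Mark an attempt as failed and keep the raw provisioning error."""
    conn.execute(
        "UPDATE provisioning_attempts SET status = 'failed', error = ? WHERE id = ?",
        (error, attempt_id),
    )
    conn.commit()


def get_attempts(conn, run_name):
    """Return (offer, status, error) for all attempts of a run, oldest first."""
    rows = conn.execute(
        "SELECT offer, status, error FROM provisioning_attempts "
        "WHERE run_name = ? ORDER BY id",
        (run_name,),
    )
    return rows.fetchall()


first = start_attempt(conn, "yellow-goose-1", "p4d.24xlarge (spot) in aws (us-east-2)")
fail_attempt(conn, first, "InsufficientInstanceCapacity")
start_attempt(conn, "yellow-goose-1", "8xA100 (40GB) in aws (us-east-1)")
print(get_attempts(conn, "yellow-goose-1"))
```

With something like this in place, the CLI could poll the attempts for a run, print the failed ones, and show the one still in `provisioning` status in the table.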

Would you like to help us implement this feature by sending a PR?

Yes

jvstme · Apr 08 '25 20:04