dstack
[Feature]: Display offer deployment attempts in `dstack apply`
Problem
The UX is poor when the user applies a run or fleet configuration, some offers fail, and dstack tries other offers.
- It can remain unnoticed that `dstack` provisioned a different (e.g. much more expensive) instance than those shown at the top of the offers list.
- It is not possible to troubleshoot why some offers failed.
- There may be no visible progress for several minutes.
    ⠹ Launching yellow-goose-1...
     NAME            BACKEND  RESOURCES  PRICE  STATUS     SUBMITTED
     yellow-goose-1                             submitted  5 mins ago
Solution
- Display the offer `dstack` is currently trying to provision. For example, it can be displayed in the provisioning table with status "submitted":

      Submit a new run? [y/n]: y
      ⠏ Launching cuddly-bullfrog-1...
       NAME               BACKEND      RESOURCES                                          PRICE  STATUS     SUBMITTED
       cuddly-bullfrog-1  runpod (CZ)  4xCPU, 31GB, 1xA4000 (16GB), 100.0GB (disk), SPOT  $0.09  submitted  35 sec ago

- Display the offers that failed and their (possibly ugly) backend-specific provisioning errors. For example, they can be logged above the provisioning table.

      Submit a new run? [y/n]: y
      Failed launching cpu-d3 4vcpu-16gb in nebius (eu-north1): Request error RESOURCE_EXHAUSTED: failed to create disk: acquire quota: rpc error: code = ResourceExhausted desc = Quota limit exceeded. Exceeded limit for container tenant-e00vivw0vwq2e2myxc, quota compute.disk.size.network-ssd.; request_id: e65d1a3a-3ab1-4c03-b04c-cd880470907c; trace_id: 1497e9cc4d1515a060129d214fd76df0; Caused by error: 1. QuotaFailure in service iam quota failure, violations: compute.disk.size.network-ssd 3221225472000 of 2199023255552: Exceeded limit for container tenant-e00vivw0vwq2e2myxc; (additional details not shown)
      Failed launching NVIDIA RTX A2000 (spot) in runpod (ES): There are no longer any instances available with the request specifications. Please try again later.
      Failed launching NVIDIA GeForce RTX 3070 (spot) in runpod (TT): There are no longer any instances available with the request specifications. Please try again later.
      Failed launching p4d.24xlarge (spot) in aws (us-east-2): An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 4): We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-2b). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-2a
      ⠏ Launching yellow-goose-1...
       NAME            BACKEND          RESOURCES                                      PRICE   STATUS   SUBMITTED
       yellow-goose-1  aws (us-east-1)  96xCPU, 1152GB, 8xA100 (40GB), 100.0GB (disk)  $32.77  pulling  7 mins ago
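As a rough illustration of the failure lines proposed above, here is a minimal Python sketch of how one failed offer could be rendered. `FailedOffer` and `format_failure` are hypothetical names made up for this example; they are not part of dstack's codebase or API.

```python
from dataclasses import dataclass


@dataclass
class FailedOffer:
    """One offer that could not be provisioned (hypothetical type, not dstack's)."""

    instance: str  # e.g. "p4d.24xlarge"
    backend: str  # e.g. "aws"
    region: str  # e.g. "us-east-2"
    error: str  # raw backend-specific error text
    spot: bool = False


def format_failure(offer: FailedOffer) -> str:
    """Render one failure as a single log line in the proposed format."""
    spot = " (spot)" if offer.spot else ""
    # Collapse whitespace so a multi-line backend error stays on one line.
    error = " ".join(offer.error.split())
    return (
        f"Failed launching {offer.instance}{spot} "
        f"in {offer.backend} ({offer.region}): {error}"
    )
```

The error text is passed through unmodified apart from whitespace, which matches the idea of showing the backend's own message even when it is ugly.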
Workaround
Follow dstack server logs if you have access.
Implementation notes
Not too easy to implement, might require storing the current offer and provisioning errors in the database.
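To make the storage idea more concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `offer_attempts` table, its columns, and the helper functions are all assumptions for illustration; they do not reflect dstack's actual schema or server code.

```python
import sqlite3

# Hypothetical table: one row per provisioning attempt of a run.
SCHEMA = """
CREATE TABLE IF NOT EXISTS offer_attempts (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    run_name TEXT NOT NULL,
    instance TEXT NOT NULL,
    backend TEXT NOT NULL,
    region TEXT NOT NULL,
    price REAL NOT NULL,
    error TEXT  -- NULL until the attempt is marked as failed
)
"""


def init_db(conn: sqlite3.Connection) -> None:
    conn.execute(SCHEMA)
    conn.commit()


def start_attempt(conn, run_name, instance, backend, region, price) -> int:
    """Record the offer that is about to be tried; returns the attempt id."""
    cur = conn.execute(
        "INSERT INTO offer_attempts (run_name, instance, backend, region, price) "
        "VALUES (?, ?, ?, ?, ?)",
        (run_name, instance, backend, region, price),
    )
    conn.commit()
    return cur.lastrowid


def fail_attempt(conn, attempt_id: int, error: str) -> None:
    """Attach the backend-specific error to a failed attempt."""
    conn.execute("UPDATE offer_attempts SET error = ? WHERE id = ?", (error, attempt_id))
    conn.commit()


def current_offer(conn, run_name: str):
    """Latest attempt not marked as failed, or None if there is none."""
    return conn.execute(
        "SELECT instance, backend, region, price FROM offer_attempts "
        "WHERE run_name = ? AND error IS NULL ORDER BY id DESC LIMIT 1",
        (run_name,),
    ).fetchone()


def failed_offers(conn, run_name: str):
    """All failed attempts for a run, oldest first."""
    return conn.execute(
        "SELECT instance, backend, region, error FROM offer_attempts "
        "WHERE run_name = ? AND error IS NOT NULL ORDER BY id",
        (run_name,),
    ).fetchall()
```

With something like this, the CLI could poll `current_offer` to fill the provisioning table and `failed_offers` to print the failure lines above it.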
Would you like to help us implement this feature by sending a PR?
Yes