dstack
[Bug]: Misleading error code when multi-replica service provisioning fails
Steps to reproduce
Run a two-replica service, but set requirements that will only match one instance. For example:
- Provision a single-instance fleet.
> cat fleets/cloud.dstack.yml
type: fleet
name: cloud
nodes: 1
> dstack apply -f fleets/cloud.dstack.yml -y
- Wait until the instance is idle.
- Try running a two-replica service using just this one instance.
> cat services/httpbin.dstack.yml
type: service
name: httpbin
image: kennethreitz/httpbin
port: 80
replicas: 2
> dstack apply -f services/httpbin.dstack.yml --reuse -y
Actual behaviour
Run failed with error code TERMINATED_BY_SERVER.
Check CLI, server, and run logs for more details.
Expected behaviour
The run fails with FAILED_TO_START_DUE_TO_NO_CAPACITY, and the CLI shows a relevant message:
All provisioning attempts failed. This is likely due to cloud providers not having enough capacity. Check CLI and server logs for more details.
dstack version
master
Server logs
Only the relevant logs:
DEBUG dstack._internal.server.background.tasks.process_submitted_jobs:98 job(a01b1f)httpbin-0-0: provisioning has started
INFO dstack._internal.server.background.tasks.process_submitted_jobs:333 The job httpbin-0-0 switched instance cloud-0 status to BUSY
INFO dstack._internal.server.background.tasks.process_submitted_jobs:342 job(a01b1f)httpbin-0-0: now is provisioning on 'cloud-0'
[21:02:57] DEBUG dstack._internal.server.background.tasks.process_submitted_jobs:98 job(9c111a)httpbin-0-1: provisioning has started
[21:03:01] DEBUG dstack._internal.server.background.tasks.process_submitted_jobs:98 job(a01b1f)httpbin-0-0: provisioning has started
INFO dstack._internal.server.background.tasks.process_runs:330 run(af256e)httpbin: run status has changed SUBMITTED -> PROVISIONING
[21:03:06] DEBUG dstack._internal.server.background.tasks.process_submitted_jobs:98 job(9c111a)httpbin-0-1: provisioning has started
DEBUG dstack._internal.server.background.tasks.process_submitted_jobs:213 job(9c111a)httpbin-0-1: reuse instance failed
[21:03:07] INFO dstack._internal.server.services.jobs:262 job(9c111a)httpbin-0-1: job status is FAILED, reason: FAILED_TO_START_DUE_TO_NO_CAPACITY
INFO dstack._internal.server.background.tasks.process_running_jobs:413 job(a01b1f)httpbin-0-0: now is PULLING
INFO dstack._internal.server.background.tasks.process_runs:330 run(af256e)httpbin: run status has changed PROVISIONING -> TERMINATING
[21:03:15] DEBUG dstack._internal.server.services.jobs:213 job(a01b1f)httpbin-0-0: stopping container
INFO dstack._internal.server.services.jobs:247 job(a01b1f)httpbin-0-0: instance 'cloud-0' has been released, new status is IDLE
INFO dstack._internal.server.services.jobs:262 job(a01b1f)httpbin-0-0: job status is TERMINATED, reason: TERMINATED_BY_SERVER
INFO dstack._internal.server.services.runs:933 run(af256e)httpbin: run status has changed TERMINATING -> FAILED, reason: JOB_FAILED
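In the logs above, httpbin-0-1 fails with FAILED_TO_START_DUE_TO_NO_CAPACITY, and httpbin-0-0 is only stopped afterwards with TERMINATED_BY_SERVER, yet the CLI reports the latter. A minimal sketch of the expected selection logic follows. It is not dstack's actual implementation: `JobOutcome` and `pick_reported_reason` are hypothetical names used for illustration, and only the two reason strings are taken from the logs.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Reason names copied from the logs above; everything else is hypothetical.
NO_CAPACITY = "FAILED_TO_START_DUE_TO_NO_CAPACITY"
TERMINATED_BY_SERVER = "TERMINATED_BY_SERVER"


@dataclass
class JobOutcome:
    """Hypothetical record of how one replica job ended."""
    name: str
    status: str  # "FAILED" or "TERMINATED"
    reason: str


def pick_reported_reason(jobs: Sequence[JobOutcome]) -> Optional[str]:
    """Return the reason of a job that failed on its own, if there is one.

    Jobs that the server stopped as a side effect of another job's failure
    carry TERMINATED_BY_SERVER; they are used only as a fallback.
    """
    root_causes = [j.reason for j in jobs if j.reason != TERMINATED_BY_SERVER]
    if root_causes:
        return root_causes[0]
    return jobs[0].reason if jobs else None


# The two replica jobs from the logs, in job-number order.
jobs = [
    JobOutcome("httpbin-0-0", "TERMINATED", TERMINATED_BY_SERVER),
    JobOutcome("httpbin-0-1", "FAILED", NO_CAPACITY),
]
print(pick_reported_reason(jobs))  # prints FAILED_TO_START_DUE_TO_NO_CAPACITY
```

Taking the reason from the first job in the list would yield TERMINATED_BY_SERVER here, which matches the misleading message in the actual behaviour. Whether that is the actual cause in the server code is not confirmed by this report.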
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale. Please reopen the issue if it is still relevant.