[Bug]: Retry Policy not re-submitting run with Runpod spot provisioning.
Steps to reproduce
PR 1119 adds spot provisioning for Runpod.
A run provisioned with dstack run . -b runpod --gpu 1 --spot --retry is expected to be re-submitted if the pod is terminated from the web console.
Actual behaviour
After rebasing the PR 1119 branch onto the latest master (dstack 0.18.0), the run is no longer re-submitted. Please see the logs below:
[14:30:55] DEBUG dstack._internal.server.background.tasks.process_running_jobs:207 job(204262)strange-bear-1-0-0: process running job, age=0:01:47.196351
[14:30:59] DEBUG dstack._internal.server.background.tasks.process_running_jobs:207 job(204262)strange-bear-1-0-0: process running job, age=0:01:51.190070
[14:31:03] DEBUG dstack._internal.server.background.tasks.process_running_jobs:207 job(204262)strange-bear-1-0-0: process running job, age=0:01:55.199926
[14:31:06] DEBUG dstack._internal.server.app:176 Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.027767s
[14:31:07] DEBUG dstack._internal.server.app:176 Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.026637s
DEBUG dstack._internal.server.background.tasks.process_running_jobs:207 job(204262)strange-bear-1-0-0: process running job, age=0:01:59.202287
[14:31:08] DEBUG dstack._internal.server.app:176 Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.030931s
[14:31:09] DEBUG dstack._internal.server.app:176 Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.024694s
INFO dstack._internal.server.background.tasks.process_running_jobs:448 job(204262)strange-bear-1-0-0: now is TERMINATING
INFO dstack._internal.server.background.tasks.process_runs:308 run(4c911b)strange-bear-1: run status has changed RUNNING -> TERMINATING
[14:31:10] DEBUG dstack._internal.server.app:176 Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.027628s
[14:31:11] DEBUG dstack._internal.server.app:176 Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.025220s
DEBUG dstack._internal.server.services.jobs:201 job(204262)strange-bear-1-0-0: stopping container
INFO dstack._internal.server.services.jobs:217 job(204262)strange-bear-1-0-0: instance 'strange-bear-1-0-0' has been released, new status is TERMINATING
INFO dstack._internal.server.services.jobs:234 job(204262)strange-bear-1-0-0: job status is FAILED, reason: CONTAINER_EXITED_WITH_ERROR
[14:31:12] DEBUG dstack._internal.server.app:176 Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.025749s
INFO dstack._internal.server.services.runs:811 run(4c911b)strange-bear-1: run status has changed TERMINATING -> FAILED, reason: JOB_FAILED
[14:31:13] DEBUG dstack._internal.server.app:176 Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.026232s
[14:31:16] DEBUG dstack._internal.core.backends.runpod.compute:121 The instance with name oqwc4y4mbnl1yt not found
INFO dstack._internal.server.background.tasks.process_instances:429 instance strange-bear-1-0-0 terminated
Expected behaviour
Before rebasing onto dstack 0.18.0, the run was re-submitted as shown below:
[14:19:35] DEBUG job(8ad8c5)witty-sloth-1-0-0: process running job, age=0:01:13.342728 dstack._internal.server.background.tasks.process_running_jobs:207
[14:19:39] DEBUG job(8ad8c5)witty-sloth-1-0-0: process running job, age=0:01:17.343550 dstack._internal.server.background.tasks.process_running_jobs:207
[14:19:43] DEBUG job(8ad8c5)witty-sloth-1-0-0: process running job, age=0:01:21.338029 dstack._internal.server.background.tasks.process_running_jobs:207
[14:19:44] DEBUG SSH tunnel failed: b'Connection closed by 194.26.196.139 port 18937\r\n' dstack._internal.core.services.ssh.tunnel:63
DEBUG Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.029451s dstack._internal.server.app:175
[14:19:45] DEBUG Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.024377s dstack._internal.server.app:175
DEBUG SSH tunnel failed: b'ssh: connect to host 194.26.196.139 port 18937: Connection refused\r\n' dstack._internal.core.services.ssh.tunnel:63
[14:19:46] DEBUG Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.027882s dstack._internal.server.app:175
DEBUG SSH tunnel failed: b'ssh: connect to host 194.26.196.139 port 18937: Connection refused\r\n' dstack._internal.core.services.ssh.tunnel:63
WARNING job(8ad8c5)witty-sloth-1-0-0: failed because runner is not available or return an error, age=0:01:24.534835 dstack._internal.server.background.tasks.process_running_jobs:219
INFO run(b1bd94)witty-sloth-1: run status has changed RUNNING -> PENDING dstack._internal.server.background.tasks.process_runs:308
[14:19:47] DEBUG Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.028204s dstack._internal.server.app:175
DEBUG job(8ad8c5)witty-sloth-1-0-0: stopping container dstack._internal.server.services.jobs:201
INFO job(8ad8c5)witty-sloth-1-0-0: instance 'witty-sloth-1-0-0' has been released, new status is TERMINATING dstack._internal.server.services.jobs:217
INFO job(8ad8c5)witty-sloth-1-0-0: job status is FAILED, reason: INTERRUPTED_BY_NO_CAPACITY dstack._internal.server.services.jobs:234
[14:19:48] DEBUG Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.027822s dstack._internal.server.app:175
[14:19:49] DEBUG Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.027992s dstack._internal.server.app:175
INFO run(b1bd94)witty-sloth-1: run status has changed PENDING -> SUBMITTED dstack._internal.server.background.tasks.process_runs:172
[14:19:50] DEBUG Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.026344s dstack._internal.server.app:175
[14:19:51] DEBUG job(6b4836)witty-sloth-1-0-0: provisioning has started dstack._internal.server.background.tasks.process_submitted_jobs:97
DEBUG job(6b4836)witty-sloth-1-0-0: trying NVIDIA RTX A4000 in runpod/EUR-IS-1 for $0.1500 per hour dstack._internal.server.background.tasks.process_submitted_jobs:263
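To compare the two excerpts quickly, a small helper along these lines could pull out the run status transitions and the job termination reasons. This is only a sketch: `summarize` and both regexes are illustrative names of my own, not part of dstack, and the sample strings are shortened copies of the log lines above (module paths dropped).

```python
import re

# Patterns for the two message shapes that appear in the logs above.
RUN_STATUS = re.compile(r"run status has changed (\w+) -> (\w+)")
JOB_REASON = re.compile(r"job status is (\w+), reason: (\w+)")


def summarize(log_text: str) -> dict:
    """Collect run status transitions and job termination reasons from a log excerpt."""
    # Collapse all whitespace so messages wrapped across lines still match.
    flat = " ".join(log_text.split())
    return {
        "transitions": RUN_STATUS.findall(flat),
        "job_reasons": JOB_REASON.findall(flat),
    }


# Shortened copies of the relevant lines from the failing run (after the rebase).
actual = """
INFO run(4c911b)strange-bear-1: run status has changed RUNNING -> TERMINATING
INFO job(204262)strange-bear-1-0-0: job status
is FAILED, reason: CONTAINER_EXITED_WITH_ERROR
INFO run(4c911b)strange-bear-1: run status has changed TERMINATING -> FAILED, reason: JOB_FAILED
"""

# Shortened copies of the relevant lines from the run that was re-submitted.
expected = """
INFO run(b1bd94)witty-sloth-1: run status has changed RUNNING -> PENDING
INFO job(8ad8c5)witty-sloth-1-0-0: job status is FAILED, reason: INTERRUPTED_BY_NO_CAPACITY
INFO run(b1bd94)witty-sloth-1: run status has changed PENDING -> SUBMITTED
"""

if __name__ == "__main__":
    print("actual:  ", summarize(actual))
    print("expected:", summarize(expected))
```

The difference that stands out is the job termination reason: CONTAINER_EXITED_WITH_ERROR in the failing case versus INTERRUPTED_BY_NO_CAPACITY in the case where the run went back to PENDING and was re-submitted.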
dstack version
dstack 0.18.0
Server logs
No response
Additional information
No response
Could you please include the output of the dstack run command?
@peterschmidt85 To exit, press Ctrl+C.Run failed with error code JobTerminationReason.CONTAINER_EXITED_WITH_ERROR. Check CLI and server logs for more details.
@Bihan I mean the entire output, including the plan and everything else above the confirmation line.
@peterschmidt85
(venv) bihan@Bihans-MacBook-Pro config_files_for_test % dstack run . -b runpod --gpu 1 --spot --retry
 Configuration          .dstack.yml
 Project                main
 User                   admin
 Pool name              default-pool
 Min resources          2..xCPU, 8GB.., 1xGPU, 100GB (disk)
 Max price              -
 Max duration           6h
 Spot policy            spot
 Retry policy           yes
 Creation policy        reuse-or-create
 Termination policy     destroy-after-idle
 Termination idle time  300s

 #  BACKEND  REGION    INSTANCE                        RESOURCES                                       SPOT  PRICE
 1  runpod   EUR-IS-1  NVIDIA RTX A4000                16xCPU, 31GB, 1xRTXA4000 (16GB), 100GB (disk)   yes   $0.15
 2  runpod   EU-RO-1   NVIDIA RTX 4000 Ada Generation  18xCPU, 100GB, 1xRTX4000 (20GB), 100GB (disk)   yes   $0.15
 3  runpod   EUR-NO-1  NVIDIA RTX A4000                48xCPU, 184GB, 1xRTXA4000 (16GB), 100GB (disk)  yes   $0.15
...
Shown 3 of 158 offers, $2.49 max
Continue? [y/n]: y
plastic-cobra-1 provisioning completed (running)
/tmp/vscode-server- 100%[===================>] 54.80M 111MB/s in 0.5s
mkdir: created directory '/root/.vscode-server'
mkdir: created directory '/root/.vscode-server/bin'
mkdir: created directory '/root/.vscode-server/bin/863d2581ecda6849923a2118d93a088b0745d9d6'
Installing extensions...
Installing extension 'ms-python.python'...
Installing extension 'ms-toolsai.jupyter'...
Extension 'ms-toolsai.jupyter-keymap' v1.1.2 was successfully installed.
Extension 'ms-toolsai.vscode-jupyter-slideshow' v0.1.5 was successfully installed.
Extension 'ms-toolsai.vscode-jupyter-cell-tags' v0.1.8 was successfully installed.
Extension 'ms-toolsai.jupyter' v2024.2.0 was successfully installed.
Extension 'ms-toolsai.jupyter-renderers' v1.0.17 was successfully installed.
Extension 'ms-python.debugpy' v2024.4.0 was successfully installed.
Extension 'ms-python.python' v2024.4.1 was successfully installed.
Extension 'ms-python.vscode-pylance' v2024.4.1 was successfully installed.
pip install ipykernel...
To open in VS Code Desktop, use link below:
vscode://vscode-remote/ssh-remote+plastic-cobra-1/workflow
To connect via SSH, use: `ssh plastic-cobra-1`
To exit, press Ctrl+C.Run failed with error code JobTerminationReason.CONTAINER_EXITED_WITH_ERROR. Check CLI and server logs for more details.
(venv) bihan@Bihans-MacBook-Pro config_files_for_test %
Thank you!
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale. Please reopen the issue if it is still relevant.