
[Bug]: Retry Policy not re-submitting run with Runpod spot provisioning.

Open Bihan opened this issue 2 months ago • 5 comments

Steps to reproduce

PR 1119 adds spot provisioning for Runpod. With it, provisioning with dstack run . -b runpod --gpu 1 --spot --retry should re-submit the run if the pod is terminated from the web console.

Actual behaviour

After rebasing the PR 1119 branch onto the latest master (dstack 0.18.0), the run is no longer re-submitted. Please see the logs below:

[14:30:55] DEBUG dstack._internal.server.background.tasks.process_running_jobs:207 job(204262)strange-bear-1-0-0: process running job, age=0:01:47.196351
[14:30:59] DEBUG dstack._internal.server.background.tasks.process_running_jobs:207 job(204262)strange-bear-1-0-0: process running job, age=0:01:51.190070
[14:31:03] DEBUG dstack._internal.server.background.tasks.process_running_jobs:207 job(204262)strange-bear-1-0-0: process running job, age=0:01:55.199926
[14:31:06] DEBUG dstack._internal.server.app:176 Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.027767s
[14:31:07] DEBUG dstack._internal.server.app:176 Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.026637s
DEBUG dstack._internal.server.background.tasks.process_running_jobs:207 job(204262)strange-bear-1-0-0: process running job, age=0:01:59.202287
[14:31:08] DEBUG dstack._internal.server.app:176 Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.030931s
[14:31:09] DEBUG dstack._internal.server.app:176 Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.024694s
INFO dstack._internal.server.background.tasks.process_running_jobs:448 job(204262)strange-bear-1-0-0: now is TERMINATING
INFO dstack._internal.server.background.tasks.process_runs:308 run(4c911b)strange-bear-1: run status has changed RUNNING -> TERMINATING
[14:31:10] DEBUG dstack._internal.server.app:176 Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.027628s
[14:31:11] DEBUG dstack._internal.server.app:176 Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.025220s
DEBUG dstack._internal.server.services.jobs:201 job(204262)strange-bear-1-0-0: stopping container
INFO dstack._internal.server.services.jobs:217 job(204262)strange-bear-1-0-0: instance 'strange-bear-1-0-0' has been released, new status is TERMINATING
INFO dstack._internal.server.services.jobs:234 job(204262)strange-bear-1-0-0: job status is FAILED, reason: CONTAINER_EXITED_WITH_ERROR
[14:31:12] DEBUG dstack._internal.server.app:176 Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.025749s
INFO dstack._internal.server.services.runs:811 run(4c911b)strange-bear-1: run status has changed TERMINATING -> FAILED, reason: JOB_FAILED
[14:31:13] DEBUG dstack._internal.server.app:176 Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.026232s
[14:31:16] DEBUG dstack._internal.core.backends.runpod.compute:121 The instance with name oqwc4y4mbnl1yt not found
INFO dstack._internal.server.background.tasks.process_instances:429 instance strange-bear-1-0-0 terminated
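The run status changes and the job failure reason are the parts of this log that matter for the retry behaviour. A minimal sketch for pulling them out of a log excerpt, assuming the log text is available as a string; the helper name `extract_transitions` and the regular expressions are illustrative and not part of dstack:

```python
import re

# Hypothetical helper, not part of dstack: extract run status changes
# and job failure reasons from server log text.
STATUS_CHANGE = re.compile(r"run status has changed (\w+) -> (\w+)")
JOB_FAILED = re.compile(r"job status\s+is FAILED, reason: (\w+)")


def extract_transitions(log_text):
    """Return (list of (old, new) run status pairs, list of job failure reasons)."""
    transitions = STATUS_CHANGE.findall(log_text)
    reasons = JOB_FAILED.findall(log_text)
    return transitions, reasons


# Three lines taken from the log above, shortened to the message part.
sample = (
    "INFO run(4c911b)strange-bear-1: run status has changed RUNNING -> TERMINATING\n"
    "INFO job(204262)strange-bear-1-0-0: job status is FAILED, reason: CONTAINER_EXITED_WITH_ERROR\n"
    "INFO run(4c911b)strange-bear-1: run status has changed TERMINATING -> FAILED, reason: JOB_FAILED\n"
)
transitions, reasons = extract_transitions(sample)
print(transitions)
print(reasons)
```

Running the same helper over both logs in this report gives a compact side-by-side view of where the two runs diverge.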

Expected behaviour

Before rebasing onto dstack 0.18.0, the run was re-submitted, as shown below:

[14:19:35] DEBUG job(8ad8c5)witty-sloth-1-0-0: process running job, age=0:01:13.342728 dstack._internal.server.background.tasks.process_running_jobs:207
[14:19:39] DEBUG job(8ad8c5)witty-sloth-1-0-0: process running job, age=0:01:17.343550 dstack._internal.server.background.tasks.process_running_jobs:207
[14:19:43] DEBUG job(8ad8c5)witty-sloth-1-0-0: process running job, age=0:01:21.338029 dstack._internal.server.background.tasks.process_running_jobs:207
[14:19:44] DEBUG SSH tunnel failed: b'Connection closed by 194.26.196.139 port 18937\r\n' dstack._internal.core.services.ssh.tunnel:63
DEBUG Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.029451s dstack._internal.server.app:175
[14:19:45] DEBUG Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.024377s dstack._internal.server.app:175
DEBUG SSH tunnel failed: b'ssh: connect to host 194.26.196.139 port 18937: Connection refused\r\n' dstack._internal.core.services.ssh.tunnel:63
[14:19:46] DEBUG Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.027882s dstack._internal.server.app:175
DEBUG SSH tunnel failed: b'ssh: connect to host 194.26.196.139 port 18937: Connection refused\r\n' dstack._internal.core.services.ssh.tunnel:63
WARNING job(8ad8c5)witty-sloth-1-0-0: failed because runner is not available or return an error, age=0:01:24.534835 dstack._internal.server.background.tasks.process_running_jobs:219
INFO run(b1bd94)witty-sloth-1: run status has changed RUNNING -> PENDING dstack._internal.server.background.tasks.process_runs:308
[14:19:47] DEBUG Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.028204s dstack._internal.server.app:175
DEBUG job(8ad8c5)witty-sloth-1-0-0: stopping container dstack._internal.server.services.jobs:201
INFO job(8ad8c5)witty-sloth-1-0-0: instance 'witty-sloth-1-0-0' has been released, new status is TERMINATING dstack._internal.server.services.jobs:217
INFO job(8ad8c5)witty-sloth-1-0-0: job status is FAILED, reason: INTERRUPTED_BY_NO_CAPACITY dstack._internal.server.services.jobs:234
[14:19:48] DEBUG Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.027822s dstack._internal.server.app:175
[14:19:49] DEBUG Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.027992s dstack._internal.server.app:175
INFO run(b1bd94)witty-sloth-1: run status has changed PENDING -> SUBMITTED dstack._internal.server.background.tasks.process_runs:172
[14:19:50] DEBUG Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.026344s dstack._internal.server.app:175
[14:19:51] DEBUG job(6b4836)witty-sloth-1-0-0: provisioning has started dstack._internal.server.background.tasks.process_submitted_jobs:97
DEBUG job(6b4836)witty-sloth-1-0-0: trying NVIDIA RTX A4000 in runpod/EUR-IS-1 for $0.1500 per hour dstack._internal.server.background.tasks.process_submitted_jobs:263
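The two logs differ in the job termination reason: INTERRUPTED_BY_NO_CAPACITY in the run that was re-submitted (RUNNING -> PENDING), and CONTAINER_EXITED_WITH_ERROR in the run that was not (RUNNING -> TERMINATING -> FAILED). The sketch below shows how a retry decision keyed on that reason would produce exactly these two outcomes. It is a simplified model for illustration only, not dstack's actual implementation; `RETRYABLE_REASONS` and `next_run_status` are hypothetical names.

```python
# Hypothetical model, not dstack code: which run status follows a failed job.
# Assumption drawn from the two logs: only a capacity interruption is retried.
RETRYABLE_REASONS = {"INTERRUPTED_BY_NO_CAPACITY"}


def next_run_status(job_failure_reason, retry_enabled):
    """Return the run status that would follow a failed job in this simplified model."""
    if retry_enabled and job_failure_reason in RETRYABLE_REASONS:
        return "PENDING"  # the run is queued for re-submission
    return "FAILED"  # the run ends, even though --retry was requested


print(next_run_status("INTERRUPTED_BY_NO_CAPACITY", retry_enabled=True))
print(next_run_status("CONTAINER_EXITED_WITH_ERROR", retry_enabled=True))
```

Under this model, the regression would come down to a terminated spot pod being reported with a different reason after the rebase, which is one possible explanation and not something the logs confirm on their own.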

dstack version

dstack 0.18.0

Server logs

No response

Additional information

No response

Bihan, Apr 16 '24 09:04

Could you please include the output of the dstack run command?

peterschmidt85, Apr 16 '24 10:04

> Could you please include the output of the dstack run command?

@peterschmidt85 To exit, press Ctrl+C.Run failed with error code JobTerminationReason.CONTAINER_EXITED_WITH_ERROR. Check CLI and server logs for more details.

Bihan, Apr 16 '24 10:04

@Bihan I mean the entire output, including everything before the confirmation line: the plan and everything above it.

peterschmidt85, Apr 16 '24 10:04

> @Bihan I mean the entire output, including everything before the confirmation line: the plan and everything above it.

@peterschmidt85

(venv) bihan@Bihans-MacBook-Pro config_files_for_test % dstack run . -b runpod --gpu 1 --spot --retry
 Configuration          .dstack.yml                         
 Project                main                                
 User                   admin                               
 Pool name              default-pool                        
 Min resources          2..xCPU, 8GB.., 1xGPU, 100GB (disk) 
 Max price              -                                   
 Max duration           6h                                  
 Spot policy            spot                                
 Retry policy           yes                                 
 Creation policy        reuse-or-create                     
 Termination policy     destroy-after-idle                  
 Termination idle time  300s                                

 #  BACKEND  REGION    INSTANCE                        RESOURCES                                       SPOT  PRICE   
 1  runpod   EUR-IS-1  NVIDIA RTX A4000                16xCPU, 31GB, 1xRTXA4000 (16GB), 100GB (disk)   yes   $0.15   
 2  runpod   EU-RO-1   NVIDIA RTX 4000 Ada Generation  18xCPU, 100GB, 1xRTX4000 (20GB), 100GB (disk)   yes   $0.15   
 3  runpod   EUR-NO-1  NVIDIA RTX A4000                48xCPU, 184GB, 1xRTXA4000 (16GB), 100GB (disk)  yes   $0.15   
    ...                                                                                                              
 Shown 3 of 158 offers, $2.49 max

Continue? [y/n]: y
plastic-cobra-1 provisioning completed (running)
/tmp/vscode-server- 100%[===================>]  54.80M   111MB/s    in 0.5s    
mkdir: created directory '/root/.vscode-server'
mkdir: created directory '/root/.vscode-server/bin'
mkdir: created directory '/root/.vscode-server/bin/863d2581ecda6849923a2118d93a088b0745d9d6'
Installing extensions...
Installing extension 'ms-python.python'...
Installing extension 'ms-toolsai.jupyter'...
Extension 'ms-toolsai.jupyter-keymap' v1.1.2 was successfully installed.
Extension 'ms-toolsai.vscode-jupyter-slideshow' v0.1.5 was successfully installed.
Extension 'ms-toolsai.vscode-jupyter-cell-tags' v0.1.8 was successfully installed.
Extension 'ms-toolsai.jupyter' v2024.2.0 was successfully installed.
Extension 'ms-toolsai.jupyter-renderers' v1.0.17 was successfully installed.
Extension 'ms-python.debugpy' v2024.4.0 was successfully installed.
Extension 'ms-python.python' v2024.4.1 was successfully installed.
Extension 'ms-python.vscode-pylance' v2024.4.1 was successfully installed.
pip install ipykernel...

To open in VS Code Desktop, use link below:

  vscode://vscode-remote/ssh-remote+plastic-cobra-1/workflow

To connect via SSH, use: `ssh plastic-cobra-1`

To exit, press Ctrl+C.Run failed with error code JobTerminationReason.CONTAINER_EXITED_WITH_ERROR. Check CLI and server logs for more details.
(venv) bihan@Bihans-MacBook-Pro config_files_for_test % 

Bihan, Apr 16 '24 10:04

Thank you!

peterschmidt85, Apr 16 '24 11:04