[Bug]: Retry Policy not re-submitting run with Runpod spot provisioning.

Open Bihan opened this issue 1 year ago • 5 comments

Steps to reproduce

PR 1119 adds spot provisioning for Runpod. With that branch, provisioning with dstack run . -b runpod --gpu 1 --spot --retry would re-submit the run when the pod was terminated from the web console.

Actual behaviour

After rebasing the PR 1119 branch onto the latest master (dstack 0.18.0), the run is no longer re-submitted. See the logs below:

[14:30:55] DEBUG dstack._internal.server.background.tasks.process_running_jobs:207 job(204262)strange-bear-1-0-0: process running job, age=0:01:47.196351
[14:30:59] DEBUG dstack._internal.server.background.tasks.process_running_jobs:207 job(204262)strange-bear-1-0-0: process running job, age=0:01:51.190070
[14:31:03] DEBUG dstack._internal.server.background.tasks.process_running_jobs:207 job(204262)strange-bear-1-0-0: process running job, age=0:01:55.199926
[14:31:06] DEBUG dstack._internal.server.app:176 Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.027767s
[14:31:07] DEBUG dstack._internal.server.app:176 Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.026637s
           DEBUG dstack._internal.server.background.tasks.process_running_jobs:207 job(204262)strange-bear-1-0-0: process running job, age=0:01:59.202287
[14:31:08] DEBUG dstack._internal.server.app:176 Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.030931s
[14:31:09] DEBUG dstack._internal.server.app:176 Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.024694s
           INFO dstack._internal.server.background.tasks.process_running_jobs:448 job(204262)strange-bear-1-0-0: now is TERMINATING
           INFO dstack._internal.server.background.tasks.process_runs:308 run(4c911b)strange-bear-1: run status has changed RUNNING -> TERMINATING
[14:31:10] DEBUG dstack._internal.server.app:176 Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.027628s
[14:31:11] DEBUG dstack._internal.server.app:176 Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.025220s
           DEBUG dstack._internal.server.services.jobs:201 job(204262)strange-bear-1-0-0: stopping container
           INFO dstack._internal.server.services.jobs:217 job(204262)strange-bear-1-0-0: instance 'strange-bear-1-0-0' has been released, new status is TERMINATING
           INFO dstack._internal.server.services.jobs:234 job(204262)strange-bear-1-0-0: job status is FAILED, reason: CONTAINER_EXITED_WITH_ERROR
[14:31:12] DEBUG dstack._internal.server.app:176 Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.025749s
           INFO dstack._internal.server.services.runs:811 run(4c911b)strange-bear-1: run status has changed TERMINATING -> FAILED, reason: JOB_FAILED
[14:31:13] DEBUG dstack._internal.server.app:176 Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.026232s
[14:31:16] DEBUG dstack._internal.core.backends.runpod.compute:121 The instance with name oqwc4y4mbnl1yt not found
           INFO dstack._internal.server.background.tasks.process_instances:429 instance strange-bear-1-0-0 terminated
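
Most of the log above is request noise. To see only the status changes, the run and job transitions can be pulled out with a small script. This is a minimal sketch, assuming the log text has the wording shown above; extract_transitions is a hypothetical helper written for this issue, not part of dstack.

```python
import re

# "run status has changed A -> B", optionally followed by ", reason: R"
RUN_RE = re.compile(r"run status has changed (\w+) -> (\w+)(?:, reason: (\w+))?")
# "job status is S, reason: R"; \s+ tolerates a line wrap inside the message
JOB_RE = re.compile(r"job\s+status\s+is\s+(\w+), reason: (\w+)")


def extract_transitions(log_text):
    """Return (kind, status, reason) tuples in the order they appear in the log."""
    events = []
    for m in RUN_RE.finditer(log_text):
        events.append((m.start(), "run", f"{m.group(1)} -> {m.group(2)}", m.group(3)))
    for m in JOB_RE.finditer(log_text):
        events.append((m.start(), "job", m.group(1), m.group(2)))
    # Match offsets are unique, so sorting orders events by position only
    events.sort(key=lambda e: e[0])
    return [e[1:] for e in events]


if __name__ == "__main__":
    import sys

    # Usage: python transitions.py < server.log
    for kind, status, reason in extract_transitions(sys.stdin.read()):
        print(kind, status, reason or "")
```

Running it on both the failing and the working server log makes it easier to compare the sequence of states side by side.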

Expected behaviour

Before rebasing onto dstack 0.18.0, the run was re-submitted, as shown below:

[14:19:35] DEBUG job(8ad8c5)witty-sloth-1-0-0: process running job, age=0:01:13.342728 dstack._internal.server.background.tasks.process_running_jobs:207
[14:19:39] DEBUG job(8ad8c5)witty-sloth-1-0-0: process running job, age=0:01:17.343550 dstack._internal.server.background.tasks.process_running_jobs:207
[14:19:43] DEBUG job(8ad8c5)witty-sloth-1-0-0: process running job, age=0:01:21.338029 dstack._internal.server.background.tasks.process_running_jobs:207
[14:19:44] DEBUG SSH tunnel failed: b'Connection closed by 194.26.196.139 port 18937\r\n' dstack._internal.core.services.ssh.tunnel:63
           DEBUG Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.029451s dstack._internal.server.app:175
[14:19:45] DEBUG Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.024377s dstack._internal.server.app:175
           DEBUG SSH tunnel failed: b'ssh: connect to host 194.26.196.139 port 18937: Connection refused\r\n' dstack._internal.core.services.ssh.tunnel:63
[14:19:46] DEBUG Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.027882s dstack._internal.server.app:175
           DEBUG SSH tunnel failed: b'ssh: connect to host 194.26.196.139 port 18937: Connection refused\r\n' dstack._internal.core.services.ssh.tunnel:63
           WARNING job(8ad8c5)witty-sloth-1-0-0: failed because runner is not available or return an error, age=0:01:24.534835 dstack._internal.server.background.tasks.process_running_jobs:219
           INFO run(b1bd94)witty-sloth-1: run status has changed RUNNING -> PENDING dstack._internal.server.background.tasks.process_runs:308
[14:19:47] DEBUG Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.028204s dstack._internal.server.app:175
           DEBUG job(8ad8c5)witty-sloth-1-0-0: stopping container dstack._internal.server.services.jobs:201
           INFO job(8ad8c5)witty-sloth-1-0-0: instance 'witty-sloth-1-0-0' has been released, new status is TERMINATING dstack._internal.server.services.jobs:217
           INFO job(8ad8c5)witty-sloth-1-0-0: job status is FAILED, reason: INTERRUPTED_BY_NO_CAPACITY dstack._internal.server.services.jobs:234
[14:19:48] DEBUG Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.027822s dstack._internal.server.app:175
[14:19:49] DEBUG Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.027992s dstack._internal.server.app:175
           INFO run(b1bd94)witty-sloth-1: run status has changed PENDING -> SUBMITTED dstack._internal.server.background.tasks.process_runs:172
[14:19:50] DEBUG Processed request POST http://127.0.0.1:3000/api/project/main/runs/get in 0.026344s dstack._internal.server.app:175
[14:19:51] DEBUG job(6b4836)witty-sloth-1-0-0: provisioning has started dstack._internal.server.background.tasks.process_submitted_jobs:97
           DEBUG job(6b4836)witty-sloth-1-0-0: trying NVIDIA RTX A4000 in runpod/EUR-IS-1 for $0.1500 per hour dstack._internal.server.background.tasks.process_submitted_jobs:263
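
The visible difference between the two logs is the job termination reason: INTERRUPTED_BY_NO_CAPACITY before the rebase (run goes RUNNING -> PENDING and is re-submitted) versus CONTAINER_EXITED_WITH_ERROR after it (run ends as FAILED). One possible reading, sketched below, is that re-submission only happens for interruption-type reasons. This is an illustration of the observed behaviour under that assumption, not dstack's actual implementation; Reason, RETRYABLE and next_run_status are hypothetical names.

```python
from enum import Enum


class Reason(Enum):
    # Only the two job termination reasons that appear in the logs above
    INTERRUPTED_BY_NO_CAPACITY = "interrupted_by_no_capacity"
    CONTAINER_EXITED_WITH_ERROR = "container_exited_with_error"


# Assumption: only interruption-type reasons qualify for re-submission
RETRYABLE = {Reason.INTERRUPTED_BY_NO_CAPACITY}


def next_run_status(reason, retry_enabled):
    """Return the run status the logs show after a job fails with `reason`."""
    if retry_enabled and reason in RETRYABLE:
        return "PENDING"  # run is queued again, as in the pre-rebase log
    return "FAILED"  # run ends (via TERMINATING), as in the 0.18.0 log


print(next_run_status(Reason.INTERRUPTED_BY_NO_CAPACITY, retry_enabled=True))
print(next_run_status(Reason.CONTAINER_EXITED_WITH_ERROR, retry_enabled=True))
```

If this reading is right, the question for this issue is why a pod terminated from the web console is now reported as CONTAINER_EXITED_WITH_ERROR rather than as an interruption.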

dstack version

dstack 0.18.0

Server logs

No response

Additional information

No response

Bihan avatar Apr 16 '24 09:04 Bihan

Could you please include the output of the dstack run command?

peterschmidt85 avatar Apr 16 '24 10:04 peterschmidt85

@peterschmidt85 To exit, press Ctrl+C.Run failed with error code JobTerminationReason.CONTAINER_EXITED_WITH_ERROR. Check CLI and server logs for more details.

Bihan avatar Apr 16 '24 10:04 Bihan

@Bihan I mean the entire output, including everything before the confirmation line: the plan and everything above it.

peterschmidt85 avatar Apr 16 '24 10:04 peterschmidt85

@peterschmidt85

(venv) bihan@Bihans-MacBook-Pro config_files_for_test % dstack run . -b runpod --gpu 1 --spot --retry
 Configuration          .dstack.yml                         
 Project                main                                
 User                   admin                               
 Pool name              default-pool                        
 Min resources          2..xCPU, 8GB.., 1xGPU, 100GB (disk) 
 Max price              -                                   
 Max duration           6h                                  
 Spot policy            spot                                
 Retry policy           yes                                 
 Creation policy        reuse-or-create                     
 Termination policy     destroy-after-idle                  
 Termination idle time  300s                                

 #  BACKEND  REGION    INSTANCE                        RESOURCES                                       SPOT  PRICE   
 1  runpod   EUR-IS-1  NVIDIA RTX A4000                16xCPU, 31GB, 1xRTXA4000 (16GB), 100GB (disk)   yes   $0.15   
 2  runpod   EU-RO-1   NVIDIA RTX 4000 Ada Generation  18xCPU, 100GB, 1xRTX4000 (20GB), 100GB (disk)   yes   $0.15   
 3  runpod   EUR-NO-1  NVIDIA RTX A4000                48xCPU, 184GB, 1xRTXA4000 (16GB), 100GB (disk)  yes   $0.15   
    ...                                                                                                              
 Shown 3 of 158 offers, $2.49 max

Continue? [y/n]: y
plastic-cobra-1 provisioning completed (running)
/tmp/vscode-server- 100%[===================>]  54.80M   111MB/s    in 0.5s    
mkdir: created directory '/root/.vscode-server'
mkdir: created directory '/root/.vscode-server/bin'
mkdir: created directory '/root/.vscode-server/bin/863d2581ecda6849923a2118d93a088b0745d9d6'
Installing extensions...
Installing extension 'ms-python.python'...
Installing extension 'ms-toolsai.jupyter'...
Extension 'ms-toolsai.jupyter-keymap' v1.1.2 was successfully installed.
Extension 'ms-toolsai.vscode-jupyter-slideshow' v0.1.5 was successfully installed.
Extension 'ms-toolsai.vscode-jupyter-cell-tags' v0.1.8 was successfully installed.
Extension 'ms-toolsai.jupyter' v2024.2.0 was successfully installed.
Extension 'ms-toolsai.jupyter-renderers' v1.0.17 was successfully installed.
Extension 'ms-python.debugpy' v2024.4.0 was successfully installed.
Extension 'ms-python.python' v2024.4.1 was successfully installed.
Extension 'ms-python.vscode-pylance' v2024.4.1 was successfully installed.
pip install ipykernel...

To open in VS Code Desktop, use link below:

  vscode://vscode-remote/ssh-remote+plastic-cobra-1/workflow

To connect via SSH, use: `ssh plastic-cobra-1`

To exit, press Ctrl+C.Run failed with error code JobTerminationReason.CONTAINER_EXITED_WITH_ERROR. Check CLI and server logs for more details.
(venv) bihan@Bihans-MacBook-Pro config_files_for_test % 

Bihan avatar Apr 16 '24 10:04 Bihan

Thank you!

peterschmidt85 avatar Apr 16 '24 11:04 peterschmidt85

This issue is stale because it has been open for 30 days with no activity.

peterschmidt85 avatar May 17 '24 01:05 peterschmidt85

This issue was closed because it has been inactive for 14 days since being marked as stale. Please reopen the issue if it is still relevant.

peterschmidt85 avatar May 31 '24 01:05 peterschmidt85