[Bug]: The Runpod provider infinitely restarts the pod if the docker args script is incorrect
Steps to reproduce
- Make an intentional error in the docker args script. For example, add the command `exit 1` or `kurl "https://dstack.ai"` to the script
- Run `dstack run` with any configuration
Actual behaviour
- The pod in the RunPod provider starts restarting endlessly. This can be seen in the RunPod web interface.
- `dstack server` is stuck with a task in the `provisioning` state
- `dstack server` does not respond to the `dstack stop` command (!)
Expected behaviour
The pod should be terminated after the first unsuccessful attempt to start it.
dstack version
0.18.0
Server logs
No response
Additional information
No response
@TheBits @Bihan any ideas on how to fix this? How can we detect that there was an unsuccessful attempt to start the pod?
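One possible detection signal is the pod logs themselves: a crash loop shows up as the same failure message repeating under fresh timestamps. Below is a minimal heuristic sketch of that idea; `looks_like_restart_loop` is a hypothetical helper, not anything in the dstack codebase.

```python
from collections import Counter

def looks_like_restart_loop(log_lines, threshold=3):
    """Heuristic (assumption, not dstack's actual logic): if the same
    message keeps reappearing under different timestamps, the container
    entrypoint is likely crash-looping."""
    # Strip the leading ISO timestamp, keep only the message body.
    messages = Counter(
        line.split(" ", 1)[1] for line in log_lines if " " in line
    )
    return any(count >= threshold for count in messages.values())

logs = [
    "2024-04-24T05:07:04Z bash: failll: command not found",
    "2024-04-24T05:07:11Z bash: failll: command not found",
    "2024-04-24T05:07:27Z bash: failll: command not found",
]
print(looks_like_restart_loop(logs))  # True
```

A real check would also need to confirm the pod never reached a ready state, since a healthy workload can legitimately log repeated lines.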
@TheBits
Hi, I tried to recreate the bug by adding `exit 1` to the docker arguments in:

```python
def get_docker_args(authorized_keys):
    commands = get_docker_commands(authorized_keys, False)
    command = " && ".join(commands)
    command_escaped = command.replace('"', '\\"')
    command_escaped = command_escaped.replace("'", '\\"')
    command_escaped = command_escaped.replace("\n", "\\n")
    return f"bash -c '{command_escaped} && exit 1'"
```
Refer
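To see why this variant does not fail the pod, the same construction can be reproduced locally without a pod. Here is a simplified sketch (assumption: `get_docker_commands` just returns a list of shell command strings, so it is replaced with a literal list, and the quote/newline escaping is omitted since these test commands contain none):

```python
import subprocess

def build_docker_args(commands):
    # Simplified stand-in for get_docker_args: join the commands with '&&'
    # and append 'exit 1', as in the snippet above.
    return "bash -c '" + " && ".join(commands) + " && exit 1'"

args = build_docker_args(["echo start", "failll"])
result = subprocess.run(args, shell=True)
# 'failll' is not found, so bash exits with 127 and the trailing
# 'exit 1' never runs -- and if every command succeeds, bash never
# reaches a failing state until the very end of the script.
print(result.returncode)  # 127
```

Because `&&` short-circuits, appending `exit 1` at the end only triggers after all preceding commands (including the long-running runner) have succeeded, which is why the pod did not restart in this experiment.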
I also tried `curl "https://dstack.ai"`, but the pod did not restart. Did I miss something?
@TheBits, @Bihan
I also tried running tasks with bad commands, e.g.:
```yaml
type: task
commands:
  - errorcommand
```
The job fails as expected. The runpod instance is terminated.
```text
✗ dstack run . -f .dstack/confs/task.yaml -b runpod
 Configuration          .dstack/confs/task.yaml
 Project                main
 User                   admin
 Pool name              default-pool
 Min resources          2..xCPU, 8GB.., 100GB.. (disk)
 Max price              -
 Max duration           72h
 Spot policy            auto
 Retry policy           no
 Creation policy        reuse-or-create
 Termination policy     destroy-after-idle
 Termination idle time  300s

 #  BACKEND  REGION    INSTANCE          RESOURCES                SPOT  PRICE
 1  runpod   EUR-IS-1  NVIDIA RTX A4000  16xCPU, 31GB,            yes   $0.19
                                         1xRTXA4000 (16GB),
                                         100GB (disk)
 2  runpod   EU-SE-1   NVIDIA RTX A4000  9xCPU, 50GB, 1xRTXA4000  yes   $0.19
                                         (16GB), 100GB (disk)
 3  runpod   EUR-NO-1  NVIDIA RTX A4000  6xCPU, 23GB, 1xRTXA4000  yes   $0.19
                                         (16GB), 100GB (disk)
    ...
 Shown 3 of 250 offers, $37.52 max

Continue? [y/n]: y
bright-goat-1 provisioning completed (terminating)
Run failed with error code JobTerminationReason.CONTAINER_EXITED_WITH_ERROR.
Check CLI and server logs for more details.
(venv) ➜  my_dstack_public git:(master) ✗ dstack ps
 NAME           BACKEND  REGION    RESOURCES   SPOT  PRICE  STATUS  SUBMITTED
 bright-goat-1  runpod   EUR-NO-1  6xCPU,      yes   $0.19  failed  3 mins ago
                                   23GB,
                                   1xRTXA4000
                                   (16GB),
                                   100GB
                                   (disk)
(venv) ➜  my_dstack_public git:(master) ✗ dstack logs bright-goat-1
bash: errorcommand: command not found
```
@TheBits, please see if you can still reproduce the issue and provide the specific steps. Please close if it cannot be reproduced.
Ok, after adding a non-existent command to `get_docker_commands()` I managed to reproduce the issue. RunPod tries to rerun the docker args over and over. Here are the pod logs:
```text
2024-04-24T05:07:04.637000785Z bash: failll: command not found
2024-04-24T05:07:11.849114593Z bash: failll: command not found
2024-04-24T05:07:27.842020164Z bash: failll: command not found
2024-04-24T05:07:43.806167622Z bash: failll: command not found
2024-04-24T05:07:59.809122299Z bash: failll: command not found
2024-04-24T05:08:15.816833559Z bash: failll: command not found
2024-04-24T05:08:31.683330842Z bash: failll: command not found
2024-04-24T05:08:47.613584462Z bash: failll: command not found
```
This behavior is not desirable since we don't fail the pod fast. But after #1149 it's not critical: there is a pod waiting timeout after which the pod will be terminated.
@Bihan
In `return f"bash -c '{command_escaped} && exit 1'"`, the `exit 1` is the last command, so it only runs after the runner finishes. Put `exit 1` as the first command in the script instead:

```python
return f"bash -c 'exit 1; {command_escaped}'"
```
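The ordering matters because `&&` short-circuits while `;` does not. This can be sanity-checked with a couple of throwaway subprocess calls:

```python
import subprocess

# With '&&', 'exit 1' runs only after everything before it succeeded;
# with ';' up front, 'exit 1' runs first and the rest is skipped.
tail = subprocess.run("bash -c 'true && exit 1'", shell=True)
head = subprocess.run("bash -c 'exit 1; echo never'", shell=True,
                      capture_output=True)
print(tail.returncode)  # 1
print(head.returncode)  # 1
print(head.stdout)      # b'' -- 'echo never' did not run
```

So moving `exit 1` to the front makes the script fail immediately on every start, which is exactly what is needed to trigger the restart loop.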
@r4victor The command `errorcommand` is part of your user script, so it runs and fails inside `dstack-runner`. After `dstack-runner` finishes, we terminate the container. As a result, there is no issue in that case.
@TheBits, see my last comment https://github.com/dstackai/dstack/issues/1142#issuecomment-2074042344, where I described what happens when a bad command is put in the docker args directly.
This issue is stale because it has been open for 30 days with no activity.