
[Bug]: The Runpod provider infinitely restarts the pod if the docker args script is incorrect

Open TheBits opened this issue 1 year ago • 8 comments

Steps to reproduce

  • Introduce an intentional error into the docker args script. For example, add the command exit 1 or kurl "https://dstack.ai" to the script
  • Run dstack run with any configuration

Actual behaviour

  1. The pod on the RunPod provider restarts endlessly. This can be seen in the RunPod web interface.
  2. The dstack server is stuck with the task in the provisioning state
  3. The dstack server does not respond to the dstack stop command (!)

Expected behaviour

The pod should be terminated after the first unsuccessful attempt to start it

dstack version

0.18.0

Server logs

No response

Additional information

No response

TheBits avatar Apr 16 '24 07:04 TheBits

@TheBits @Bihan any ideas how to fix this? How to detect there was an unsuccessful attempt to start the pod?

r4victor avatar Apr 19 '24 05:04 r4victor

@TheBits Hi, I tried to reproduce the bug by adding exit 1 to the docker arguments in get_docker_args:

def get_docker_args(authorized_keys):
    commands = get_docker_commands(authorized_keys, False)
    command = " && ".join(commands)
    command_escaped = command.replace('"', '\\"')
    command_escaped = command_escaped.replace("'", '\\"')
    command_escaped = command_escaped.replace("\n", "\\n")
    return f"bash -c '{command_escaped} && exit 1'"

I also tried curl "https://dstack.ai", but the pod did not restart. Did I miss something?

Bihan avatar Apr 19 '24 14:04 Bihan

@TheBits, @Bihan

I also tried running tasks with bad commands, e.g.:

type: task
commands:
  - errorcommand

The job fails as expected, and the RunPod instance is terminated.

✗ dstack run . -f .dstack/confs/task.yaml -b runpod
 Configuration          .dstack/confs/task.yaml        
 Project                main                           
 User                   admin                          
 Pool name              default-pool                   
 Min resources          2..xCPU, 8GB.., 100GB.. (disk) 
 Max price              -                              
 Max duration           72h                            
 Spot policy            auto                           
 Retry policy           no                             
 Creation policy        reuse-or-create                
 Termination policy     destroy-after-idle             
 Termination idle time  300s                           

 #  BACKEND  REGION    INSTANCE          RESOURCES                SPOT  PRICE   
 1  runpod   EUR-IS-1  NVIDIA RTX A4000  16xCPU, 31GB,            yes   $0.19   
                                         1xRTXA4000 (16GB),                     
                                         100GB (disk)                           
 2  runpod   EU-SE-1   NVIDIA RTX A4000  9xCPU, 50GB, 1xRTXA4000  yes   $0.19   
                                         (16GB), 100GB (disk)                   
 3  runpod   EUR-NO-1  NVIDIA RTX A4000  6xCPU, 23GB, 1xRTXA4000  yes   $0.19   
                                         (16GB), 100GB (disk)                   
    ...                                                                         
 Shown 3 of 250 offers, $37.52 max

Continue? [y/n]: y
bright-goat-1 provisioning completed (terminating)
Run failed with error code JobTerminationReason.CONTAINER_EXITED_WITH_ERROR. 
Check CLI and server logs for more details.
(venv) ➜  my_dstack_public git:(master) ✗ dstack ps
 NAME           BACKEND  REGION    RESOURCES    SPOT  PRICE  STATUS  SUBMITTED  
 bright-goat-1  runpod   EUR-NO-1  6xCPU,       yes   $0.19  failed  3 mins ago 
                                   23GB,                                        
                                   1xRTXA4000                                   
                                   (16GB),                                      
                                   100GB                                        
                                   (disk)                                       
(venv) ➜  my_dstack_public git:(master) ✗ dstack logs bright-goat-1
bash: errorcommand: command not found

r4victor avatar Apr 22 '24 12:04 r4victor

@TheBits, please see if you can still reproduce the issue and provide the specific steps. Please close if it cannot be reproduced.

r4victor avatar Apr 22 '24 12:04 r4victor

OK, after adding a non-existent command to get_docker_commands(), I managed to reproduce the issue. RunPod tries to rerun the docker_args over and over. Here are the pod logs:

2024-04-24T05:07:04.637000785Z bash: failll: command not found
2024-04-24T05:07:11.849114593Z bash: failll: command not found
2024-04-24T05:07:27.842020164Z bash: failll: command not found
2024-04-24T05:07:43.806167622Z bash: failll: command not found
2024-04-24T05:07:59.809122299Z bash: failll: command not found
2024-04-24T05:08:15.816833559Z bash: failll: command not found
2024-04-24T05:08:31.683330842Z bash: failll: command not found
2024-04-24T05:08:47.613584462Z bash: failll: command not found
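
The log pattern above hints at one way to detect the failed start: the same error line repeats every few seconds. Below is a minimal sketch, assuming the pod logs are available as lines of the form `<timestamp> <message>`. The function names, the `min_repeats` threshold, and the `max_gap` window are made up for illustration; this is not dstack or RunPod code.

```python
from datetime import datetime, timedelta


def parse_log_line(line):
    """Split '<timestamp> <message>' into (datetime, message)."""
    stamp, message = line.split(" ", 1)
    # The timestamps carry nanoseconds; %f accepts at most 6 digits, so truncate.
    head, _, frac = stamp.rstrip("Z").partition(".")
    parsed = datetime.strptime(f"{head}.{(frac or '0')[:6]}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed, message.strip()


def looks_like_restart_loop(lines, min_repeats=3, max_gap=timedelta(seconds=60)):
    """True if one message repeats min_repeats times in a row, each within max_gap."""
    streak = 0
    previous = None
    for line in lines:
        current = parse_log_line(line)
        same_message = previous is not None and current[1] == previous[1]
        if same_message and current[0] - previous[0] <= max_gap:
            streak += 1
        else:
            streak = 1  # a new message (or a long pause) starts a new streak
        if streak >= min_repeats:
            return True
        previous = current
    return False


# First three lines of the pod logs quoted in this comment.
POD_LOGS = [
    "2024-04-24T05:07:04.637000785Z bash: failll: command not found",
    "2024-04-24T05:07:11.849114593Z bash: failll: command not found",
    "2024-04-24T05:07:27.842020164Z bash: failll: command not found",
]
print(looks_like_restart_loop(POD_LOGS))  # True
```

A check like this is only a heuristic: a healthy container can also print the same line repeatedly, so the thresholds would need tuning.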

Rerunning docker_args like this is undesirable because the pod does not fail fast. After #1149 it is no longer critical, though: there is a pod waiting timeout after which the pod will be terminated.
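
A waiting timeout of this kind could look roughly like the sketch below. It is not the actual change from #1149; `is_ready` and `terminate` are hypothetical callbacks standing in for whatever checks and terminates the pod, and the timeout values are arbitrary.

```python
import time


def wait_or_terminate(is_ready, terminate, timeout, interval=1.0):
    """Poll is_ready() until it returns True or timeout seconds pass.

    Returns True if the pod became ready. Otherwise calls terminate()
    once and returns False.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            break
        time.sleep(interval)
    terminate()
    return False


# Example: a pod that never becomes ready gets terminated after the timeout.
terminated = []
ok = wait_or_terminate(
    is_ready=lambda: False,
    terminate=lambda: terminated.append("pod"),
    timeout=0.05,
    interval=0.01,
)
print(ok, terminated)
```

With a timeout the pod is still not failed fast, but the restart loop is at least bounded.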

r4victor avatar Apr 24 '24 05:04 r4victor

@bihan In return f"bash -c '{command_escaped} && exit 1'", exit 1 is the last command: it comes after the runner. Put exit 1 first in the script instead: return f"bash -c 'exit 1; {command_escaped}'".
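
To illustrate the difference, here is a small sketch that runs both orderings locally. `echo runner-started` is only a stand-in for the real runner commands, and the script is passed to the shell directly rather than through docker args, so treat it as a demonstration of the ordering only.

```python
import shutil
import subprocess

# Prefer bash, as in the thread; fall back to sh if bash is not installed.
SHELL = shutil.which("bash") or "sh"

# Stand-in for the commands that start the runner (an assumption for this demo).
RUNNER = "echo runner-started"


def run(script):
    """Run script with '<shell> -c' and return (exit code, stdout)."""
    result = subprocess.run([SHELL, "-c", script], capture_output=True, text=True)
    return result.returncode, result.stdout.strip()


# exit 1 placed last: the runner commands execute first.
exit_last = run(f"{RUNNER} && exit 1")

# exit 1 placed first: the shell exits before the runner commands are reached.
exit_first = run(f"exit 1; {RUNNER}")

print(exit_last)   # (1, 'runner-started')
print(exit_first)  # (1, '')
```

Both scripts end with exit code 1, but only the second one fails before the runner starts, which is the case this issue is about.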

TheBits avatar Apr 24 '24 05:04 TheBits

@r4victor The errorcommand command is part of your user script, so it runs and fails inside dstack-runner. Once dstack-runner finishes, we terminate the container, so the issue does not occur in that case.

TheBits avatar Apr 24 '24 05:04 TheBits

@TheBits, see my last comment https://github.com/dstackai/dstack/issues/1142#issuecomment-2074042344

I described what happens when putting a bad command in docker_args directly.

r4victor avatar Apr 24 '24 06:04 r4victor

This issue is stale because it has been open for 30 days with no activity.

peterschmidt85 avatar May 25 '24 01:05 peterschmidt85