Container enters crash loop when machine restarts

Open opsxcq opened this issue 4 years ago • 8 comments

When the machine hosting the Docker container that runs wandb restarts, the container enters a crash loop. The only way to solve the problem is to remove the container and create it again.

Image id: 9b661d2d9510

*** Killing all processes...
*** Running /etc/my_init.d/00_regen_ssh_host_keys.sh...
*** Running /etc/my_init.d/01_enable-services.sh...
*** Copying services to runit
mv: cannot stat '/home/wandb/service/*': No such file or directory
mv: cannot stat '/home/wandb/wandb-logrotate': No such file or directory
*** Copying jobber template
*** Enabling production mode
ln: failed to create symbolic link '/etc/nginx/sites-enabled/wandb.conf': File exists
*** /etc/my_init.d/01_enable-services.sh failed with status 1

*** Killing all processes...
*** Running /etc/my_init.d/00_regen_ssh_host_keys.sh...
*** Running /etc/my_init.d/01_enable-services.sh...
*** Copying services to runit
mv: cannot stat '/home/wandb/service/*': No such file or directory
mv: cannot stat '/home/wandb/wandb-logrotate': No such file or directory
*** Copying jobber template
*** Enabling production mode
ln: failed to create symbolic link '/etc/nginx/sites-enabled/wandb.conf': File exists
*** /etc/my_init.d/01_enable-services.sh failed with status 1

*** Killing all processes...
*** Running /etc/my_init.d/00_regen_ssh_host_keys.sh...
*** Running /etc/my_init.d/01_enable-services.sh...
*** Copying services to runit
mv: cannot stat '/home/wandb/service/*': No such file or directory
mv: cannot stat '/home/wandb/wandb-logrotate': No such file or directory
*** Copying jobber template
*** Enabling production mode
ln: failed to create symbolic link '/etc/nginx/sites-enabled/wandb.conf': File exists
*** /etc/my_init.d/01_enable-services.sh failed with status 1

*** Killing all processes...
*** Running /etc/my_init.d/00_regen_ssh_host_keys.sh...
*** Running /etc/my_init.d/01_enable-services.sh...
*** Copying services to runit
mv: cannot stat '/home/wandb/service/*': No such file or directory
mv: cannot stat '/home/wandb/wandb-logrotate': No such file or directory
*** Copying jobber template
*** Enabling production mode
ln: failed to create symbolic link '/etc/nginx/sites-enabled/wandb.conf': File exists
*** /etc/my_init.d/01_enable-services.sh failed with status 1

*** Killing all processes...
*** Running /etc/my_init.d/00_regen_ssh_host_keys.sh...
*** Running /etc/my_init.d/01_enable-services.sh...
*** Copying services to runit
mv: cannot stat '/home/wandb/service/*': No such file or directory
mv: cannot stat '/home/wandb/wandb-logrotate': No such file or directory
*** Copying jobber template
*** Enabling production mode
ln: failed to create symbolic link '/etc/nginx/sites-enabled/wandb.conf': File exists
*** /etc/my_init.d/01_enable-services.sh failed with status 1

*** Killing all processes...

opsxcq avatar Sep 21 '21 12:09 opsxcq

Yeah, this is a known issue with the way Docker does restarts. You should be able to run the container as a system-level service, which should handle hard restarts. This blog has some good examples.

vanpelt avatar Sep 21 '21 18:09 vanpelt

Hey @vanpelt, I looked into the Docker image, because I was not able to find the corresponding repository for it. I found an init script at /etc/my_init.d/01_enable-services.sh, which contains this:

#!/bin/bash

# move all services to runit, was tricky to make this happen in docker without
# overwriting cron / sshd
echo "*** Copying services to runit"
mv /home/wandb/service/* /etc/service/
mv /home/wandb/wandb-logrotate /etc/logrotate.d/

I think I understand the setup process in /sbin/my_init, but I see no reason to move the files instead of copying them recursively. That would make that single step more resilient, wouldn't it?
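The difference between moving and copying shows up as soon as the same step runs a second time against the same filesystem, which is what happens when an existing container is restarted. Here is a minimal sketch, using scratch directories under a temp folder as stand-ins for the real image paths (all names below are made up for illustration):

```shell
#!/bin/bash
# Sketch: run a "move" step and a "copy" step twice each, as on a restart.
# The directories below are scratch stand-ins, not the real image paths.
work="$(mktemp -d)"
mkdir -p "$work/src_mv/gorilla" "$work/src_cp/gorilla" "$work/dst_mv" "$work/dst_cp"
echo "run" > "$work/src_mv/gorilla/run"
echo "run" > "$work/src_cp/gorilla/run"

# First "start": both variants succeed.
mv "$work/src_mv"/* "$work/dst_mv/"
cp -r "$work/src_cp/." "$work/dst_cp/"

# Second "start": the mv source is now empty, the cp source is untouched.
mv_status=0
mv "$work/src_mv"/* "$work/dst_mv/" 2>/dev/null || mv_status=$?
cp_status=0
cp -r "$work/src_cp/." "$work/dst_cp/" || cp_status=$?

echo "second mv exit status: $mv_status"
echo "second cp exit status: $cp_status"
```

The second mv has nothing left to move and exits non-zero, which matches the "cannot stat" lines in the log, while the second cp simply overwrites what is already there.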

Additionally, at the end of the file:

if [[ ! -z "${LOCAL_DEV}" ]]; then
    echo "*** Enabling development mode"
    touch /etc/service/gorilla/down
    ln -s /etc/nginx/sites-available/wandb-dev.conf /etc/nginx/sites-enabled/wandb.conf
else
    echo "*** Enabling production mode"
    ln -s /etc/nginx/sites-available/wandb-prod.conf /etc/nginx/sites-enabled/wandb.conf
fi

Is there a reason not to force the override, like this:

ln -sf /etc/nginx/sites-available/wandb-prod.conf /etc/nginx/sites-enabled/wandb.conf
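For anyone who wants to see the effect of -f in isolation, here is a small sketch that creates the link twice in a temp directory. The file names mirror the ones in the script, but the directory is a throwaway stand-in. In the log above the script keeps running after the mv errors and only reports status 1 right after the ln error, so the plain ln -s looks like the step that actually fails the init script, though that is an inference from the log rather than something confirmed from the source.

```shell
#!/bin/bash
# Sketch: create the same symlink twice, first without and then with -f.
# "$work" is a scratch directory standing in for /etc/nginx.
work="$(mktemp -d)"
touch "$work/wandb-prod.conf"
link="$work/wandb.conf"

# First "start": the link does not exist yet, so plain -s succeeds.
ln -s "$work/wandb-prod.conf" "$link"

# Second "start": plain -s fails because the link is already there.
plain_status=0
ln -s "$work/wandb-prod.conf" "$link" 2>/dev/null || plain_status=$?

# With -f the existing link is replaced and the command succeeds.
forced_status=0
ln -sf "$work/wandb-prod.conf" "$link" || forced_status=$?

echo "plain ln -s exit status:  $plain_status"
echo "forced ln -sf exit status: $forced_status"
```

If the link ever pointed at a directory instead of a file, adding -n (ln -sfn) would keep ln from following the existing link and creating the new one inside that directory.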

byteSamurai avatar Feb 01 '22 14:02 byteSamurai

Hey @opsxcq, both great suggestions. We'll look into implementing these in a future release.

vanpelt avatar Feb 01 '22 23:02 vanpelt

@vanpelt If you tell me the repository, I would be happy to send you a tested PR.

byteSamurai avatar Feb 02 '22 07:02 byteSamurai

Thanks @byteSamurai, the repo containing the source is currently private. I created a gist with the init scripts. I can apply the diff and be sure to mention you in the release notes!

vanpelt avatar Feb 02 '22 07:02 vanpelt

@vanpelt Sorry for the delay. I changed the discussed lines here. How I tested:

  • Created a local image, overwriting only /etc/my_init.d/01_enable-services.sh
  • Started and restarted the container several times. That works, so I consider it fixed.

What do you think?
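A rough way to reproduce the restart check without Docker is to run a restart-safe version of the step several times against a scratch root. This is only a sketch: ROOT and enable_services are hypothetical names, the function covers just the lines discussed in this thread (production branch only), and it is not the script from the gist.

```shell
#!/bin/bash
# Sketch: rerun a restart-safe version of the init step against a scratch root.
# ROOT and enable_services are hypothetical names used only for this example.
ROOT="$(mktemp -d)"
mkdir -p "$ROOT/home/wandb/service/gorilla" \
         "$ROOT/etc/service" "$ROOT/etc/logrotate.d" \
         "$ROOT/etc/nginx/sites-available" "$ROOT/etc/nginx/sites-enabled"
echo "run" > "$ROOT/home/wandb/service/gorilla/run"
touch "$ROOT/home/wandb/wandb-logrotate" \
      "$ROOT/etc/nginx/sites-available/wandb-prod.conf"

enable_services() {
    # Copy instead of move so the sources survive for the next start.
    cp -r "$ROOT/home/wandb/service/." "$ROOT/etc/service/" || return 1
    cp "$ROOT/home/wandb/wandb-logrotate" "$ROOT/etc/logrotate.d/" || return 1
    # -f replaces the link left behind by the previous start.
    ln -sf "$ROOT/etc/nginx/sites-available/wandb-prod.conf" \
           "$ROOT/etc/nginx/sites-enabled/wandb.conf"
}

# Three runs against the same root stand in for one start and two restarts.
failures=0
for attempt in 1 2 3; do
    enable_services || failures=$((failures + 1))
done
echo "failed starts: $failures"
```

Restarting an existing container reuses its filesystem, so repeated runs against the same root are a reasonable stand-in for repeated restarts.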

byteSamurai avatar Feb 03 '22 14:02 byteSamurai

Awesome! We'll get this into the next release.

vanpelt avatar Feb 04 '22 04:02 vanpelt

Cool, could you let me know when this is the case, @vanpelt? Then I can replace my locally built image :)

Also, this guideline/template might be interesting to whoever is in charge of your shell script.

byteSamurai avatar Feb 04 '22 15:02 byteSamurai