server
Container enters crash loop when machine restarts
When the machine running the wandb Docker container restarts, the container enters a crash loop. The only way to solve the problem is to remove the container and create it again.
Image id: 9b661d2d9510
*** Killing all processes...
*** Running /etc/my_init.d/00_regen_ssh_host_keys.sh...
*** Running /etc/my_init.d/01_enable-services.sh...
*** Copying services to runit
mv: cannot stat '/home/wandb/service/*': No such file or directory
mv: cannot stat '/home/wandb/wandb-logrotate': No such file or directory
*** Copying jobber template
*** Enabling production mode
ln: failed to create symbolic link '/etc/nginx/sites-enabled/wandb.conf': File exists
*** /etc/my_init.d/01_enable-services.sh failed with status 1
*** Killing all processes...
*** Running /etc/my_init.d/00_regen_ssh_host_keys.sh...
*** Running /etc/my_init.d/01_enable-services.sh...
*** Copying services to runit
mv: cannot stat '/home/wandb/service/*': No such file or directory
mv: cannot stat '/home/wandb/wandb-logrotate': No such file or directory
*** Copying jobber template
*** Enabling production mode
ln: failed to create symbolic link '/etc/nginx/sites-enabled/wandb.conf': File exists
*** /etc/my_init.d/01_enable-services.sh failed with status 1
*** Killing all processes...
*** Running /etc/my_init.d/00_regen_ssh_host_keys.sh...
*** Running /etc/my_init.d/01_enable-services.sh...
*** Copying services to runit
mv: cannot stat '/home/wandb/service/*': No such file or directory
mv: cannot stat '/home/wandb/wandb-logrotate': No such file or directory
*** Copying jobber template
*** Enabling production mode
ln: failed to create symbolic link '/etc/nginx/sites-enabled/wandb.conf': File exists
*** /etc/my_init.d/01_enable-services.sh failed with status 1
*** Killing all processes...
*** Running /etc/my_init.d/00_regen_ssh_host_keys.sh...
*** Running /etc/my_init.d/01_enable-services.sh...
*** Copying services to runit
mv: cannot stat '/home/wandb/service/*': No such file or directory
mv: cannot stat '/home/wandb/wandb-logrotate': No such file or directory
*** Copying jobber template
*** Enabling production mode
ln: failed to create symbolic link '/etc/nginx/sites-enabled/wandb.conf': File exists
*** /etc/my_init.d/01_enable-services.sh failed with status 1
*** Killing all processes...
*** Running /etc/my_init.d/00_regen_ssh_host_keys.sh...
*** Running /etc/my_init.d/01_enable-services.sh...
*** Copying services to runit
mv: cannot stat '/home/wandb/service/*': No such file or directory
mv: cannot stat '/home/wandb/wandb-logrotate': No such file or directory
*** Copying jobber template
*** Enabling production mode
ln: failed to create symbolic link '/etc/nginx/sites-enabled/wandb.conf': File exists
*** /etc/my_init.d/01_enable-services.sh failed with status 1
*** Killing all processes...
Yeah, this is a known issue with the way Docker handles restarts. You should be able to run the container as a system-level service, which should handle hard restarts. This blog has some good examples.
Hey @vanpelt,
I looked into the Docker image because I was not able to find the corresponding repository for it. I found an init script at /etc/my_init.d/01_enable-services.sh, which contains this:
#!/bin/bash
# move all services to runit, was tricky to make this happen in docker without
# overwriting cron / sshd
echo "*** Copying services to runit"
mv /home/wandb/service/* /etc/service/
mv /home/wandb/wandb-logrotate /etc/logrotate.d/
I think I understand the setup process in /sbin/my_init, but I see no reason to move the files instead of copying them recursively. Copying would make that single step more resilient, wouldn't it?
Additionally, at the end of the file:
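To illustrate the copy-based idea, here is a minimal sketch that runs against throwaway directories rather than the real image paths. The `WORK`, `SRC`, and `DST` variables, the `nginx/run` file, and the `copy_services` function are hypothetical stand-ins for /home/wandb/service and /etc/service; this is not the actual init script.

```shell
#!/bin/bash
set -u

# Hypothetical stand-ins for /home/wandb/service and /etc/service.
WORK="$(mktemp -d)"
trap 'rm -rf "$WORK"' EXIT
SRC="$WORK/home/wandb/service"
DST="$WORK/etc/service"
mkdir -p "$SRC/nginx" "$DST"
echo '#!/bin/sh' > "$SRC/nginx/run"

# Copy instead of move: the source stays in place, so the same step
# can run again on the next container start without "cannot stat".
copy_services() {
  cp -a "$SRC"/. "$DST"/
}

copy_services && echo "first start: ok"
copy_services && echo "restart: ok"
```

With `mv`, the second call would find an empty source directory and fail, which matches the `cannot stat` lines in the log above.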
if [[ ! -z "${LOCAL_DEV}" ]]; then
  echo "*** Enabling development mode"
  touch /etc/service/gorilla/down
  ln -s /etc/nginx/sites-available/wandb-dev.conf /etc/nginx/sites-enabled/wandb.conf
else
  echo "*** Enabling production mode"
  ln -s /etc/nginx/sites-available/wandb-prod.conf /etc/nginx/sites-enabled/wandb.conf
fi
Is there a reason not to force the override, like this:
ln -sf /etc/nginx/sites-available/wandb-prod.conf /etc/nginx/sites-enabled/wandb.conf
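The effect of `-f` can be checked in isolation with a small sketch. It uses a temporary directory in place of /etc/nginx, and the `WORK` variable and `enable_prod` function are hypothetical names introduced only for this example.

```shell
#!/bin/bash
set -u

# Temporary tree standing in for /etc/nginx (assumption for this sketch).
WORK="$(mktemp -d)"
trap 'rm -rf "$WORK"' EXIT
mkdir -p "$WORK/sites-available" "$WORK/sites-enabled"
touch "$WORK/sites-available/wandb-prod.conf"

# -f replaces an existing link instead of failing with "File exists".
enable_prod() {
  ln -sf "$WORK/sites-available/wandb-prod.conf" "$WORK/sites-enabled/wandb.conf"
}

enable_prod && echo "first start: ok"
enable_prod && echo "restart: ok"
readlink "$WORK/sites-enabled/wandb.conf"
```

Without `-f`, the second call would exit non-zero, which is what makes the init script fail with status 1 on restart.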
Hey @opsxcq, both are great suggestions. We'll look into implementing these in a future release.
@vanpelt If you tell me the repository, I would be happy to send you a tested PR.
Thanks @byteSamurai, the repo containing the source is currently private. I created a gist with the init scripts. I can apply the diff and be sure to mention you in the release notes!
@vanpelt sorry for the delay. I changed the discussed lines here. How I tested:
- Created a local image, overwriting only /etc/my_init.d/01_enable-services.sh
- Started and restarted the container several times. This works, so I consider it fixed.
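A restart test along these lines could be scripted roughly as follows. This is only a sketch: the `check_restarts` function, the `SETTLE_SECONDS` variable, and the container name in the usage comment are hypothetical, and checking `State.Running` and `State.Restarting` is a rough signal rather than proof that every init script succeeded.

```shell
#!/bin/bash
# Restart a container several times and confirm it stays up each time.
# Usage (hypothetical container name): ./check_restarts.sh wandb-local 5
check_restarts() {
  local container="$1" rounds="${2:-3}" i state
  for ((i = 1; i <= rounds; i++)); do
    docker restart "$container" > /dev/null || return 1
    sleep "${SETTLE_SECONDS:-5}"   # give the init scripts time to run
    state="$(docker inspect --format '{{.State.Running}} {{.State.Restarting}}' "$container")"
    if [[ "$state" != "true false" ]]; then
      echo "round $i: container is not running cleanly ($state)" >&2
      return 1
    fi
    echo "round $i: ok"
  done
}

# Only run when a container name is passed on the command line.
if [[ $# -ge 1 ]]; then
  check_restarts "$@"
fi
```

If a round fails, `docker logs` on the container should show which init script exited with a non-zero status.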
What do you think?
Awesome! We'll get this into the next release.
Cool, could you let me know when this is the case, @vanpelt? Then I can replace my locally built image :)
Also, this guideline/template might be interesting to whoever is in charge of your shell script.