Docker container fails to restart
Summary
The cyberark/conjur docker container does not restart gracefully. It leaves a stale pidfile behind, and then refuses to start.
Steps to Reproduce
- Follow the quickstart setup instructions.
- Restart the server container:
docker restart conjur_server
(Restarting the docker host machine is also sufficient to reproduce the problem.)
Expected Results
The Conjur server accepts connections and serves requests after the restart.
Actual Results (including error logs, if applicable)
Server did not restart properly. Clients get "connection refused" attempting to contact the server.
docker logs conjur_server contains an error message saying that the PID file already exists.
This log contains output from both the first run and the second (failed) one:
authn-local is listening at /run/authn-local/.socket
=> Booting Puma
=> Rails 5.2.6 application starting in production
=> Run `rails server -h` for more startup options
[19] Puma starting in cluster mode...
[19] * Puma version: 5.3.2 (ruby 2.5.8-p224) ("Sweetnighter")
[19] * Min threads: 5
[19] * Max threads: 5
[19] * Environment: development
[19] * Master PID: 19
[19] * Workers: 2
[19] * Restarts: (✔) hot (✖) phased
[19] * Preloading application
[19] * Listening on http://0.0.0.0:80
[19] Use Ctrl-C to stop
CONJ00038I OpenSSL FIPS mode set to true
Loaded configuration:
- trusted_proxies from defaults
- authenticators from defaults
[19] - Worker 0 (PID: 24) booted in 0.0s, phase: 0
Loaded configuration:
- trusted_proxies from defaults
- authenticators from defaults
[19] - Worker 1 (PID: 28) booted in 0.0s, phase: 0
error: SIGTERM
A server is already running. Check /opt/conjur-server/tmp/pids/server.pid.
=> Booting Puma
=> Rails 5.2.6 application starting in production
=> Run `rails server -h` for more startup options
Exiting
authn-local is listening at /run/authn-local/.socket
Search the above for server.pid.
Reproducible
I don't know if it's 100%, but it occurs at least 50% of the time for me. It has happened regularly for the past year or more, whenever system updates on the docker host machine require a reboot.
Version/Tag number
Latest. Currently failing on docker image sha256:3f552a4b683b064e45265ba875f6fcc797170a8a3f93ff90e81e5f9df337682e, tagged as 1.13.1.
Environment setup
This happens in the environment set up by following the quickstart instructions without any modifications.
Docker version 20.10.7, build 20.10.7-0ubuntu1~20.04.2
With minor changes to the docker-compose.yml file (just adding "docker://" prefixes), I also see the same problem with podman-compose.
podman version 3.0.1
Additional Information
Once the stale pidfile is present, the server will NEVER restart until it is removed. It can be removed as follows:
docker exec conjur_server rm /opt/conjur-server/tmp/pids/server.pid; docker restart conjur_server
When the server is in the bad state, docker top conjur_server shows fewer processes running.
Good:
USER PID PPID %CPU ELAPSED TTY TIME COMMAND
root 1 0 0.000 1m4.601003137s ? 0s ruby /usr/local/bin/conjurctl server
root 10 1 0.000 1m1.601150286s ? 0s sh -c
rails server -p '80' -b '0.0.0.0'
root 13 1 1.623 1m1.601969553s ? 1s ruby /var/lib/ruby/bin/rake authn_local:run
root 16 1 3.247 1m1.602483189s ? 2s ruby /var/lib/ruby/bin/rake expiration:watch
root 19 10 3.247 1m1.603541028s ? 2s puma 5.3.2 (tcp://0.0.0.0:80) [Conjur API Server]
root 24 19 0.000 59.603698347s ? 0s puma: cluster worker 0: 19 [Conjur API Server]
root 28 19 0.000 59.603843814s ? 0s puma: cluster worker 1: 19 [Conjur API Server]
Bad:
USER PID PPID %CPU ELAPSED TTY TIME COMMAND
root 1 0 0.000 6m45.406072271s ? 0s ruby /usr/local/bin/conjurctl server
root 13 1 0.248 6m43.406187601s ? 1s ruby /var/lib/ruby/bin/rake authn_local:run
root 16 1 0.496 6m43.406293512s ? 2s ruby /var/lib/ruby/bin/rake expiration:watch
I think that the docker init script should clean up stale PID files. Alternatively, before refusing to start, the server process could check whether the PID in the file belongs to a running process other than itself.
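The second suggestion can be sketched as a small startup helper. This is a minimal sketch, not Conjur's actual code; the function name is illustrative, and the pidfile path is taken from the error logs above:

```shell
#!/bin/sh
# Hypothetical helper: remove a pidfile only when the recorded PID no longer
# maps to a live process. `kill -0` probes a PID without sending a signal;
# it fails if no such process exists.
clean_stale_pidfile() {
    pidfile=$1
    [ -f "$pidfile" ] || return 0           # no pidfile: nothing to do
    pid=$(cat "$pidfile")
    if kill -0 "$pid" 2>/dev/null; then
        echo "PID $pid is still running; leaving $pidfile alone" >&2
        return 1
    fi
    rm -f "$pidfile"                        # stale: safe to remove
}

# Demo: a pidfile pointing at a PID that is almost certainly not running.
tmp=$(mktemp)
echo 99999999 > "$tmp"
clean_stale_pidfile "$tmp" && echo "stale pidfile removed"
```

Run before launching the server, this would let a container restart proceed instead of aborting on the leftover server.pid.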
Thanks for posting this issue @Infinoid. I was able to reproduce it. I happened to have the v1.11.6 image and noticed that it gracefully handles those stale PID files, whereas v1.13.1 does not. I looked at the diff between v1.11.6 and v1.13.1; nothing seems out of the ordinary.
Right now it's not clear what's causing this behavior, and so this will likely require further investigation.
A quick fix is to comment out the command and specify an entrypoint that cleans up the PID file on the conjur service in your docker-compose.yml:
# command: server
entrypoint: ["sh", "-c", "rm -f /opt/conjur-server/tmp/pids/server.pid; conjurctl server"]
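Applied to the quickstart's docker-compose.yml, the conjur service would look roughly like this (a sketch only; the image tag and your other service keys stay as in your existing file):

```yaml
conjur:
  image: cyberark/conjur:1.13.1
  # command: server            # disabled in favor of the entrypoint below
  entrypoint: ["sh", "-c", "rm -f /opt/conjur-server/tmp/pids/server.pid; conjurctl server"]
```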