Docker container fails to restart
Summary
The cyberark/conjur docker container does not restart gracefully. It leaves a stale pidfile behind, and then refuses to start.
Steps to Reproduce
- Follow the quickstart setup instructions.
- Restart the server container:
docker restart conjur_server
(Restarting the docker host machine is also sufficient to reproduce the problem.)
Expected Results
The Conjur server accepts connections and serves requests after the restart.
Actual Results (including error logs, if applicable)
Server did not restart properly. Clients get "connection refused" attempting to contact the server.
docker logs conjur_server contains an error message saying that the PID file already exists.
This log contains output from both the first run and the second (failed) one:
authn-local is listening at /run/authn-local/.socket
=> Booting Puma
=> Rails 5.2.6 application starting in production
=> Run `rails server -h` for more startup options
[19] Puma starting in cluster mode...
[19] * Puma version: 5.3.2 (ruby 2.5.8-p224) ("Sweetnighter")
[19] * Min threads: 5
[19] * Max threads: 5
[19] * Environment: development
[19] * Master PID: 19
[19] * Workers: 2
[19] * Restarts: (✔) hot (✖) phased
[19] * Preloading application
[19] * Listening on http://0.0.0.0:80
[19] Use Ctrl-C to stop
CONJ00038I OpenSSL FIPS mode set to true
Loaded configuration:
- trusted_proxies from defaults
- authenticators from defaults
[19] - Worker 0 (PID: 24) booted in 0.0s, phase: 0
Loaded configuration:
- trusted_proxies from defaults
- authenticators from defaults
[19] - Worker 1 (PID: 28) booted in 0.0s, phase: 0
error: SIGTERM
A server is already running. Check /opt/conjur-server/tmp/pids/server.pid.
=> Booting Puma
=> Rails 5.2.6 application starting in production
=> Run `rails server -h` for more startup options
Exiting
authn-local is listening at /run/authn-local/.socket
Search the above for server.pid.
Reproducible
I don't know if it's 100%, but it occurs at least 50% of the time for me. It has happened regularly for the past year or more, whenever system updates on the docker host machine require a reboot.
Version/Tag number
Latest. Currently failing on docker image sha256:3f552a4b683b064e45265ba875f6fcc797170a8a3f93ff90e81e5f9df337682e, tagged as 1.13.1.
Environment setup
This happens in the environment set up by following the quickstart instructions without any modifications.
Docker version 20.10.7, build 20.10.7-0ubuntu1~20.04.2
With minor changes to the docker-compose.yml file (just adding "docker://" prefixes), I also see the same problem with podman-compose.
podman version 3.0.1
Additional Information
Once the stale pidfile is present, the server will NEVER restart until it is removed. It can be removed as follows:
docker exec conjur_server rm /opt/conjur-server/tmp/pids/server.pid; docker restart conjur_server
When the server is in the bad state, docker top conjur_server shows fewer processes running.
Good:
USER PID PPID %CPU ELAPSED TTY TIME COMMAND
root 1 0 0.000 1m4.601003137s ? 0s ruby /usr/local/bin/conjurctl server
root 10 1 0.000 1m1.601150286s ? 0s sh -c
rails server -p '80' -b '0.0.0.0'
root 13 1 1.623 1m1.601969553s ? 1s ruby /var/lib/ruby/bin/rake authn_local:run
root 16 1 3.247 1m1.602483189s ? 2s ruby /var/lib/ruby/bin/rake expiration:watch
root 19 10 3.247 1m1.603541028s ? 2s puma 5.3.2 (tcp://0.0.0.0:80) [Conjur API Server]
root 24 19 0.000 59.603698347s ? 0s puma: cluster worker 0: 19 [Conjur API Server]
root 28 19 0.000 59.603843814s ? 0s puma: cluster worker 1: 19 [Conjur API Server]
Bad:
USER PID PPID %CPU ELAPSED TTY TIME COMMAND
root 1 0 0.000 6m45.406072271s ? 0s ruby /usr/local/bin/conjurctl server
root 13 1 0.248 6m43.406187601s ? 1s ruby /var/lib/ruby/bin/rake authn_local:run
root 16 1 0.496 6m43.406293512s ? 2s ruby /var/lib/ruby/bin/rake expiration:watch
I think that the docker init script should clean up stale PID files. Alternatively, before refusing to start, the server process could check whether the PID in the file belongs to a running process other than itself.
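The second suggestion can be sketched as a small startup helper. This is a minimal sketch, not Conjur's actual code; the function name is illustrative, and the pidfile path is taken from the error logs above:

```shell
#!/bin/sh
# Hypothetical helper: remove a pidfile only when the recorded PID no longer
# maps to a live process. `kill -0` probes a PID without sending a signal;
# it fails if no such process exists.
clean_stale_pidfile() {
    pidfile=$1
    [ -f "$pidfile" ] || return 0           # no pidfile: nothing to do
    pid=$(cat "$pidfile")
    if kill -0 "$pid" 2>/dev/null; then
        echo "PID $pid is still running; leaving $pidfile alone" >&2
        return 1
    fi
    rm -f "$pidfile"                        # stale: safe to remove
}

# Demo: a pidfile pointing at a PID that is almost certainly not running.
tmp=$(mktemp)
echo 99999999 > "$tmp"
clean_stale_pidfile "$tmp" && echo "stale pidfile removed"
```

Run before launching the server, this would let a container restart proceed instead of aborting on the leftover server.pid.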
Thanks for posting this issue @Infinoid. I was able to reproduce it. I happened to have the v1.11.6 image and noticed that it gracefully handles those stale PID files, whereas v1.13.1 does not. I looked at the diff between v1.11.6 and v1.13.1; nothing seems out of the ordinary.
Right now it's not clear what's causing this behavior, and so this will likely require further investigation.
A quick fix is to comment out the command and specify an entrypoint that cleans up the PID file on the conjur service in your docker-compose.yml:
# command: server
entrypoint: ["sh", "-c", "rm -f /opt/conjur-server/tmp/pids/server.pid; conjurctl server"]
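Applied to the quickstart's docker-compose.yml, the conjur service would look roughly like this (a sketch only; the image tag and your other service keys stay as in your existing file):

```yaml
conjur:
  image: cyberark/conjur:1.13.1
  # command: server            # disabled in favor of the entrypoint below
  entrypoint: ["sh", "-c", "rm -f /opt/conjur-server/tmp/pids/server.pid; conjurctl server"]
```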