Health check timeout (container state: unhealthy)

Open strarsis opened this issue 4 years ago • 19 comments

Subject of the issue

The step-ca container health state is first shown as Up (health: starting) and later turns to Up (unhealthy). However, the service runs fine and logs that it is listening, so apparently the health check always fails.

Your environment

  • OS: WSL 2
  • Version: 2 (Windows 10 x64)

Steps to reproduce

docker-compose.yml:

version: '3.7'

services:

  # Smallstep Step CA
  step-ca:
    image: smallstep/step-ca:0.15.6
    restart: always

> docker-compose up -d
> docker-compose ps
Up (health: starting)
# after some time (about one minute)
> docker-compose ps
Up (unhealthy)

Expected behaviour

As the CA service runs correctly, the health check should pass and the container state should become Up (healthy) or similar, not Up (unhealthy). The health check also takes too long: the container stays in (health: starting) for about one minute.

Actual behaviour

The health check takes too long (Up (health: starting)) and then fails after about a minute (Up (unhealthy)).

strarsis avatar Feb 05 '21 05:02 strarsis

@strarsis Thanks for the report.

I have a couple of questions.

  1. Are you able to reach the CA's health check endpoint and get a {"status":"ok"} response? The endpoint is https://ca_host:port/health

  2. I'm not able to reproduce your example docker-compose.yml. The container comes up but the CA doesn't run. How did you initialize your PKI? Do you use a volume mount to store the configuration?

When I try to reproduce this with our Docker tutorial, the container health check works.
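For anyone who wants to answer the first question from a script, here is a minimal sketch that requests the health endpoint and compares the body with the expected {"status":"ok"}. It is not part of step-ca: the function names are made up for illustration, and the URL and root certificate path in the comment are placeholders you would need to replace.

```python
import json
import ssl
import urllib.request


def is_ok(body: bytes) -> bool:
    """Return True if body is a JSON object whose "status" is "ok"."""
    try:
        doc = json.loads(body)
    except ValueError:
        # Covers both invalid JSON and undecodable bytes.
        return False
    return isinstance(doc, dict) and doc.get("status") == "ok"


def probe(url: str, root_cert: str, timeout: float = 5.0) -> bool:
    """GET the health endpoint, trusting only the given root certificate."""
    ctx = ssl.create_default_context(cafile=root_cert)
    with urllib.request.urlopen(url, context=ctx, timeout=timeout) as resp:
        return resp.status == 200 and is_ok(resp.read())


# Hypothetical call; host, port and certificate path are placeholders:
# probe("https://ca_host:port/health", "/home/step/certs/root_ca.crt")
```

If the host name in the URL is not covered by the certificate the CA presents, probe raises a certificate verification error instead of returning False, which is the same class of failure as the x509 errors shown later in this thread.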

tashian avatar Feb 09 '21 18:02 tashian

@tashian: One factor could be that WSL 2 with Docker for Desktop is used.

strarsis avatar Feb 10 '21 10:02 strarsis

@strarsis Please provide more details about the environment and steps to reproduce, so we can test this. Thanks

tashian avatar Feb 10 '21 18:02 tashian

I am using WSL 2 with Docker for Desktop and getting the same issue. Can you please provide info on the factor you mentioned above around using this?

Pete-PlaytimeSolutions avatar May 28 '21 01:05 Pete-PlaytimeSolutions

Hi folks, I am having an issue with this container's healthcheck on docker swarm as well. I can reproduce very easily and I think I know what the problem may be (for me, at least).

docker run -it -e STEPDEBUG=1 smallstep/step-ca:0.16.0 sh
~ $ step ca init
✔ (e.g. Smallstep): Deisel█
What DNS names or IP addresses would you like to add to your new CA?
✔ (e.g. ca.smallstep.com[,1.1.1.1,etc.]): ca.diesel.net
What IP and port will your new CA bind to?
✔ (e.g. :443 or 127.0.0.1:4343): :443
What would you like to name the CA's first provisioner?
✔ (e.g. [email protected]): test
Choose a password for your CA keys and first provisioner.
✔ [leave empty and we'll generate one]: █
✔ Password: gj%Dyy-[BuKp#.EP(%vl,!#{`fF4$cH,

Generating root certificate...
all done!

Generating intermediate certificate...
all done!

✔ Root certificate: /home/step/certs/root_ca.crt
✔ Root private key: /home/step/secrets/root_ca_key
✔ Root fingerprint: 2675480ce53fa83431099ddafe152f532ad0a197a6784a1a4641be32969f2578
✔ Intermediate certificate: /home/step/certs/intermediate_ca.crt
✔ Intermediate private key: /home/step/secrets/intermediate_ca_key
✔ Database folder: /home/step/db
✔ Default configuration: /home/step/config/defaults.json
✔ Certificate Authority configuration: /home/step/config/ca.json

Your PKI is ready to go. To generate certificates for individual services see 'step help ca'.

FEEDBACK 😍 🍻
      The step utility is not instrumented for usage statistics. It does not
      phone home. But your feedback is extremely valuable. Any information you
      can provide regarding how you’re using `step` helps. Please send us a
      sentence or two, good or bad: [email protected] or join
      https://github.com/smallstep/certificates/discussions.
~ $ step ca health
Get "https://ca.diesel.net/health": x509: certificate is valid for 9721f7d721878f7496b87c17dcab760d.2868b98699e09c78a80c69bee273ddd8.traefik.default, not ca.diesel.net
client.Health; client GET https://ca.diesel.net/health failed
github.com/smallstep/certificates/errs.Wrapf
        /go/pkg/mod/github.com/smallstep/[email protected]/errs/error.go:122
github.com/smallstep/certificates/ca.(*Client).Health
        /go/pkg/mod/github.com/smallstep/[email protected]/ca/client.go:612
github.com/smallstep/cli/command/ca.healthAction
        /src/command/ca/health.go:79
github.com/urfave/cli.HandleAction
        /go/pkg/mod/github.com/urfave/[email protected]/app.go:526
github.com/urfave/cli.Command.Run
        /go/pkg/mod/github.com/urfave/[email protected]/command.go:174
github.com/urfave/cli.(*App).RunAsSubcommand
        /go/pkg/mod/github.com/urfave/[email protected]/app.go:407
github.com/urfave/cli.Command.startApp
        /go/pkg/mod/github.com/urfave/[email protected]/command.go:373
github.com/urfave/cli.Command.Run
        /go/pkg/mod/github.com/urfave/[email protected]/command.go:102
github.com/urfave/cli.(*App).Run
        /go/pkg/mod/github.com/urfave/[email protected]/app.go:279
main.main
        /src/cmd/step/main.go:98
runtime.main
        /usr/local/go/src/runtime/proc.go:225
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1371

I use Traefik as a reverse proxy in front of step-ca. It is set up as a simple TCP relay and lets the step-ca container handle all of the TLS itself. This was working perfectly in version 0.15.4 (before the healthcheck was added). For those unfamiliar, Traefik simply looks at Docker labels to know how to route to the container and do all of its proxying. Because of this simplicity, it has become a very popular choice in Docker Swarm and Kubernetes stacks.

TLDR; The problem is that Traefik will not set up the routing until the healthcheck passes. However, the healthcheck relies on DNS resolution and any proxy configuration already being set up correctly and working in order to succeed.

Is there some way we can disable the health check for more compatibility with a setup like mine? Another thought I had was to perhaps change the healthcheck to check against localhost instead of the configured domain we give it.

tomdaley92 avatar Jul 26 '21 22:07 tomdaley92

@tomdaley92 the problem with your health check is that the dnsNames in the ca.json should have ca.diesel.net too.

Looking at the error, it looks like the domain 9721f7d721878f7496b87c17dcab760d.2868b98699e09c78a80c69bee273ddd8.traefik.default is the one in the ca.json; either that or Traefik is terminating the TLS instead of passing the TCP through to step-ca.
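To make the naming requirement concrete, here is a rough sketch that checks whether the host in a CA URL appears among the names listed under dnsNames. It is only an exact-match comparison written for illustration; real TLS verification also handles wildcards and IP address entries, and the function name is an assumption, not anything provided by step.

```python
from typing import Iterable
from urllib.parse import urlsplit


def url_host_in_dns_names(ca_url: str, dns_names: Iterable[str]) -> bool:
    """Return True if the host part of ca_url is listed in dns_names."""
    # urlsplit lower-cases the host name and strips the port.
    host = urlsplit(ca_url).hostname
    if host is None:
        return False
    return host in {name.lower() for name in dns_names}
```

With only ca.diesel.net in the list, a health check against https://localhost would fail this check, while adding 127.0.0.1 and localhost to the list would make it pass.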

maraino avatar Jul 26 '21 22:07 maraino

Hi @tomdaley92,

The health check just runs step ca health, which uses the CA url and fingerprint configured in /home/step/config/defaults.json in the container. To get the health check working on your setup, change the CA URL to https://localhost or whatever value will reach the CA directly instead of Traefik. Let me know if this works for you. :D

tashian avatar Jul 26 '21 22:07 tashian

@tomdaley92 the problem with your health check is that the dnsNames in the ca.json should have ca.diesel.net too.

Looking at the error, it looks like the domain 9721f7d721878f7496b87c17dcab760d.2868b98699e09c78a80c69bee273ddd8.traefik.default is the one in the ca.json; either that or Traefik is terminating the TLS instead of passing the TCP through to step-ca.

Right, so this is exactly my point. Since I have a DNS record pointing to the VM that the proxy lives on, Traefik serves its self-signed default certificate, because it cannot do the TCP routing while it views the container as unhealthy. This is the classic "chicken or the egg" problem haha. Again, with the older version without the health check, Traefik picks up the configuration and passes the TCP connection through to the container.

tomdaley92 avatar Jul 26 '21 23:07 tomdaley92

Hi @tomdaley92,

The health check just runs step ca health, which uses the CA url and fingerprint configured in /home/step/config/defaults.json in the container. To get the health check working on your setup, change the CA URL to https://localhost or whatever value will reach the CA directly instead of Traefik. Let me know if this works for you. :D

I will try this, but I would assume the domain I feed step-ca with is needed for it to know what certificate to generate/serve when a client hits https://ca.diesel.net for example. If it generates a certificate for localhost and I come in on ca.diesel.net that's gonna cause problems. Maybe I'm missing something so I'll go ahead and give it a try and thank you for the quick reply as well!

tomdaley92 avatar Jul 26 '21 23:07 tomdaley92

@tashian no luck with setting localhost during step ca init. I even tried adding localhost,ca.diesel.net,127.0.0.1 with no luck either.

Here is my output:

docker run -it -e STEPDEBUG=1 smallstep/step-ca:0.16.0 sh
~ $ step ca init
What would you like to name your new PKI?
✔ (e.g. Smallstep): Diesel
What DNS names or IP addresses would you like to add to your new CA?
✔ (e.g. ca.smallstep.com[,1.1.1.1,etc.]): localhost█
What IP and port will your new CA bind to?
✔ (e.g. :443 or 127.0.0.1:4343): :443█
What would you like to name the CA's first provisioner?
✔ (e.g. [email protected]): [email protected]
Choose a password for your CA keys and first provisioner.
✔ [leave empty and we'll generate one]:
✔ Password: b^MAT=f9<v=c$IMzRBz[!253V/,k;u7C

Generating root certificate...
all done!

Generating intermediate certificate...
all done!

✔ Root certificate: /home/step/certs/root_ca.crt
✔ Root private key: /home/step/secrets/root_ca_key
✔ Root fingerprint: a19b0ce1f59fc67f36e362675f77e8c46687e4bbe9e66d3ca8439b45159b1d07
✔ Intermediate certificate: /home/step/certs/intermediate_ca.crt
✔ Intermediate private key: /home/step/secrets/intermediate_ca_key
✔ Database folder: /home/step/db
✔ Default configuration: /home/step/config/defaults.json
✔ Certificate Authority configuration: /home/step/config/ca.json

Your PKI is ready to go. To generate certificates for individual services see 'step help ca'.

FEEDBACK 😍 🍻
      The step utility is not instrumented for usage statistics. It does not
      phone home. But your feedback is extremely valuable. Any information you
      can provide regarding how you’re using `step` helps. Please send us a
      sentence or two, good or bad: [email protected] or join
      https://github.com/smallstep/certificates/discussions.
~ $
~ $ step ca health
Get "https://localhost/health": dial tcp 127.0.0.1:443: connect: connection refused
client.Health; client GET https://localhost/health failed
github.com/smallstep/certificates/errs.Wrapf
        /go/pkg/mod/github.com/smallstep/[email protected]/errs/error.go:122
github.com/smallstep/certificates/ca.(*Client).Health
        /go/pkg/mod/github.com/smallstep/[email protected]/ca/client.go:612
github.com/smallstep/cli/command/ca.healthAction
        /src/command/ca/health.go:79
github.com/urfave/cli.HandleAction
        /go/pkg/mod/github.com/urfave/[email protected]/app.go:526
github.com/urfave/cli.Command.Run
        /go/pkg/mod/github.com/urfave/[email protected]/command.go:174
github.com/urfave/cli.(*App).RunAsSubcommand
        /go/pkg/mod/github.com/urfave/[email protected]/app.go:407
github.com/urfave/cli.Command.startApp
        /go/pkg/mod/github.com/urfave/[email protected]/command.go:373
github.com/urfave/cli.Command.Run
        /go/pkg/mod/github.com/urfave/[email protected]/command.go:102
github.com/urfave/cli.(*App).Run
        /go/pkg/mod/github.com/urfave/[email protected]/app.go:279
main.main
        /src/cmd/step/main.go:98
runtime.main
        /usr/local/go/src/runtime/proc.go:225
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1371

Again, that's just an ad hoc command for debugging, but my actual stack is on Docker Swarm:

https://github.com/Diesel-Net/step-ca https://github.com/Diesel-Net/traefik

tomdaley92 avatar Jul 26 '21 23:07 tomdaley92

If I replace ca.diesel.net with localhost in defaults.json AFTER step ca init as a sort of override, and then redeploy the service, it is able to resolve localhost to 127.0.0.1 inside the container, but still fails due to a bad certificate, which is what I expected.

tomdaley92 avatar Jul 26 '21 23:07 tomdaley92

Again thanks for the help everyone, don't mean to blow up this thread. Just posting my findings.

It looks like Traefik doesn't have a way to "not require" health checks in order to set up proxy configurations either

https://github.com/traefik/traefik/issues/7732

Looks like my only option might be either to enable TLS termination on Traefik (but Traefik gets its own certificates from step-ca over ACME, so that is another chicken and egg problem) or to build a custom Docker image without the healthcheck.

It would be awesome if the healthcheck were made configurable, but I can see why this might be dismissed pretty quickly.

tomdaley92 avatar Jul 27 '21 00:07 tomdaley92

The name in the health check URL has to match a name in dnsNames in ca.json. So, use ca.diesel.net,127.0.0.1,localhost in ca.json, and then change defaults.json to use localhost (or 127.0.0.1).

When you got connect: connection refused above, it looks like you hadn't yet started the step-ca server.
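The two changes described above can be sketched as a pair of small JSON edits. This is an illustration under assumptions rather than an official procedure: the helper names are invented here, and "ca-url" is assumed to be the key in defaults.json that holds the CA URL, so check your own file before relying on it.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def with_dns_names(ca_cfg: dict, extra: Iterable[str]) -> dict:
    """Return a copy of the ca.json config whose dnsNames also lists extra."""
    names = list(ca_cfg.get("dnsNames", []))
    for name in extra:
        if name not in names:
            names.append(name)
    return {**ca_cfg, "dnsNames": names}


def with_ca_url(defaults_cfg: dict, url: str) -> dict:
    """Return a copy of the defaults.json config pointing at url.

    "ca-url" is an assumed key name; verify it against your defaults.json.
    """
    return {**defaults_cfg, "ca-url": url}


def rewrite(path: Path, change: Callable[[dict], dict]) -> None:
    """Load a JSON file, apply change, and write the result back."""
    cfg = json.loads(path.read_text())
    path.write_text(json.dumps(change(cfg), indent=2) + "\n")


# Example use against the paths printed by `step ca init` in this thread:
# rewrite(Path("/home/step/config/ca.json"),
#         lambda c: with_dns_names(c, ["127.0.0.1", "localhost"]))
# rewrite(Path("/home/step/config/defaults.json"),
#         lambda c: with_ca_url(c, "https://localhost"))
```

After editing the files, the service presumably has to be redeployed, as was done earlier in the thread, so that step-ca picks up the new names. If the CA binds to a port other than 443, the URL would need to include that port.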

tashian avatar Jul 27 '21 00:07 tashian

The name in the health check URL has to match a name in dnsNames in ca.json. So, use ca.diesel.net,127.0.0.1,localhost in ca.json, and then change defaults.json to use localhost (or 127.0.0.1).

When you got connect: connection refused above, it looks like you hadn't yet started the step-ca server.

Ahh I will try that, thanks again. FYI, the command that output connect: connection refused was run inside the container, hence the -it flag on the docker run command. Is there some other step command I'm supposed to run after step ca init in order to "start" the server?

tomdaley92 avatar Jul 27 '21 00:07 tomdaley92

Wewt that was it @tashian I now have a successful healthcheck! Thank you for taking time out of your day to help me with this, really appreciate it 👍

tomdaley92 avatar Jul 27 '21 00:07 tomdaley92

Happy to help. The command to start step-ca in the container is /usr/local/bin/step-ca --password-file $PWDPATH $CONFIGPATH (see the Dockerfile's CMD line)

tashian avatar Jul 27 '21 00:07 tashian

@tomdaley92 Ok I see what's going on with your last output, step-ca is not running.

Looking at your ansible configuration in your github, you're mounting a pre-created configuration, good. So to imitate this using docker run, you need to pre-create the configuration with step ca init, and make sure that the paths in ca.json and defaults.json point to /home/step/* instead of your local path.

Then start the ca with the volume mounted, using the default command and running the health check:

docker run --mount type=bind,source="/tmp/docker",target=/home/step -it -e STEPDEBUG=1 smallstep/step-ca:0.16.0

And in another terminal, exec into the container and run the health check:

$ docker exec -it 10ce907bea0e sh
~ $ ps
PID   USER     TIME  COMMAND
    1 step      0:00 /usr/local/bin/step-ca --password-file /home/step/secrets/password /home/step/config/ca.json
   47 step      0:00 sh
   55 step      0:00 ps
~ $ step ca health
ok

And if you look at the output of step-ca, you will see that the health check (the one in the Dockerfile) is running every 30s.
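The move from (health: starting) to (unhealthy) reported at the top of this thread follows from how Docker counts probe results. Here is a small model of that bookkeeping. It is a sketch, not Docker's code: the function name is made up, no start period is modelled, and retries=3 is Docker's documented default rather than a value confirmed for the step-ca image.

```python
from typing import Iterable, List


def health_states(results: Iterable[bool], retries: int = 3) -> List[str]:
    """Return the container health status after each probe.

    results holds True for a passing probe and False for a failing one.
    """
    status = "starting"
    failures = 0  # length of the current run of consecutive failures
    states = []
    for passed in results:
        if passed:
            status, failures = "healthy", 0
        else:
            failures += 1
            if failures >= retries:
                status = "unhealthy"
        states.append(status)
    return states
```

In this model, a container whose probe never succeeds stays in starting until the configured number of consecutive failures has been reached, so with probes 30 seconds apart the switch to unhealthy only shows up after a noticeable delay.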

maraino avatar Jul 27 '21 01:07 maraino

Follow-up with the recent smallstep/step-ca:0.20.0: after starting the container, its status stays at running (starting) for quite some time and then turns to running (unhealthy).

strarsis avatar Jun 14 '22 23:06 strarsis

I ran into the same DNS resolution problem when using docker swarm. Instead of modifying ca.json and defaults.json, I used the extra_hosts service option to provide DNS resolution in the container of ca.diesel.net to 127.0.0.1. This is my stack compose file. Note that it takes around 30 seconds for the service to finish coming up.

version: '3.4'

networks:
  step:
    external: true

volumes:
  step_home_step:
    external: true

services:
  step:
    extra_hosts:
      - 'ca.diesel.net:127.0.0.1'
    image: smallstep/step-ca:0.20.0
    networks:
      - step
    volumes:
      - source: step_home_step
        target: /home/step
        type: volume
        volume:
          nocopy: true
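For readers unfamiliar with extra_hosts: each entry ends up as a hosts-file mapping inside the container, which is why step ca health then reaches the CA directly without any change to ca.json or defaults.json. The sketch below only illustrates that mapping; the function name is invented, and the exact whitespace Docker writes to /etc/hosts is not something this snippet claims to reproduce.

```python
def hosts_line(entry: str) -> str:
    """Turn a 'hostname:ip' extra_hosts entry into an 'ip hostname' line."""
    # Split on the first colon only, so an IPv6 address stays intact.
    host, _, ip = entry.partition(":")
    if not host or not ip:
        raise ValueError(f"expected 'hostname:ip', got {entry!r}")
    return f"{ip} {host}"
```

For the entry used above, this yields a line that maps ca.diesel.net to 127.0.0.1.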

sunvalleyfoods avatar Jun 28 '22 14:06 sunvalleyfoods

@sunvalleyfoods that's a nice approach! I'm going to convert this into a discussion so people can find it for posterity. @strarsis if you're still encountering this issue, could you please open a new issue and provide some more context about your deployment of step-ca?

tashian avatar Aug 17 '22 22:08 tashian