
Health endpoint does not flag an unhealthy Controller

Open g-psantos opened this issue 2 years ago • 1 comment

Describe the bug The /health endpoint returns status 200 while the Controller is returning 500 errors to users. Our setup is described in "Additional context" below.

To Reproduce Steps to reproduce the behavior:

  1. Configure the Controller with the Vault Transit KMS.
  2. Configure the Vault Transit backend and create two Vault Transit keys.
  3. Configure a Vault authentication backend and an authentication role for the Boundary Controller.
     a. Grant the role permissions to the Transit keys created in (2).
     b. For quick testing purposes, set token_max_ttl=5m and token_ttl=3m.
  4. Configure the Controller with something along the lines of the configuration file below.
  5. Run the Controller with a Vault token issued by authenticating the role created in (3): VAULT_TOKEN=[your token] boundary server -config=controller.hcl.
  6. After five minutes, the Vault token will reach its maximum TTL. The 127.0.0.1:9203/health endpoint will indicate that all is well, but you will be unable to access the Controller UI (unable to retrieve authentication methods).
# Sample configuration for the controller
controller {
  name                = "env://HOSTNAME"
  public_cluster_addr = "127.0.0.1:9201"
  database {
    url = "env://POSTGRES_CONNECTION_URL"
  }
}

listener "tcp" {
  purpose = "api"
  address = "127.0.0.1:9200"
  tls_disable = true
}

listener "tcp" {
  purpose = "cluster"
  address = "127.0.0.1:9201"
}

listener "tcp" {
  purpose = "ops"
  address = "127.0.0.1:9203"
  tls_disable = true
}

kms "transit" {
  purpose     = "root"
  mount_path  = "transit"
  key_name    = "boundary-root"
  # You may need to set TLS configuration variables
}

kms "transit" {
  purpose     = "worker-auth"
  mount_path  = "transit"
  key_name    = "boundary-worker-auth"
  # You may need to set TLS configuration variables
}

Expected behavior The /health endpoint should return an HTTP status that indicates the Controller is in an unhealthy state when users are unable to interact with the Controller.

Additional context We use the Vault Transit KMS backend for root and worker-auth. Initial authentication to Vault is handled by the Vault Agent injector, which writes the Vault token to a file accessible by the Boundary Controller container (both Controller and Vault are running in Kubernetes). The Controller process is then launched with the initial token in an environment variable -- VAULT_TOKEN=$(cat [file]) boundary ....

Boundary takes care of renewing its Vault token until the token reaches its maximum TTL. At that point, a new Vault token has to be obtained by reauthenticating to Vault. While Vault Agent can handle that aspect, Boundary never picks up on the new token, as the environment variable is set when the Boundary process starts.
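
To illustrate why renewal stops helping, here is a minimal sketch of the renewal arithmetic, using the test values from the reproduction steps (token_ttl=3m, token_max_ttl=5m). It assumes the usual Vault behavior for non-periodic tokens, where each renewal extends the token by its TTL but never past the maximum TTL counted from issuance. The function names and the two-thirds renewal point are illustrative assumptions, not taken from Boundary's code.

```python
def expiry_after_renewal(issued_at, now, ttl, max_ttl):
    """Expiry time after renewing at `now`: extended by `ttl`,
    but capped at `issued_at + max_ttl`."""
    return min(now + ttl, issued_at + max_ttl)


def final_expiry(ttl, max_ttl, renew_fraction=2 / 3):
    """Simulate repeated renewals and return the time (in seconds after
    issuance) at which the token can no longer be extended.

    `renew_fraction` is a hypothetical choice: renew once that share of
    the remaining lifetime has elapsed.
    """
    issued_at = 0.0
    now = 0.0
    hard_limit = issued_at + max_ttl
    expiry = min(ttl, hard_limit)
    while expiry < hard_limit:
        # Wait until the renewal point, then renew.
        now += (expiry - now) * renew_fraction
        expiry = expiry_after_renewal(issued_at, now, ttl, max_ttl)
    return expiry


# token_ttl=3m, token_max_ttl=5m, expressed in seconds
print(final_expiry(ttl=180, max_ttl=300))
```

However often the token is renewed, the result is capped at five minutes after issuance, which matches step 6 of the reproduction.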

Ideally, the /health endpoint would at that stage indicate that "not all is well" with the Controller and Kubernetes would take care of restarting the deployment. Alas, the endpoint returns a status of 200 even though users cannot interact with Boundary at all (or at least not authenticate to it).
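
Until the endpoint reflects this state, an external probe could combine /health with a request that exercises the user-facing API. The following is a rough sketch, not an official Boundary mechanism. It assumes the listener addresses from the sample configuration and that an unauthenticated GET on /v1/auth-methods?scope_id=global (the call behind "unable to retrieve authentication methods") returns 200 on a working Controller; verify that against your own deployment. The helper names are hypothetical.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=3.0):
    """Return the HTTP status code for a GET on `url`, or None if the
    request could not be completed at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status code.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def controller_usable(ops_addr, api_addr, fetch=http_status):
    """True only if the ops /health endpoint AND a user-facing API call
    both answer 200. `fetch` is injectable so the logic can be tested
    without a running Controller."""
    health = fetch(f"http://{ops_addr}/health")
    # Assumed path: listing auth methods in the global scope.
    auth = fetch(f"http://{api_addr}/v1/auth-methods?scope_id=global")
    return health == 200 and auth == 200
```

Calling controller_usable("127.0.0.1:9203", "127.0.0.1:9200") from a Kubernetes exec probe and exiting non-zero on False would let the deployment restart in the situation described here, where /health alone keeps returning 200.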

Note that our specific problem could be addressed by having Boundary read Vault tokens from the file directly (see #987 and the upstream hashicorp/vault#11270) or by having Boundary restarted before the maximum TTL is reached (the workaround we'll implement for the time being). However, I still think it is improper for a health check endpoint to return an "OK" status when the application is unusable, so I decided to file this as a bug report nonetheless (there could also be other cases where this happens).

g-psantos Mar 01 '23 15:03

As a follow-up, this is something we're looking into & triaging internally. @g-psantos we'll report back once we have more information.

AdamBouhmad Mar 08 '23 23:03