Consul fails to sign new Connect certs after some time

Open karelorigin opened this issue 2 years ago • 2 comments

Overview of the Issue

A while ago, I set up Consul's intermediate CA by following the tutorial here: https://developer.hashicorp.com/consul/tutorials/vault-secure/vault-pki-consul-connect-ca. Additionally, I'm using the Nomad connect stanza to set up a service mesh. After about 3 days, however, my Nomad deployments fail and Consul keeps logging 403 errors from Vault that I cannot figure out.

As far as I can tell, the Vault policy for Consul has all the appropriate permissions. The local certificates used for authentication are renewed by Vault agent.

I've come up with a few possible explanations for the issue:

  • Consul sends an old token to Vault (doesn't always renew the in-memory token properly).
  • Intermediate rotation is causing issues somehow (not sure how it's related but the ~72h interval is suspicious).
  • Consul was running with two server nodes, which won't guarantee any fault tolerance, but might cause inconsistencies(?)

Consul errors:

Jan 28 03:30:32 tf-srv-quirky-bassi consul[278885]: 2023-01-28T03:30:32.994Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=connect-ca-leaf
Jan 28 03:30:32 tf-srv-quirky-bassi consul[278885]:   error=
Jan 28 03:30:32 tf-srv-quirky-bassi consul[278885]:   | rpc error making call: error issuing cert: Error making API request.
Jan 28 03:30:32 tf-srv-quirky-bassi consul[278885]:   |
Jan 28 03:30:32 tf-srv-quirky-bassi consul[278885]:   | URL: PUT https://vault.service.consul:8200/v1/connect-intermediate-dc1/sign/leaf-cert
Jan 28 03:30:32 tf-srv-quirky-bassi consul[278885]:   | Code: 403. Errors:
Jan 28 03:30:32 tf-srv-quirky-bassi consul[278885]:   |
Jan 28 03:30:32 tf-srv-quirky-bassi consul[278885]:   | * permission denied
Jan 28 03:30:32 tf-srv-quirky-bassi consul[278885]:    index=425043
Jan 28 03:42:12 tf-srv-quirky-bassi consul[278885]: 2023-01-28T03:42:12.173Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=connect-ca-leaf
Jan 28 03:42:12 tf-srv-quirky-bassi consul[278885]:   error=
Jan 28 03:42:12 tf-srv-quirky-bassi consul[278885]:   | rpc error making call: error issuing cert: Error making API request.
Jan 28 03:42:12 tf-srv-quirky-bassi consul[278885]:   |
Jan 28 03:42:12 tf-srv-quirky-bassi consul[278885]:   | URL: PUT https://vault.service.consul:8200/v1/connect-intermediate-dc1/sign/leaf-cert
Jan 28 03:42:12 tf-srv-quirky-bassi consul[278885]:   | Code: 403. Errors:
Jan 28 03:42:12 tf-srv-quirky-bassi consul[278885]:   |
Jan 28 03:42:12 tf-srv-quirky-bassi consul[278885]:   | * permission denied
Jan 28 03:42:12 tf-srv-quirky-bassi consul[278885]:    index=425043

Consul Vault provider config:

connect {
    enabled     = true
    ca_provider = "vault"

    ca_config {
      address               = "https://vault.service.consul:8200"
      root_pki_path         = "pki"
      intermediate_pki_path = "connect-intermediate-dc1"
      cert_file             = "/opt/consul/tls/consul.crt.pem"
      key_file              = "/opt/consul/tls/consul.key.pem"

      auth_method {
          type  = "cert"

          params = {
              name = "consul-cluster"
          }
      }
   }
}

Vault consul-cluster policy:

# Allow issuing certificates under the consul-cluster role
path "pki/issue/consul-cluster" {
  capabilities = ["update"]
}

# Allow listing existing PKI mounts
path "/sys/mounts" {
  capabilities = ["read"]
}

# Allow reading configuration of the PKI secrets engine
path "/sys/mounts/pki" {
  capabilities = ["read"]
}

# Allow reading configuration of the Connect intermediate
path "/sys/mounts/connect-intermediate-dc1" {
  capabilities = ["read"]
}

# Allow tuning configuration of the Connect intermediate
path "/sys/mounts/connect-intermediate-dc1/tune" {
  capabilities = ["update"]
}

# Allow basic interaction with PKI secrets engine
path "/pki/" {
  capabilities = ["read"]
}

# Allow signing intermediates
path "/pki/root/sign-intermediate" {
  capabilities = ["update"]
}

# Allow all/any interactions with the Connect intermediate
path "/connect-intermediate-dc1/*" {
  capabilities = ["create", "read", "update", "delete", "list"]
}

# Allow the renewal of own token
path "auth/token/renew-self" {
  capabilities = ["update"]
}

# Allow looking up own token
path "auth/token/lookup-self" {
  capabilities = ["read"]
}

Reproduction Steps

Steps to reproduce this issue, eg:

  1. Create a cluster with n client nodes n and n server nodes
  2. Run curl ...
  3. View error

Consul info for both Client and Server

Client info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease = 
	revision = bd257019
	version = 1.14.3
	version_metadata = 
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 10.18.248.75:8300
	server = true
raft:
	applied_index = 518042
	commit_index = 518042
	fsm_pending = 0
	last_contact = 0
	last_log_index = 518042
	last_log_term = 11
	last_snapshot_index = 508031
	last_snapshot_term = 11
	latest_configuration = [{Suffrage:Voter ID:8417e1d8-97d4-ab5e-c5c3-7a092a75b29f Address:10.18.30.85:8300} {Suffrage:Voter ID:e08569e7-2d8c-dfc2-9a43-95d5d4049a0f Address:10.18.248.75:8300}]
	latest_configuration_index = 0
	num_peers = 1
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 11
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 423
	max_procs = 2
	os = linux
	version = go1.19.4
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 10
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 34956
	members = 10
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 4
	members = 2
	query_queue = 0
	query_time = 1
Server info

Nomad server: v1.4.3

Operating system and Environment details

Linux 5.4.0-132-generic #148-Ubuntu SMP Mon Oct 17 16:02:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

karelorigin • Jan 30 '23 13:01

Hi @karelorigin,

Your Vault policy looks correct to me. It follows the suggestions in the documentation for "Vault managed PKI paths".
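As a rough cross-check of that reading, here is a minimal Python sketch that tests whether the failing request path is covered by the posted policy. It is a simplified model, not Vault's actual ACL engine: it assumes that a leading slash on a policy path is ignored, that a trailing `*` acts as a prefix glob, and that the PUT request needs the `update` capability. The names `policy_covers`, `allowed`, and `POLICY` are hypothetical, and only a subset of the policy is copied in.

```python
# Simplified model of Vault policy path matching (an assumption, not Vault's code):
# leading slashes are ignored and a trailing '*' is treated as a prefix glob.
def policy_covers(policy_path: str, request_path: str) -> bool:
    policy = policy_path.lstrip("/")
    request = request_path.lstrip("/")
    if policy.endswith("*"):
        return request.startswith(policy[:-1])
    return request == policy


# Subset of the consul-cluster policy from the issue: path -> capabilities.
POLICY = {
    "/connect-intermediate-dc1/*": ["create", "read", "update", "delete", "list"],
    "/pki/root/sign-intermediate": ["update"],
    "auth/token/renew-self": ["update"],
    "auth/token/lookup-self": ["read"],
}


def allowed(request_path: str, capability: str) -> bool:
    """Return True if any policy entry covers the path and grants the capability."""
    return any(
        policy_covers(path, request_path) and capability in caps
        for path, caps in POLICY.items()
    )


# Path from the 403 in the Consul log, with the /v1/ API prefix removed.
failing_path = "connect-intermediate-dc1/sign/leaf-cert"
print(allowed(failing_path, "update"))
```

Under these assumptions the failing path is covered, which is consistent with the 403 coming from the token being used rather than from a gap in the policy text.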

You mentioned that things start to break after 72 hours. Consul service mesh leaf certificates have a 72-hour TTL by default. The error message involving the Vault API path /v1/connect-intermediate-dc1/sign/leaf-cert suggests that Consul is attempting to generate a new leaf certificate to replace one approaching its 72-hour expiry, but that operation is failing. After 72 hours, the leaf certificate expires without having been replaced.
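The timing can be sketched with a little date arithmetic. The Python snippet below is illustrative only: the issuance time is a made-up value (the thread does not say when the leaf certificate was issued), while the 72-hour TTL and the timestamp of the first logged 403 come from the thread. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Default leaf certificate TTL mentioned in the thread.
LEAF_TTL = timedelta(hours=72)


def leaf_expiry(issued_at: datetime, ttl: timedelta = LEAF_TTL) -> datetime:
    """Time at which a leaf certificate issued at `issued_at` stops being valid."""
    return issued_at + ttl


def hours_left(issued_at: datetime, now: datetime, ttl: timedelta = LEAF_TTL) -> float:
    """Remaining validity in hours; negative once the certificate has expired."""
    return (leaf_expiry(issued_at, ttl) - now).total_seconds() / 3600


# Hypothetical issuance time, chosen only for illustration.
issued = datetime(2023, 1, 25, 12, 0, tzinfo=timezone.utc)
# Timestamp of the first 403 in the Consul log above.
first_error = datetime(2023, 1, 28, 3, 30, 32, tzinfo=timezone.utc)

print("expires at:", leaf_expiry(issued).isoformat())
print("hours left at first error:", round(hours_left(issued, first_error), 1))
```

If every renewal attempt in the remaining window fails with a 403, the service is left with an expired certificate once `hours_left` reaches zero, which matches the roughly 3-day failure interval reported.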

Are you observing any failed Vault API calls to auth/token/renew-self in the logs? Or to auth/token/lookup-self?

jkirschner-hashicorp • Jan 30 '23 17:01

Hey @jkirschner-hashicorp,

Thanks for looking into this so quickly! :D I've grepped through the logs for anything Vault-related and didn't find any errors for /renew-self or /lookup-self. The only other error, which I hadn't spotted before, is a connect.ca.vault login error that seems to occur once a day.

Jan 29 13:22:02 tf-srv-xenodochial-wing consul[128357]: 2023-01-29T13:22:02.030Z [ERROR] connect.ca.vault: Error login in to Vault with %q auth method: EXTRA_VALUE_AT_END=cert
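For anyone repeating this kind of log search, here is a minimal Python sketch of the filter. It assumes the Consul log lines are already available as strings (for example, saved to a file from the system journal). The search terms are taken from this thread, and the function name `vault_related` is hypothetical.

```python
import re

# Terms discussed in this thread: token renewal/lookup endpoints,
# the Vault CA provider logger name, and the 403 message text.
VAULT_PATTERNS = re.compile(
    r"renew-self|lookup-self|connect\.ca\.vault|permission denied"
)


def vault_related(lines):
    """Return only the log lines that mention one of the Vault-related terms."""
    return [line for line in lines if VAULT_PATTERNS.search(line)]


# Two lines copied from the logs in this issue, plus one unrelated, made-up line.
sample = [
    "Jan 28 03:30:32 tf-srv-quirky-bassi consul[278885]:   | * permission denied",
    "Jan 29 13:22:02 tf-srv-xenodochial-wing consul[128357]: 2023-01-29T13:22:02.030Z "
    "[ERROR] connect.ca.vault: Error login in to Vault with %q auth method: "
    "EXTRA_VALUE_AT_END=cert",
    "Jan 29 13:25:00 example-host consul[1]: [INFO] agent: unrelated message",
]

for line in vault_related(sample):
    print(line)
```

Matching on the logger name `connect.ca.vault` in addition to the endpoint names is what surfaces the daily login error, since that line mentions neither renew-self nor lookup-self.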

That could explain the problem. I suspect that it might be related to https://github.com/hashicorp/vault/issues/18562, which is a Vault bug I recently discovered and reported. It would make sense for Consul to inherit the issue, since it relies on Vault libraries.

Note: The authentication certs are valid for 24 hours and renewed twice a day. Certificate validity is not the issue.

karelorigin • Jan 30 '23 19:01

It looks like that Vault fix is intended to be released in Vault 1.14.0: https://github.com/hashicorp/vault/pull/19002#issuecomment-1479733111

jkirschner-hashicorp • Apr 18 '23 15:04

Closing as resolved by hashicorp/vault#19002

david-yu • Jul 28 '23 04:07