consul
Consul fails to sign new Connect certs after some time
Overview of the Issue
A while ago, I set up Consul's intermediate CA by following the tutorial here: https://developer.hashicorp.com/consul/tutorials/vault-secure/vault-pki-consul-connect-ca. Additionally, I'm using the Nomad connect stanza to set up a service mesh. However, after ~3 days, my Nomad deployments fail and Consul keeps throwing 403 Vault errors that I cannot figure out.
As far as I can tell, the Vault policy for Consul has all the appropriate permissions. The local certificates used for authentication are renewed by Vault agent.
I've come up with a few possible explanations for the issue:
- Consul sends an old token to Vault (doesn't always renew the in-memory token properly).
- Intermediate rotation is causing issues somehow (not sure how it's related but the ~72h interval is suspicious).
- Consul was running with two server nodes, which won't guarantee any fault tolerance, but might cause inconsistencies(?)
Consul errors:
Jan 28 03:30:32 tf-srv-quirky-bassi consul[278885]: 2023-01-28T03:30:32.994Z [WARN] agent.cache: handling error in Cache.Notify: cache-type=connect-ca-leaf
Jan 28 03:30:32 tf-srv-quirky-bassi consul[278885]: error=
Jan 28 03:30:32 tf-srv-quirky-bassi consul[278885]: | rpc error making call: error issuing cert: Error making API request.
Jan 28 03:30:32 tf-srv-quirky-bassi consul[278885]: |
Jan 28 03:30:32 tf-srv-quirky-bassi consul[278885]: | URL: PUT https://vault.service.consul:8200/v1/connect-intermediate-dc1/sign/leaf-cert
Jan 28 03:30:32 tf-srv-quirky-bassi consul[278885]: | Code: 403. Errors:
Jan 28 03:30:32 tf-srv-quirky-bassi consul[278885]: |
Jan 28 03:30:32 tf-srv-quirky-bassi consul[278885]: | * permission denied
Jan 28 03:30:32 tf-srv-quirky-bassi consul[278885]: index=425043
Jan 28 03:42:12 tf-srv-quirky-bassi consul[278885]: 2023-01-28T03:42:12.173Z [WARN] agent.cache: handling error in Cache.Notify: cache-type=connect-ca-leaf
Jan 28 03:42:12 tf-srv-quirky-bassi consul[278885]: error=
Jan 28 03:42:12 tf-srv-quirky-bassi consul[278885]: | rpc error making call: error issuing cert: Error making API request.
Jan 28 03:42:12 tf-srv-quirky-bassi consul[278885]: |
Jan 28 03:42:12 tf-srv-quirky-bassi consul[278885]: | URL: PUT https://vault.service.consul:8200/v1/connect-intermediate-dc1/sign/leaf-cert
Jan 28 03:42:12 tf-srv-quirky-bassi consul[278885]: | Code: 403. Errors:
Jan 28 03:42:12 tf-srv-quirky-bassi consul[278885]: |
Jan 28 03:42:12 tf-srv-quirky-bassi consul[278885]: | * permission denied
Jan 28 03:42:12 tf-srv-quirky-bassi consul[278885]: index=425043
Consul Vault provider config:
connect {
enabled = true
ca_provider = "vault"
ca_config {
address = "https://vault.service.consul:8200"
root_pki_path = "pki"
intermediate_pki_path = "connect-intermediate-dc1"
cert_file = "/opt/consul/tls/consul.crt.pem"
key_file = "/opt/consul/tls/consul.key.pem"
auth_method {
type = "cert"
params = {
name = "consul-cluster"
}
}
}
}
Vault consul-cluster policy:
# Allow issuing certificates under the consul-cluster role
path "pki/issue/consul-cluster" {
capabilities = ["update"]
}
# Allow listing existing PKI mounts
path "/sys/mounts" {
capabilities = ["read"]
}
# Allow reading configuration of the PKI secrets engine
path "/sys/mounts/pki" {
capabilities = ["read"]
}
# Allow reading configuration of the Connect intermediate
path "/sys/mounts/connect-intermediate-dc1" {
capabilities = ["read"]
}
# Allow tuning configuration of the Connect intermediate
path "/sys/mounts/connect-intermediate-dc1/tune" {
capabilities = ["update"]
}
# Allow basic interaction with PKI secrets engine
path "/pki/" {
capabilities = ["read"]
}
# Allow signing intermediates
path "/pki/root/sign-intermediate" {
capabilities = ["update"]
}
# Allow all/any interactions with the Connect intermediate
path "/connect-intermediate-dc1/*" {
capabilities = ["create", "read", "update", "delete", "list"]
}
# Allow the renewal of own token
path "auth/token/renew-self" {
capabilities = ["update"]
}
# Allow looking up own token
path "auth/token/lookup-self" {
capabilities = ["read"]
}
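As a rough sanity check of the policy above, here is a minimal Python sketch that tests whether the failing request path is covered. It is a simplified approximation, not Vault's actual ACL engine: it assumes a leading slash in a policy path is ignored, that a trailing `*` means prefix match, and that the longest matching path wins. The names `POLICY`, `path_matches`, and `capabilities_for` are made up for illustration.

```python
# Simplified model of the consul-cluster policy (paths copied from the policy above).
POLICY = {
    "pki/issue/consul-cluster": ["update"],
    "/sys/mounts": ["read"],
    "/sys/mounts/pki": ["read"],
    "/sys/mounts/connect-intermediate-dc1": ["read"],
    "/sys/mounts/connect-intermediate-dc1/tune": ["update"],
    "/pki/": ["read"],
    "/pki/root/sign-intermediate": ["update"],
    "/connect-intermediate-dc1/*": ["create", "read", "update", "delete", "list"],
    "auth/token/renew-self": ["update"],
    "auth/token/lookup-self": ["read"],
}


def normalize(path: str) -> str:
    # Assumption: a leading slash in a policy path is not significant.
    return path.lstrip("/")


def path_matches(policy_path: str, request_path: str) -> bool:
    policy_path = normalize(policy_path)
    request_path = normalize(request_path)
    if policy_path.endswith("*"):
        # Trailing glob: treat everything before the '*' as a prefix.
        return request_path.startswith(policy_path[:-1])
    return policy_path == request_path


def capabilities_for(request_path: str) -> list:
    # Approximation: the longest matching policy path is the most specific one.
    matches = [(p, caps) for p, caps in POLICY.items() if path_matches(p, request_path)]
    if not matches:
        return []
    return max(matches, key=lambda m: len(normalize(m[0])))[1]


# The failing call is a PUT, which needs the "update" capability.
print(capabilities_for("connect-intermediate-dc1/sign/leaf-cert"))
```

Under these assumptions the policy grants `update` on the failing path, which would point at the token being used rather than at the policy text.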
Reproduction Steps
Steps to reproduce this issue, eg:
- Create a cluster with n client nodes and n server nodes
- Run
curl ...
- View error
Consul info for both Client and Server
Client info
agent:
check_monitors = 0
check_ttls = 0
checks = 0
services = 0
build:
prerelease =
revision = bd257019
version = 1.14.3
version_metadata =
consul:
acl = disabled
bootstrap = false
known_datacenters = 1
leader = true
leader_addr = 10.18.248.75:8300
server = true
raft:
applied_index = 518042
commit_index = 518042
fsm_pending = 0
last_contact = 0
last_log_index = 518042
last_log_term = 11
last_snapshot_index = 508031
last_snapshot_term = 11
latest_configuration = [{Suffrage:Voter ID:8417e1d8-97d4-ab5e-c5c3-7a092a75b29f Address:10.18.30.85:8300} {Suffrage:Voter ID:e08569e7-2d8c-dfc2-9a43-95d5d4049a0f Address:10.18.248.75:8300}]
latest_configuration_index = 0
num_peers = 1
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Leader
term = 11
runtime:
arch = amd64
cpu_count = 2
goroutines = 423
max_procs = 2
os = linux
version = go1.19.4
serf_lan:
coordinate_resets = 0
encrypted = false
event_queue = 0
event_time = 10
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 34956
members = 10
query_queue = 0
query_time = 1
serf_wan:
coordinate_resets = 0
encrypted = false
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 4
members = 2
query_queue = 0
query_time = 1
Server info
Nomad server: v1.4.3
Operating system and Environment details
Linux 5.4.0-132-generic #148-Ubuntu SMP Mon Oct 17 16:02:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Hi @karelorigin,
Your Vault policy looks correct to me. It follows the suggestions in the documentation for "Vault managed PKI paths".
You mentioned that things start to break after 72 hours. Consul service mesh leaf certificates have a 72 hour TTL by default. The error message involving Vault API path /v1/connect-intermediate-dc1/sign/leaf-cert suggests that Consul is attempting to generate a new leaf certificate to replace one approaching its 72 hour expiry, but that operation is failing. After 72 hours, the leaf certificate expires without having been replaced.
Are you observing any failed Vault API calls to auth/token/renew-self in the logs? Or to auth/token/lookup-self?
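To make the timing concrete, here is a small Python sketch of the arithmetic, assuming the default 72 hour leaf TTL mentioned above. The issue time is a hypothetical value chosen for illustration, not taken from the logs, and `leaf_expiry` and `is_expired` are illustrative helper names.

```python
from datetime import datetime, timedelta, timezone

# Default leaf certificate TTL as stated above.
LEAF_TTL = timedelta(hours=72)


def leaf_expiry(issued_at: datetime) -> datetime:
    # A leaf certificate that is never replaced expires one TTL after issuance.
    return issued_at + LEAF_TTL


def is_expired(issued_at: datetime, now: datetime) -> bool:
    return now >= leaf_expiry(issued_at)


# Hypothetical issue time, for illustration only.
issued = datetime(2023, 1, 25, 4, 0, tzinfo=timezone.utc)
print(leaf_expiry(issued).isoformat())
print(is_expired(issued, datetime(2023, 1, 28, 3, 30, tzinfo=timezone.utc)))
```

If every signing attempt before that expiry time returns a 403, the old certificate simply runs out, which would match deployments breaking roughly three days in.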
Hey @jkirschner-hashicorp,
Thanks for looking into this so quickly! :D I've grepped through the logs for anything Vault related and didn't find any errors for /renew-self or /lookup-self. The only other error (that I hadn't spotted before) is a connect.ca.vault login error that seems to occur once a day.
Jan 29 13:22:02 tf-srv-xenodochial-wing consul[128357]: 2023-01-29T13:22:02.030Z [ERROR] connect.ca.vault: Error login in to Vault with %q auth method: EXTRA_VALUE_AT_END=cert
That could explain the problem. I suspect that it might be related to https://github.com/hashicorp/vault/issues/18562, which is a Vault bug I recently discovered and reported. It would make sense for Consul to inherit the issue, since it relies on Vault libraries.
Note: The authentication certs are valid for 24 hours and renewed twice a day. Certificate validity is not the issue.
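For anyone wanting to repeat the log search described above, here is a minimal Python sketch of the same idea. The marker strings are my own choice of what counts as "Vault related", and the sample contains two lines copied from this thread plus one placeholder line that is not real Consul output.

```python
# Substrings treated as "Vault related" (an assumption, adjust as needed).
VAULT_MARKERS = (
    "connect.ca.vault",
    "auth/token/renew-self",
    "auth/token/lookup-self",
    "vault.service.consul",
)


def vault_related(lines):
    # Keep only the lines that mention at least one marker.
    return [line for line in lines if any(m in line for m in VAULT_MARKERS)]


sample = [
    "Jan 28 03:30:32 tf-srv-quirky-bassi consul[278885]: | URL: PUT https://vault.service.consul:8200/v1/connect-intermediate-dc1/sign/leaf-cert",
    "Jan 29 13:22:02 tf-srv-xenodochial-wing consul[128357]: 2023-01-29T13:22:02.030Z [ERROR] connect.ca.vault: Error login in to Vault with %q auth method: EXTRA_VALUE_AT_END=cert",
    "placeholder line with no marker in it",  # not real log output
]

for line in vault_related(sample):
    print(line)
```

Reading from journald output would only need `sample` swapped for the lines of the exported log file.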
It looks like that Vault fix is intended to be released in Vault 1.14.0: https://github.com/hashicorp/vault/pull/19002#issuecomment-1479733111
Closing as resolved by hashicorp/vault#19002