vault icon indicating copy to clipboard operation
vault copied to clipboard

vault agent certificate renewal troubles

Open celesteking opened this issue 1 year ago • 6 comments

Have decided to replace our CA with Vault but I'm now facing 2 issues:

  1. Vault Agent won't reload the renewed certificate. There's an issue open but it says it was resolved. Apparently, it wasn't or this is a new issue. What's interesting is that it happens randomly and sporadically.
  2. ~~Vault Agent will DoS the server if it can't renew the certificate.~~

The setup over here is pretty straightforward: Vault pki is used to issue the certificates and Vault agent is used to TLS cert auth against Vault server to request a renewed certificate. It's a chicken-and-egg scenario where initial certificate is issued manually and then gets indefinitely renewed. It's a poor man's certbot alternative that was supposed to work beautifully but it doesn't.

Vault agent host:

Apr 09 08:58:33 host.net vault[25781]: 2024-04-09T08:58:33.501-0500 [INFO]  agent.auth.handler: renewed auth token
Apr 09 18:11:50 host.net vault[25781]: 2024-04-09T18:11:50.890-0500 [INFO]  agent.auth.handler: renewed auth token
Apr 10 02:07:37 host.net vault[25781]: 2024-04-10T02:07:37.389-0500 [WARN]  agent: vault.pki(pki/issue/endpoint-selfrenew->/var/lib/puppet/ssl/endpoint.pem): TTL "2160h0m0s" is longer than permitted maxTTL "720h0m0s", so maxTTL is being used
Apr 10 02:07:37 host.net vault[25781]: 2024-04-10T02:07:37.389-0500 [WARN]  agent: vault.pki(pki/issue/endpoint-selfrenew->/var/lib/puppet/ssl/endpoint.pem): TTL "2160h0m0s" is longer than permitted maxTTL "720h0m0s", so maxTTL is being used
Apr 10 02:07:37 host.net vault[25781]: 2024-04-10T02:07:37.434-0500 [INFO]  agent: (runner) rendered "(dynamic)" => "/var/lib/puppet/ssl/endpoint.pem"
Apr 10 03:25:08 host.net vault[25781]: 2024-04-10T03:25:08.252-0500 [INFO]  agent.auth.handler: renewed auth token
Apr 10 12:38:25 host.net vault[25781]: 2024-04-10T12:38:25.637-0500 [INFO]  agent.auth.handler: renewed auth token
Apr 10 23:58:33 host.net vault[25781]: 2024-04-10T23:58:33.798-0500 [INFO]  agent.auth.handler: lifetime watcher done channel triggered
  << around this time server's /var overflowed >>
Apr 10 23:58:33 host.net vault[25781]: 2024-04-10T23:58:33.814-0500 [INFO]  agent.auth.handler: authenticating
Apr 10 23:58:33 host.net vault[25781]: 2024-04-10T23:58:33.865-0500 [ERROR] agent.auth.handler: error authenticating:
Apr 10 23:58:33 host.net vault[25781]: error=
Apr 10 23:58:33 host.net vault[25781]: | Error making API request.
Apr 10 23:58:33 host.net vault[25781]: |
Apr 10 23:58:33 host.net vault[25781]: | URL: PUT https://vault.srv:8200/v1/auth/cert/login
Apr 10 23:58:33 host.net vault[25781]: | Code: 500. Errors:
Apr 10 23:58:33 host.net vault[25781]: |
Apr 10 23:58:33 host.net vault[25781]: | * internal error
Apr 10 23:58:33 host.net vault[25781]: backoff=980ms
Apr 10 23:58:34 host.net vault[25781]: 2024-04-10T23:58:34.849-0500 [INFO]  agent.auth.handler: authenticating
Apr 10 23:58:34 host.net vault[25781]: 2024-04-10T23:58:34.988-0500 [ERROR] agent.auth.handler: error authenticating:
Apr 10 23:58:34 host.net vault[25781]: error=
Apr 10 23:58:34 host.net vault[25781]: | Error making API request.
Apr 10 23:58:34 host.net vault[25781]: |
Apr 10 23:58:34 host.net vault[25781]: | URL: PUT https://vault.srv:8200/v1/auth/cert/login
Apr 10 23:58:34 host.net vault[25781]: | Code: 500. Errors:
Apr 10 23:58:34 host.net vault[25781]: |
Apr 10 23:58:34 host.net vault[25781]: | * internal error
Apr 10 23:58:34 host.net vault[25781]: backoff=980ms
Apr 10 23:58:36 host.net vault[25781]: 2024-04-10T23:58:36.590-0500 [INFO]  agent.auth.handler: authenticating
Apr 10 23:58:36 host.net vault[25781]: 2024-04-10T23:58:36.712-0500 [ERROR] agent.auth.handler: error authenticating:
Apr 10 23:58:36 host.net vault[25781]: error=
Apr 10 23:58:36 host.net vault[25781]: | Error making API request.
Apr 10 23:58:36 host.net vault[25781]: |
Apr 10 23:58:36 host.net vault[25781]: | URL: PUT https://vault.srv:8200/v1/auth/cert/login
Apr 10 23:58:36 host.net vault[25781]: | Code: 500. Errors:
Apr 10 23:58:36 host.net vault[25781]: |
Apr 10 23:58:36 host.net vault[25781]: | * internal error
Apr 10 23:58:36 host.net vault[25781]: backoff=1.6s
Apr 10 23:58:39 host.net vault[25781]: 2024-04-10T23:58:39.138-0500 [INFO]  agent.auth.handler: authenticating
Apr 10 23:58:39 host.net vault[25781]: 2024-04-10T23:58:39.279-0500 [ERROR] agent.auth.handler: error authenticating:
Apr 10 23:58:39 host.net vault[25781]: error=
Apr 10 23:58:39 host.net vault[25781]: | Error making API request.
Apr 10 23:58:39 host.net vault[25781]: |
Apr 10 23:58:39 host.net vault[25781]: | URL: PUT https://vault.srv:8200/v1/auth/cert/login
Apr 10 23:58:39 host.net vault[25781]: | Code: 500. Errors:
Apr 10 23:58:39 host.net vault[25781]: |
Apr 10 23:58:39 host.net vault[25781]: | * internal error
Apr 10 23:58:39 host.net vault[25781]: backoff=2.42s
... and so on each ~5min

But on Vault server (mind the TZ):

#... | jq [.time, .auth.metadata.common_name, .request.operation, .request.path, .error]' 
["2024-04-09T13:58:33.4973504Z","host.net","update","auth/token/renew-self",null]
["2024-04-09T23:11:50.889135802Z","host.net","update","auth/token/renew-self",null]
["2024-04-10T07:07:37.378823826Z","host.net","update","pki/issue/endpoint-selfrenew",null]
["2024-04-10T08:25:08.250091454Z","host.net","update","auth/token/renew-self",null]
["2024-04-10T17:38:25.635681677Z","host.net","update","auth/token/renew-self",null]
["2024-04-11T02:51:42.975081291Z","host.net","update","auth/token/renew-self","1 error occurred:\n\t* failed to renew entry: failed to verify client's certificate: x509: certificate has expired or is not yet valid: current time 2024-04-10T21:51:42-05:00 is after 2024-04-10T21:45:31Z\n\n"]
["2024-04-11T02:51:43.041792099Z","host.net","update","auth/token/renew-self","1 error occurred:\n\t* failed to renew entry: failed to verify client's certificate: x509: certificate has expired or is not yet valid: current time 2024-04-10T21:51:43-05:00 is after 2024-04-10T21:45:31Z\n\n"]
["2024-04-11T02:51:43.067654351Z","host.net","update","auth/token/renew-self","1 error occurred:\n\t* failed to renew entry: failed to verify client's certificate: x509: certificate has expired or is not yet valid: current time 2024-04-10T21:51:43-05:00 is after 2024-04-10T21:45:31Z\n\n"]
["2024-04-11T02:51:43.086646745Z","host.net","update","auth/token/renew-self","1 error occurred:\n\t* failed to renew entry: failed to verify client's certificate: x509: certificate has expired or is not yet valid: current time 2024-04-10T21:51:43-05:00 is after 2024-04-10T21:45:31Z\n\n"]
.............
["2024-04-11T03:14:00.809214144Z","host.net","update","auth/token/renew-self","1 error occurred:\n\t* failed to renew entry: failed to verify client's certificate: x509: certificate has expired or is not yet valid: current time 2024-04-10T22:14:00-05:00 is after 2>
["2024-04-11T03:14:00.829559363Z","host.net","update","auth/token/renew-self","1 error occurred:\n\t* failed to renew entry: failed to verify client's certificate: x509: certificate has expired or is not yet valid: current time 2024-04-10T22:14:00-05:00 is after 2>
["2024-04-11T03:14:00.848438908Z","host.net","update","auth/token/renew-self","1 error occurred:\n\t* failed to renew entry: failed to verify client's certificate: x509: certificate has expired or is not yet valid: current time 2024-04-10T22:14:00-05:00 is after 2>
["2024-04-11T03:14:00.868140783Z","host.net","update","auth/token/renew-self","1 error occurred:\n\t* failed to renew entry: failed to verify client's certificate: x509: certificate has expired or is not yet valid: current time 2024-04-10T22:14:00-05:00 is after 2>
["2024-04-11T03:14:00.888379511Z","host.net","update","auth/token/renew-self","1 error occurred:\n\t* failed to renew entry: failed to verify client's certificate: x509: certificate has expired or is not yet valid: current time 2024-04-10T22:14:00-05:00 is after 2>
["2024-04-11T03:14:00.909593973Z","host.net","update","auth/token/renew-self","1 error occurred:\n\t* failed to renew entry: failed to verify client's certificate: x509: certificate has expired or is not yet valid: current time 2024-04-10T22:14:00-05:00 is after 2>

It goes on and on, each 0.02 sec! As you see, it won't slow down, it hammers the server as much as it can and pollutes the audit log. This is a DoS in its pure form. That's just one host, imagine what happens when several agents decide go nuts.

disable_idle_connections = ["auto-auth", "templating"]
vault {
}
auto_auth {
  method {
    type   = "cert"
    config = {
      reload      = true
      client_cert = "/var/lib/puppet/ssl/endpoint.pem"
      client_key  = "/var/lib/puppet/ssl/endpoint.pem"
    }
  }
}
template_config {
  exit_on_retry_failure = true
}
template {
  error_on_missing_key = true
  create_dest_dirs     = false
  destination          = "/var/lib/puppet/ssl/endpoint.pem"
  contents             = <<-EOF
    {{- with pkiCert "pki/issue/endpoint-selfrenew" (print "common_name=" (env "HOSTNAME" | toLower ))  "ttl=90d" -}}
    {{- .Data.Cert }}
    {{- .Data.Key }}
    {{- end -}}
  EOF
}

# openssl x509 -dates  < /var/lib/puppet/ssl/endpoint.pem
notBefore=Apr 10 07:07:07 2024 GMT
notAfter=May 10 07:07:37 2024 GMT

I won't show server config as there's nothing interesting there, it's a dull almost empty config. v1.16.1 both on server and clients

Please help.

celesteking avatar Apr 11 '24 13:04 celesteking

Another point I'd like to bring up is the renewal trigger time. As per docs:

Vault Agent renews the certificate at half the lease duration of the original.

But it waited (2024-04-10T07:07:37) until almost expiration (2024-04-10T21:45:31Z) of the cert. As you see, they're issued for 1 month, so the cutoff time should've been 2024-03-25 or so.

celesteking avatar Apr 11 '24 13:04 celesteking

Do you get different behavior if changing template_config.lease_renewal_threshold (PR) from default 0.9 to, say, 0.5?

There is also template_config.static_secret_render_interval, but I am not sure if that is relevant here.

Legogris avatar Apr 16 '24 02:04 Legogris

That PR hasn't made into RPM yet thus I can't try. All this Vault PKI thing isn't Enterprise ready. It looks like there are no Enterprise customers who are using Vault in that fashion (private pki/certbot replacement) otherwise we would've seen tons of complaints.

As regarding the lease_renewal_threshold which you say is 90%, I wish it were 90%. Looking at logs, it stays around ~1% till cert expiration when the renewal kicks off -- that's way too close to the cutoff.

celesteking avatar Apr 17 '24 17:04 celesteking

Hi there! I just wanted to let you know that I linked this to a bug we have open in our backlog we're trying to prioritize soon. Renewal retries without backoff is obviously something we should address, and we're planning to and hoping to soon. I can't promise more than that at this point, but I wanted to let you know that this is on our mind.

VioletHynes avatar Apr 18 '24 13:04 VioletHynes

That's absolutely good news. Bullet 2) can be taken off the table then. But the first one is still relevant.

celesteking avatar Apr 19 '24 00:04 celesteking

Hey! Wanted to follow up that I've just merged https://github.com/hashicorp/vault/pull/26844 which should address the constant quick request issue. As to what's causing the 500s, unfortunately I can't speak to that and it will need to wait for further investigation. Still, I wanted to let you know that one half of this issue should be resolved now and is merged.

VioletHynes avatar May 09 '24 15:05 VioletHynes