nomad icon indicating copy to clipboard operation
nomad copied to clipboard

Nomad gets credential from vault twice, exports the first as env variable, but only renews the second

Open nobert opened this issue 3 years ago • 0 comments
trafficstars

Nomad version

1.4.1

Operating system and Environment details

Debian Bullseye

Issue

We dynamically provision mysql credentials using Vault (v1.12.0). This generally works fine, but we've found that many allocations will have their mysql credentials expire exactly 4 hours (our credential TTL) after launch. We're still not sure of the root cause, but we've found the following:

Upon launch, an allocation will sometimes request a mysql credential from vault twice (or even 3 times) in <= 1 second. The first one is exported via an environment variable, which is ultimately what the task uses. However, the second credential is written to secrets.env without restarting the allocation (despite ChangeMode: restart), and that is ultimately the credential that nomad manages, by continuously renewing. Eventually, the first credential expires, and the task loses access to mysql.

We initially saw that nomad was writing secrets.env twice in quick succession on an allocation launch, but I expect that is a symptom, i.e. it is writing secrets.env because it requested and received a new mysql credential.

It is seemingly random. It can happen to seemingly any job (we have ~30 different jobs on this cluster), on any host (there are 4), but not 100% of the time.

We've also noticed a similar behaviour when the nomad client restarts, which we'd initially attributed to the bug described in #13198, but now we're not so sure.

I think this behaviour started, or perhaps accelerated, when we upgraded from 1.3.5 to 1.4.1. We do have several other clusters running 1.4.1, where the behaviour is either not happening, or is less noticeable because the clusters are much lower-use in terms of # of jobs and allocations, and how frequently new allocations are launched.

Reproduction steps

We can recreate the problem just by launching allocations, but I do not know how to recreate an environment in which this can happen

Expected Result

Nomad does not request the same mysql credential multiple times on launch, and as such does not lose track of one

Actual Result

Nomad does request the same mysql credential multiple times on launch, but only renews one, and as such the other(s) expire

Job file

I can provide a scrubbed version if necessary. They're rather large. 😄

Nomad Client logs

A new allocation started at 12:20 CET. It immediately requested the mysql credential twice, and rendered secrets.env twice:

2022-10-24T12:20:46.157+0200 [INFO]  agent: (runner) rendered "(dynamic)" => "/srv/nomad/alloc/4189c642-7dc1-fe04-7e57-219f23f59e3d/application-main/secrets/secrets.env"
2022-10-24T12:20:46.281+0200 [INFO]  agent: (runner) rendered "(dynamic)" => "/srv/nomad/alloc/4189c642-7dc1-fe04-7e57-219f23f59e3d/application-main/secrets/secrets.env"

Credentials from vault audit logs (times in UTC):

"time": "2022-10-24T10:20:46.16171291Z", "lease_id": "mysql/secrets/creds/application/ZyfJBT9EcRnn7KQP7eEh6cJq"
"time": "2022-10-24T10:20:46.284807826Z", "lease_id": "mysql/secrets/creds/application/EoHS80Hi5HKwdvtMyeMa4LgU"

Exactly 4 hours later, at 16:20:46, The first of those two leases expired:

Oct 24 16:20:46 vm0417 vault[11395]: 2022-10-24T16:20:46.215+0200 [INFO]  expiration: revoked lease: lease_id=mysql/secrets/creds/application/ZyfJBT9EcRnn7KQP7eEh6cJq

Meanwhile, nomad continued to renew the second credential every 3 hours (times in UTC):

"time": "2022-10-24T13:09:44.228902823Z",
"lease_id": "mysql/secrets/creds/application/EoHS80Hi5HKwdvtMyeMa4LgU"
"time": "2022-10-24T15:58:42.174947264Z",
"lease_id": "mysql/secrets/creds/application/EoHS80Hi5HKwdvtMyeMa4LgU"
"time": "2022-10-24T18:47:40.121136561Z",
"lease_id": "mysql/secrets/creds/application/EoHS80Hi5HKwdvtMyeMa4LgU"
"time": "2022-10-24T21:36:38.068447975Z",
"lease_id": "mysql/secrets/creds/application/EoHS80Hi5HKwdvtMyeMa4LgU"

The secrets.env file for this allocation contains with a credential which exists on the database server, but the environment variable for the running process (as seen via /proc/[pid]/environ) contains a credential which does not exist.

I do have trace logs available, but I'm not sure what to post from it. They're so verbose in some ways, e.g., every template evaluation is logged, while seemingly missing other things I want, e.g., a trigger or diff when the template is updated.

nobert avatar Oct 27 '22 10:10 nobert