
Agent hangs and doesn't refresh secret immediately

7fELF opened this issue

Describe the bug

With renewable secrets with max_ttl set, the secret sometimes expires while the agent is sleeping. The agent wakes up after a while (could be hours later), realizes the lease doesn't exist, and only then refreshes the secret.

To Reproduce

Server:

wget https://github.com/exoscale/vault-plugin-secrets-exoscale/releases/download/v0.2.3/vault-plugin-secrets-exoscale_0.2.3_linux_amd64.tar.gz

tar xvf vault-plugin-secrets-exoscale_0.2.3_linux_amd64.tar.gz vault-plugin-secrets-exoscale

rm vault-plugin-secrets-exoscale_0.2.3_linux_amd64.tar.gz

vault server -dev -dev-root-token-id=root -dev-plugin-dir=$PWD

Plugin:

export VAULT_ADDR='http://127.0.0.1:8200'
export EXOSCALE_API_KEY='EXOb82f047a77d6af2563d83f81'
export EXOSCALE_API_SECRET='(redacted)'
VAULT_PLUGIN_SUM="$(sha256sum vault-plugin-secrets-exoscale | cut -d " " -f 1)"

vault plugin register \
	-sha256="${VAULT_PLUGIN_SUM}" \
	-command=vault-plugin-secrets-exoscale \
	secret vault-plugin-secrets-exoscale

vault secrets enable -plugin-name="vault-plugin-secrets-exoscale" -path "exoscale" plugin

vault write exoscale/config/root         \
    root_api_key=${EXOSCALE_API_KEY}       \
    root_api_secret=${EXOSCALE_API_SECRET} \
    zone=ch-gva-2

 vault write exoscale/role/list-only \
	operations=list-zones \
    ttl=10m \
    max_ttl=30m

vault read exoscale/apikey/list-only
vault lease revoke exoscale/apikey/list-only/XXXX

vault policy write "readsecret"  - <<EOF
path "exoscale/apikey/list-only" {
  capabilities = ["read"]
}
EOF

vault auth enable approle
vault write auth/approle/role/my-role \
    secret_id_ttl=240h \
    token_num_uses=1000 \
    token_ttl=240h \
    token_max_ttl=240h \
    secret_id_num_uses=1000 \
    token_policies=readsecret

vault read auth/approle/role/my-role/role-id -format=json | jq .data.role_id -r > role-id
vault write -f auth/approle/role/my-role/secret-id -format=json | jq .data.secret_id -r > secret-id

agent.hcl:

## Server
vault {
  address = "http://localhost:8200"
}

listener "unix" {
  address      = "./vault-agent.socket"
  tls_disable  = true
  socket_mode  = "0600"
}

## Authentication
auto_auth {
    method {
        type      = "approle"
        config = {
            role_id_file_path = "./role-id"
            secret_id_file_path = "./secret-id"
            remove_secret_id_file_after_reading = false
        }
    }
}

cache {
    use_auto_auth_token = true
}


template {
    destination      = "/home/antoine/iam.env"
    perms            = "0640"

    contents = <<EOH
{{ with secret "exoscale/apikey/list-only" -}}
EXOSCALE_API_KEY='{{ .Data.api_key }}'
EXOSCALE_API_SECRET='(redacted)'
{{ end }}
EOH
    error_on_missing_key = true
}

Start the agent:

vault agent -config ./agent.hcl -log-level=trace

Wait (might take a few hours to trigger the issue):

2022-07-20T07:27:08.574+0200 [WARN] vault.read(exoscale/apikey/list-only): failed to renew: Error making API request.

URL: PUT http://127.0.0.1:8200/v1/sys/leases/renew
Code: 400. Errors:

* lease not found

Expected behavior

The secret should be refreshed when we enter the 10-20% grace period.

Environment:

  • Vault Server Version (retrieve with vault status):
  • Vault CLI Version (retrieve with vault version): Vault v1.10.4 (6a1dde56c18c4a1be2756b931ce3c872d8ca5a76)
  • Server Operating System/Architecture: Linux 5.4.x / Ubuntu 20.04 (focal) / amd64

Vault server configuration file(s): not relevant (can reproduce with the server started in -dev mode without a config file)

Additional context

We use use_auto_auth_token = true, which automatically enables caching; I suspect this is related to the LeaseCache.

From what I understand of the internal behavior of vault agent this is what happens (crafted example):

It is important to know that our plugin might not extend the lease, to prevent it from being capped by CalculateTTL (max_ttl). The reason why we do this probably needs its own issue, but the TL;DR is that if we don't do that, the grace period ends up being 10 to 20% of the capped value (which is unpredictable). We want to be able to guarantee that it is at least 10% of the full TTL (see the sketch after the list below). This should not have any bearing on this issue for two reasons:

  • I believe the plugin should be able, at its discretion, to refuse to extend the lease.
  • In the scenario above, the Consul Template wakes up at time 51h50m, long after the max_ttl; even if the plugin had renewed the secret, it would have been capped and would have expired at 48h0m0s.
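
For illustration, here is a simplified sketch of the capping behaviour I mean (these names are mine, not the real sdk/framework CalculateTTL signature):

import "time"

// Simplified illustration of max_ttl capping: a renewal can never extend the
// lease past max_ttl, so the TTL the watcher sees shrinks as the lease ages,
// and a 10-20% grace of that capped value shrinks with it.
func cappedTTL(requested, maxTTL, elapsed time.Duration) time.Duration {
	remaining := maxTTL - elapsed
	if remaining <= 0 {
		return 0 // max_ttl reached: the lease cannot be extended at all
	}
	if requested > remaining {
		return remaining // capped: the grace window now tracks this shrinking value
	}
	return requested
}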

7fELF avatar Jul 25 '22 14:07 7fELF

OK, did a lot of looking at this problem. The graph is great. It seems like the core of the problem is that the Cache isn't updating the TTL when pulling items out of the cache in checkCacheForRequest. Most of the chart can be ignored, but the transaction at 34h19m is critical. The Cache pulled a lease with a 24-hour TTL ~17 hours earlier, but when the Template Agent requests it from the Cache, the Cache indicates that the TTL is still 24 hours. The correct response should be 24 - 17 = 7 hours left on the TTL.

If I put {"lease_id": "foo", "renewable": true, "data": {"value": "foo"}, "lease_duration": 600} into the cache, and don't change the lease_duration when responding to cache hits, then downstream caches will have incorrect lease_durations.

I'm not sure if the graph is correct here. Trying to find underlying code that indicates this.

Is this avoidable for non-renewable secrets?

The problem is reduced, but still exists. Here's an example.

0 -> Pod A requests credentials not in the cache, with a TTL of 2 time periods.
1 -> Pod B starts and requests credentials that are in the cache. The cache says the credential has a TTL of 2 time periods; however, the effective TTL is 1, as 1 time period has already elapsed.
2 -> The credential expires.
3 -> Pod B wakes up late to discover its credential already expired.

No renewals necessary. The cache is returning bad TTL information.

CacheMeta.Age

This actually contains the age of a cached response and should show up in the Age header. The receiving client should update the TTL based on this information...

Still looking for where this header is used.
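
If the Age header were honoured, the client-side correction could be as small as this sketch (a hypothetical helper; nothing in Agent does this today, which is the point):

import (
	"net/http"
	"strconv"
)

// effectiveTTL shaves the proxy-reported Age (delta-seconds, HTTP cache
// semantics) off the lease_duration so the consumer sees the true remaining
// lifetime of the cached secret.
func effectiveTTL(resp *http.Response, leaseDuration int) int {
	age, err := strconv.Atoi(resp.Header.Get("Age"))
	if err != nil || age < 0 {
		return leaseDuration // no usable Age header: trust the reported TTL
	}
	if age >= leaseDuration {
		return 0 // the cached entry has already outlived its lease
	}
	return leaseDuration - age
}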


Specific plugin logic aside, it's a major issue for me that the agent doesn't provision a new secret during the "max ttl grace period".

If I see:

URL: PUT http://127.0.0.1:8200/v1/sys/leases/renew
Code: 400. Errors:

* lease not found

to me this means that Vault has revoked my lease, the credentials my application is using are no longer valid, and my application is dead for 5-10 minutes while the plugin instantiates new credentials (yes, this plugin is very slow).

Perhaps if I make the secret non-renewable I will receive a new credential while the old credential is still valid.


the function of concern being doRenewWithOptions

Though when this does work correctly the logic comes from sdk/framework/lease.go#CalculateTTL.


Reading the diagram more, it seems like the issue is that the sleep period is always determined by the TTL, but it should instead be determined by the "effective max ttl", which is min(current_ttl_grace, max_ttl_grace).
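
In code that is just taking the smaller of the two horizons (names are illustrative, not Vault's):

import "time"

// effectiveDeadline bounds the sleep by whichever runs out first: the
// current lease TTL or the time left before max_ttl.
func effectiveDeadline(currentTTL, remainingMaxTTL time.Duration) time.Duration {
	if remainingMaxTTL < currentTTL {
		return remainingMaxTTL
	}
	return currentTTL
}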


Revising thinking, it seems like the core issue is that the Vault Agent Cache is not updating the TTL to match the true remaining TTL when serving responses. Perhaps this is just the diagram. Needs confirmation.

Freyert avatar Aug 24 '22 18:08 Freyert

I've been looking everywhere to see if Vault Agent uses CacheMeta.Age to calculate the effective TTL. I'm not seeing it anywhere...

vault/api/sys_leases.go#RenewWithContext is not cache aware.

❓ Does RenewWithContext need to be cache aware if it tracks the increment? Yes, because if RenewWithContext returns an incremented lease that has "aged" but the age is not reflected in the token, the consumer will sleep for the incorrect duration.

Freyert avatar Aug 25 '22 14:08 Freyert

I found that during Send the lease_cache actually parses the secret, which is an opportunity to correct the TTL based on age.

https://github.com/hashicorp/vault/blob/main/command/agent/cache/lease_cache.go#L307-L311

This occurs after we check the cache and return if we have a hit. Meaning this happens on a miss. https://github.com/hashicorp/vault/blob/main/command/agent/cache/lease_cache.go#L268-L275

If we adjust the result of ParseSecret here, we can correct the TTL for the case where the local cache is cold but the remote cache is hot and aged.
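
Roughly what I have in mind (a hypothetical helper, not the current lease_cache code; api.Secret and the Age header are real):

import (
	"net/http"
	"strconv"

	"github.com/hashicorp/vault/api"
)

// adjustForAge back-dates a secret parsed from an upstream cache hit by
// subtracting the Age header from its LeaseDuration, so the local cache
// index reflects the true remaining lifetime.
func adjustForAge(secret *api.Secret, resp *http.Response) {
	age, err := strconv.Atoi(resp.Header.Get("Age"))
	if err != nil || age <= 0 {
		return // no usable Age header: leave the parsed secret untouched
	}
	if secret.LeaseDuration > age {
		secret.LeaseDuration -= age
	} else {
		secret.LeaseDuration = 0
	}
}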

Freyert avatar Aug 26 '22 13:08 Freyert

Thanks for looking into this @Freyert!

Correcting the TTL before returning the cached value would indeed fix the issue. However, I see a potential problem with that: how will this impact the grace period in the LifetimeWatcher?

We need the grace period to be a predictable amount of time (i.e. at least 10% of the TTL we set in the config). This allows the software using the secret to finish ongoing business with the old secret (e.g. ongoing multipart uploads).

I think the grace period should be calculated using the full TTL and then the amount of time to sleep should be calculated based on the corrected TTL.
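
A minimal sketch of that split, with illustrative names only (the 10-20% jitter mirrors the grace calculation discussed later in this thread):

import (
	"math/rand"
	"time"
)

// graceAndSleep derives the grace window from the full configured TTL (so it
// stays predictable) and the sleep from the age-corrected remaining TTL.
func graceAndSleep(fullTTL, age time.Duration, rnd *rand.Rand) (grace, sleep time.Duration) {
	if fullTTL <= 0 {
		return 0, 0
	}
	grace = fullTTL/10 + time.Duration(rnd.Int63n(int64(fullTTL/10)+1)) // 10-20% of the full TTL
	remaining := fullTTL - age
	if remaining < 0 {
		remaining = 0
	}
	sleep = remaining*2/3 + grace/3
	return grace, sleep
}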

That would require the LifetimeWatcher to be cache aware, which, as you noted previously, is not easy without a breaking change to client.Sys().Renew.

7fELF avatar Aug 26 '22 13:08 7fELF

@7fELF I was just coming into contact with this problem.

I like this idea. So it would be best if we were able to intercept the "Age" header in any request.

I was imagining embedding CacheMetadata into the secrets themselves...

Freyert avatar Aug 26 '22 13:08 Freyert

Seems like the whole thing needs a larger rework.

Age header -> not used at all.
Tokens -> have a TTL, but are not aware of age.

WrappedSecrets do have creation info...

type SecretWrapInfo struct {
	Token           string    `json:"token"`
	Accessor        string    `json:"accessor"`
	TTL             int       `json:"ttl"`
	CreationTime    time.Time `json:"creation_time"`
	CreationPath    string    `json:"creation_path"`
	WrappedAccessor string    `json:"wrapped_accessor"`
}

I imagine most secrets have a creation time on them; it's just that metadata is a map[string]interface{}, so it's not documented in the type.

Freyert avatar Aug 26 '22 13:08 Freyert

Looking at sys/leases:

The read-lease endpoint includes the issue_time and expire_time:

{
  "id": "auth/token/create/25c75065466dfc5f920525feafe47502c4c9915c",
  "issue_time": "2017-04-30T10:18:11.228946471-04:00",
  "expire_time": "2017-04-30T11:18:11.228946708-04:00",
  "last_renewal_time": null,
  "renewable": true,
  "ttl": 3558
}

Whereas renew does not:

{
  "lease_id": "aws/creds/deploy/abcd-1234...",
  "renewable": true,
  "lease_duration": 2764790
}

Breakthrough idea: the renew response should carry the actual secret with full lease information. Focus on including the renewal time in the response from renew.
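
For illustration only, a renew response that carried its own timing could look like this (not an existing Vault response shape; the last_renewal_time field is the proposal):

{
  "lease_id": "aws/creds/deploy/abcd-1234...",
  "renewable": true,
  "lease_duration": 2764790,
  "last_renewal_time": "2017-04-30T10:18:11-04:00"
}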

The response comes from this method: https://github.com/hashicorp/vault/blob/fea1d13f8d78243fa3561dcfff2d1e4581a96f42/vault/expiration.go#L1185


The issue time of the renew should exist!!! But I think it gets lost in translation, because core uses logical.Secret while Sys().Renew uses api.Secret:

https://github.com/hashicorp/vault/blob/fea1d13f8d78243fa3561dcfff2d1e4581a96f42/vault/expiration.go#L1941-L1949

Freyert avatar Aug 26 '22 14:08 Freyert

@hsimon-hashicorp I think we're a little stuck here and this seems to be a high severity issue.

Without modifying the logicalResponse -> httpResponse data to include a "responseTime" indicating when the response was handled (or something like this), we can't trust the cache to wake at the correct time to renew credentials.

The cache is essentially non-functional with just the TTL. The cache's lifetimeWatcher needs the original time to establish a correct wake time.

Any chance we could get an expert in these systems to weigh in on good points to modify? Perhaps an expert would know a better point to modify without creating massive change.

Freyert avatar Aug 26 '22 15:08 Freyert

@7fELF I took your suggestions and went looking for a way to timestamp all the responses instead of working with the Age header.

https://github.com/hashicorp/vault/commit/45667354700429e554d096612b599395e3d516a1

Every response has a request_id, so why not add a request_timestamp? I think the request_id is connected with the original request, so maybe the name should be response_timestamp or just timestamp.
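
Roughly the shape I'm suggesting (field names are proposals, not an existing Vault type):

import "time"

// timestampedResponse sketches a response envelope that records when it was
// produced, so any cache layer can derive the true remaining TTL.
type timestampedResponse struct {
	RequestID         string    `json:"request_id"`
	ResponseTimestamp time.Time `json:"response_timestamp"`
	LeaseID           string    `json:"lease_id"`
	LeaseDuration     int       `json:"lease_duration"`
	Renewable         bool      `json:"renewable"`
}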

Freyert avatar Aug 26 '22 19:08 Freyert

With the Age header they're trying to follow HTTP Cache semantics https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control

Freyert avatar Aug 26 '22 19:08 Freyert

To simplify the discussion so far I put together a flow diagram. There are 2 points where the Age header is either present but currently ignored, or where there is no way to include the Age at all; these are highlighted in orange.

graph LR
   subgraph initProc
    miss[cache miss] --> fwd[forward request]
    fwd --> fill[insert cache entry]:::aged
   end

   subgraph renewProc
     renew:::aged --> calc
     calc[calculate grace period] --> sleep 
   end

   fill --> renew
   sleep --> renew

   classDef aged fill:#f96;

Renew

The main issue with renew is that we use a function from the api package that is ignorant of the Age header.

There are worries that we need to know the original TTL + the Age in order to correctly calculate a grace period. https://github.com/hashicorp/vault/issues/16439#issuecomment-1228495899

The math for calculating sleep:

$$
remainingLease = \begin{cases}
renewDisabled: & loopStart + priorDuration - time.Now() \\
renewalFail: & loopStart + leaseDuration - time.Now() \\
success: & leaseDuration
\end{cases}
$$

$$
minDuration = \begin{cases}
(leaseDuration > increment) \land (increment > 0): & increment \\
otherwise: & leaseDuration
\end{cases}
$$

$$
grace = \begin{cases}
minDuration \leq 0: & 0 \\
otherwise: & 0.1 \cdot leaseDuration + (random \bmod 0.1 \cdot leaseDuration)
\end{cases}
$$

$$
sleepDuration = \tfrac{2}{3} \cdot remainingLease + \tfrac{1}{3} \cdot grace
$$

which, in the success case, expands to:

$$
sleepDuration = \tfrac{2}{3} \cdot leaseDuration + \tfrac{1}{3} \left( 0.1 \cdot leaseDuration + (random \bmod 0.1 \cdot leaseDuration) \right)
$$

If we modify the leaseDuration by subtracting Age:

$$
sleepDuration = \tfrac{2}{3} (leaseDuration - Age) + \tfrac{1}{3} \left( 0.1 \cdot (leaseDuration - Age) + (random \bmod 0.1 \cdot (leaseDuration - Age)) \right)
$$

If we modify the leaseDuration by the Age header, the Age value affects the equation too much: in the following example an age of 250 gives us a max sleepPeriod of 550, whereas what we would want is a max sleep period of 733.33 - 250 = 483.33.

                Original    Aged
leaseDuration   1000        750
Age             0           250
gracePeriod     200         150
sleepPeriod     733.33      550

What we really want is to just subtract the Age from the sleep period.
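
Which is just (an assumed helper, not Vault code):

import "time"

// agedSleep wakes the watcher as if it had held the secret since issuance;
// for the table above: 733.33 - 250 = 483.33.
func agedSleep(sleepDuration, age time.Duration) time.Duration {
	if age >= sleepDuration {
		return 0 // the entry has aged past its wake-up point: renew immediately
	}
	return sleepDuration - age
}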

We need to modify type renewFunc(string, int) (*Secret, error) to be type renewFunc(string, int) (*SendResponse, error).

Grace Period Calculation

lifetime_watcher.go#calculateGrace uses a pretty simple calculation to identify the gracePeriod.

Paired with the grace period is the sleep duration. The sleep duration is 2/3 of the total lease + 1/3 of the grace period (lifetime_watcher.go#L356).

Fresh

This one is a bit easier because we have access to the Age header.

lease_cache.go#Send works entirely with raw requests and therefore has access to the Age header.

The only change that would be required here is to modify the cache index's LastRenewed date to incorporate the Age.
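
Something along these lines (the field name is assumed for illustration):

import "time"

// backdateLastRenewed makes later freshness checks see the cache entry as
// old as the upstream cache says it is.
func backdateLastRenewed(lastRenewed time.Time, ageSeconds int) time.Time {
	return lastRenewed.Add(-time.Duration(ageSeconds) * time.Second)
}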

Freyert avatar Aug 29 '22 12:08 Freyert

We've decided to refocus on developing a test first and correcting the functionality from there.

We plan to drive the issue by "pretending" to be a lifetime watcher that sleeps too long. In the end we'll have a primary cache, and a secondary cache that receives aged tokens.

  1. Fill the primary cache
  2. Wait X time and then make the same request against the secondary cache which will pull the aged credential from the primary cache.
  3. Wait X time until the credential expires.
    • Check the secondary cache to see if the lease 400s when attempting to renew the lease.

Vault Cache Test Plan

Freyert avatar Sep 01 '22 20:09 Freyert

Hey @Freyert - great work and effort thus far. How have you progressed here out of interest?

aphorise avatar Sep 07 '22 11:09 aphorise

@aphorise we had to suspend work on this because we decided to mitigate the risks in other ways.

However this problem really creeps me out.

Our biggest problem was that if we started writing a solution without tests it was hard to tell if we were making things better or worse.

We did a ton of research on testing, but didn't end up writing one because the setup was so onerous. The test methodology seems sound and is described here: https://github.com/hashicorp/vault/issues/16439#issuecomment-1234734726

I would propose:

A. Throwing out the current PR.
B. Starting with a test to validate that this is a problem.
C. Delivering the fix.

Freyert avatar Nov 09 '22 17:11 Freyert

Now I'm reminded of the breaking change that we were facing:

I think the grace period should be calculated using the full TTL and then the amount of time to sleep should be calculated based on the corrected TTL.

That would require the LifetimeWatcher to be cache aware, which, as you noted previously, is not easy without a breaking change to client.Sys().Renew.

client.Sys().Renew() uses ParseSecret(r io.Reader), which parses the response body for the secret information and returns it.

This function is used extensively and I'm sure outside plugins depend on the functionality. Therefore:

👉 modifying ParseSecret to include headers is out of the question.

Freyert avatar Nov 13 '22 14:11 Freyert

Alright so I've got two PRs now:

  1. https://github.com/hashicorp/vault/pull/17919 refactoring to make changes to the sleep calculation easier.
  2. https://github.com/hashicorp/vault/pull/17921 introducing a "backwards compatible" change to introduce "Age" to secrets and parse "Age" from responses.

Freyert avatar Nov 13 '22 14:11 Freyert

@hsimon-hashicorp is there a better way to approach this problem? Get some hashicorp engineers involved? It really causes havoc, mayhem, and prevents us from adopting Vault for credential management.

Freyert avatar Jan 10 '23 14:01 Freyert

@hsimon-hashicorp is there a better way to approach this problem? Get some hashicorp engineers involved? It really causes havoc, mayhem, and prevents us from adopting Vault for credential management.

Thanks for bumping this up again! I see you've put a lot of time and energy into researching this; I'll get it in front of some engineers ASAP.

heatherezell avatar Jan 10 '23 18:01 heatherezell

hey @Freyert any chance you can update the last few points on the related PR #17919 - it's almost there & can hopefully make it into a future release soon. Thanks again for all your hard work!

aphorise avatar Feb 07 '23 17:02 aphorise

@Freyert, thanks for getting this started and all the work in figuring out the issue! I know I'd initially given some feedback on one of the PRs, but we’re going to do some more work within our engineering team for this (making sure to credit you for your contribution).

peteski22 avatar Feb 07 '23 18:02 peteski22

Hey! I'll take a look. So glad to see you all picking this up!

Freyert avatar Feb 07 '23 19:02 Freyert

@Freyert your time and effort is much appreciated.

What would be really useful would be to see if we can add a test which reproduces/validates this behaviour before we make any changes (a separate PR). Do you think that's something you would be interested in doing? I don't want to move this internally when you're still working on it to help solve the problem.

Thanks again :)

peteski22 avatar Feb 07 '23 20:02 peteski22

@peteski22 we dropped the ticket a while back. Our test case would look something like https://github.com/hashicorp/vault/issues/16439#issuecomment-1234734726.

We won't work on it further. We found that it was very difficult to get a test harness set up. I think there are many ways to test this, and our approach was very bulky and time-consuming. Starting a Vault server and two agents in an integration test was a lot to manage.

If there's an easier way to get a SecretResponse with an Age header into a lifetime watcher I'd be interested in hearing about it. That way you could show the exact problem without all the overhead of setting up a server and two agents.

Freyert avatar Feb 07 '23 20:02 Freyert

@7fELF, @Freyert

Once again thanks for your patience and the time you've put into reporting the issue and trying to investigate it. We have also been spending some time digging into this further. We haven’t been able to reproduce manually, or with a test, the problematic behavior of Agent sleeping past the time when it needs to renew a secret.

The agent wakes up after a while (could be hours after) and realizes the lease doesn't exist and then refreshes the secret.

It might help to clarify some things about max TTL and Age first, just so we're all on the same page.

max_ttl: (We'll use the terms 'token' and 'secret' interchangeably here)... When an operator/admin configures a max_ttl, that's the maximum TTL that can ever be reached for a lease, regardless of how many times a secret is renewed. Once that max_ttl is reached, even if the secret 'claims' to be renewable it can't be, and you'd have to create (re-lease) an entirely new secret.

Age (header): When using Vault Agent as a (caching) proxy to Vault, we cache the responses that Agent receives from Vault. When the response is returned to the calling client, Vault Agent will augment that response (where appropriate) with an Age header to let the client know how long that response has been in the cache. The ttl within the cached response isn't adjusted, and the client would need to decide how it wanted to handle the information it's been given in totality. We're wondering if this might have caused some confusion.

Having reviewed the thread and attempted to reproduce, we can’t really tell what the problem is. Perhaps the secret has reached its max TTL, can’t be renewed, but the age headers are difficult to interpret. Perhaps there’s a bug where under some circumstances renewal is failing. Maybe there’s a problem somewhere else.

Being able to either write a failing test or providing simpler reproduction steps would help us confirm specifically what the problematic behavior is. Without one of those, we’ve reached the limits of this investigation and will need to move on to other items. To @Freyert's point, we understand this isn't trivial - we weren't able to do this in a way that demonstrated unexpected behavior.

It seems like @7fELF and @Freyert have identified similar but slightly different issues, and keeping them separate from each other would help rather than hinder our future investigations.

To reduce confusion and to represent that we've reached the end of our current investigation, I'm going to close this issue, and ask for new GitHub issues reflecting specific problems with associated reproduction steps.

mladlow avatar Feb 10 '23 18:02 mladlow

Makes sense! I think the thread is really complicated and it may be better to start with the root of the issue and then work up.

Or just summarize all the work so far into a clearer description. I'll write one now.

Freyert avatar Feb 16 '23 19:02 Freyert