Intermittent Issue with Pod Mutation Using Vault Secrets Webhook on Spot Instances

Open · aleksandrovpa opened this issue 1 year ago · 14 comments

Hello Vault Secrets Webhook Team,

I am currently using the Vault Secrets Webhook Helm chart version 1.19.0 for secret injection into pods. My setup, including the values.yaml, works well most of the time. Here's a snippet of my values.yaml for context:

certificate:
  generate: false
  useCertManager: true
replicaCount: 2
env:
  VAULT_ADDR: "https://vault.contoso.com"
resources:
  requests:
    cpu: 10m
    memory: 32Mi
  limits:
    cpu: 200m
    memory: 256Mi
rbac:
  authDelegatorRole:
    enabled: true
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: vault-secrets-webhook
            app.kubernetes.io/instance: vault-secrets-webhook
        topologyKey: kubernetes.io/hostname

The issue I am encountering is somewhat intermittent and occurs in a specific scenario. The pods of the webhook are hosted on spot instances in my Kubernetes cluster. Sometimes, when a node gets terminated and pods (including one of the webhook pods) are rescheduled to a different node, other service pods that are also being rescheduled do not get their secrets injected (i.e., the copy-vault-env init container is not created). This happens even though the webhook is deployed with 2 replicas, and the second replica is in a healthy state.

This behavior is puzzling, as I would expect the second, operational replica of the webhook to handle the mutation requests while the other is being rescheduled. As a temporary workaround, I have placed the webhook on static nodes, but I would like to understand why this issue occurs with spot instances.

Any insights or suggestions on this matter would be greatly appreciated.

Thank you for your support and for the great work on this project.

aleksandrovpa avatar Nov 17 '23 11:11 aleksandrovpa

We experience the same behavior. It happened under heavy load during node creation/decommissioning and pod migration. Our workaround is to set podsFailurePolicy: Fail and secretsFailurePolicy: Fail to avoid starting resources without injection, but this is not tested in production, so we're not sure how it will behave under real load.
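
In values.yaml that looks roughly like this (a minimal sketch; verify the exact value names against your chart version):

podsFailurePolicy: "Fail"      # reject admission instead of letting pods start without mutation
secretsFailurePolicy: "Fail"   # same for Secret objects

The trade-off is that while no webhook endpoint is reachable, the affected pod and secret admissions are rejected rather than silently skipped.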

johnny990 avatar Nov 17 '23 14:11 johnny990

Thanks @johnny990, I'll try this workaround. But it's a strange thing, because I definitely don't have high load...

aleksandrovpa avatar Nov 27 '23 22:11 aleksandrovpa

Hey @aleksandrovpa, we have experienced the same issue on a staging environment using Spot instances. However, our configuration is a bit different: we started with certificate.generate: true, where the certificate is generated by a Helm function (https://github.com/bank-vaults/vault-secrets-webhook/blob/main/deploy/charts/vault-secrets-webhook/templates/apiservice-webhook.yaml#L10C16-L10C29). Additionally, we use ArgoCD to manage our apps. We noticed that when a disruption occurs on K8s due to Spot instance recreation, the ArgoCD application goes into an 'Out of Sync' state, triggering another helm template run to validate what is in fact out of sync. With that execution of helm template, a new certificate is generated, creating a state where the pods have the old certificate mounted while the MutatingWebhookConfiguration object has the new one. We solved this temporarily by setting a fixed certificate with certificate.generate: false until we implement Cert Manager (which I believe should address this issue). But you've mentioned that you use Cert Manager and are still having issues... Are you using ArgoCD or any GitOps tool to install the mutator?
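
Roughly, the temporary fix looks like this in values.yaml (a sketch only; the certificate.server / certificate.ca paths are assumptions based on the chart's values.yaml, so check them for your chart version):

certificate:
  generate: false        # stop the Helm function from generating a new cert on every template run
  useCertManager: false
  # pre-generated, long-lived certificate (assumed value paths)
  server:
    tls:
      crt: <PEM server certificate>
      key: <PEM server private key>
  ca:
    crt: <PEM CA certificate>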

felippe-mendonca avatar Jan 04 '24 20:01 felippe-mendonca

Hi Felippe, yes, we use cert-manager, but this issue is not related to certificates. The main problem is that the second replica of the Vault webhook didn't mutate the pods where I expected env injection. I worked around the problem by putting the Vault webhook replicas on static nodes. By the way, this is not a highly available solution, and we still haven't gotten any answer from the Vault Secrets Webhook team...

aleksandrovpa avatar Jan 04 '24 21:01 aleksandrovpa

I believe the issue may not originate from the vsw itself; it seems to be related to the Kubernetes control plane, possibly involving the admission controller or the internal network. It appears that the admission controller directs requests to the vsw Service (Kind: Service), and these are then routed to an unavailable endpoint (presumably evicted at the time).

However, I'm not certain how to verify this hypothesis or address the problem. I am using Google Kubernetes Engine (GKE) for this setup. Any guidance on how to investigate this further or potential solutions would be greatly appreciated.
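
The fields that seem relevant to this hypothesis live on the MutatingWebhookConfiguration created by the chart; below is a sketch, trimmed to those fields, of what to compare against the webhook Service's Endpoints and pod readiness (object, webhook, and namespace names are assumptions that depend on the release name):

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: vault-secrets-webhook                 # assumed: usually the Helm release fullname
webhooks:
  - name: pods.vault-secrets-webhook.admission.banzaicloud.com   # hypothetical webhook name
    failurePolicy: Ignore     # with Ignore, a failed or timed-out call silently skips mutation
    timeoutSeconds: 10        # short timeout plus a stale endpoint = pod admitted without injection
    clientConfig:
      service:
        name: vault-secrets-webhook           # the API server calls the webhook through this Service
        namespace: vault-infra                # assumed namespace
        port: 443

If failurePolicy is Ignore and the Service briefly has no ready endpoints during a spot node replacement, the symptom would be exactly what is described above: pods admitted without the copy-vault-env init container.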

aleksandrovpa avatar Jan 19 '24 10:01 aleksandrovpa

Thank you for your contribution! This issue has been automatically marked as stale because it has no recent activity in the last 60 days. It will be closed in 20 days, if no further activity occurs. If this issue is still relevant, please leave a comment to let us know, and the stale label will be automatically removed.

github-actions[bot] avatar Mar 24 '24 00:03 github-actions[bot]

We also stumbled across this issue and scheduled the webhook onto our master nodes as a workaround, which is still not bulletproof. We're also too afraid of configuring the webhook with podsFailurePolicy: Fail, as we had trouble in the past with other webhooks that have this setting configured. I suspect the problem could be related to this: https://github.com/kubernetes/kubernetes/issues/80313. Someone there also wrote:

Through log analysis we also see only one OPA pod being hit with the webhooks and the other one basically being ignored.

Dbzman avatar Apr 19 '24 08:04 Dbzman

Short update: it doesn't seem to be strongly related to one of the webhook replicas being unavailable. We had the webhook running with 2 instances for the last 5 days, and yesterday we encountered this issue again.

Dbzman avatar Apr 24 '24 09:04 Dbzman

I have a similar problem: #405
I have only one webhook, and I'm not sure whether this is related to the migration of resources to other nodes. Most often, this problem occurs with a group of cronjobs that are deployed simultaneously.

efimenko-dmi avatar Apr 25 '24 18:04 efimenko-dmi

Same issue for us... Does anyone have a solution for this?

ksemele avatar May 08 '24 08:05 ksemele

Same issue for us too

Oriolemon avatar Jun 14 '24 15:06 Oriolemon

We are going to release all of our projects soon. I will notify you when that happens; please let us know if you still experience the same problem as stated in this ticket.

Meanwhile, please try out our new project: https://github.com/bank-vaults/secrets-webhook. It is an enhanced version of Vault Secrets Webhook. (NOTE: it should be 100% backwards compatible with your current VSW config, but keep in mind that there are a few differences; make sure you inspect the examples.)

csatib02 avatar Aug 18 '24 09:08 csatib02

Happens to us as well. We're on 1.19. It's a pretty critical issue :(

mdczaplicki avatar Aug 23 '24 13:08 mdczaplicki

BTW, when we switched to the new version, the problem disappeared for us. For at least 2 months we haven't encountered a single error. If possible, try to upgrade:

helm_oci: enabled
...
ext_helm:
  repository: oci://ghcr.io/bank-vaults/helm-charts/vault-secrets-webhook
  version: 1.21.2
...
replicaCount: 1
metrics:
  enabled: true
podsFailurePolicy: "Fail"
secretsFailurePolicy: "Fail"
env:
  VAULT_CLIENT_TIMEOUT: "120s"
...

Oriolemon avatar Aug 23 '24 14:08 Oriolemon

The release cycle is done. Could you please verify whether the issue is still present?

csatib02 avatar Sep 29 '24 09:09 csatib02