
Migrate from Kubernetes External Secrets to ~External Secrets Operator~ CSI Driver

Open chaodaiG opened this issue 3 years ago • 29 comments

What would you like to be added:

  • Switch Kubernetes External Secrets defined under https://github.com/kubernetes/test-infra/tree/master/config/prow/cluster to External Secrets Operator, by following https://github.com/external-secrets/kubernetes-external-secrets/issues/864
  • Update instruction at https://github.com/kubernetes/test-infra/blob/master/prow/prow_secrets.md
  • Announce at https://github.com/kubernetes/test-infra/blob/master/prow/ANNOUNCEMENTS.md
  • Announce at [email protected]

Why is this needed:

As announced at https://github.com/external-secrets/kubernetes-external-secrets/issues/864, Kubernetes External Secrets is now in maintenance mode; the new recommendation is to migrate to External Secrets Operator.

There is no plan yet to turn down Kubernetes External Secrets, so we should be fine for a while, until it either becomes incompatible with upcoming Kubernetes versions or newer features/bug fixes are only available in External Secrets Operator.

chaodaiG avatar Jan 13 '22 18:01 chaodaiG

/sig testing

chaodaiG avatar Jan 13 '22 18:01 chaodaiG

Any thoughts on SealedSecret as an alternative? Seems more gitops friendly

howardjohn avatar Feb 22 '22 04:02 howardjohn

Any thoughts on SealedSecret as an alternative? Seems more gitops friendly

I can see that https://github.com/bitnami-labs/sealed-secrets is similar to KES (Kubernetes External Secrets) in that it generates Kubernetes secrets from a more secure custom resource, but that is not the only purpose of KES.

KES was originally introduced to solve these problems:

  • Kubernetes secrets were manually applied to the cluster with kubectl apply from dev machine(s)
  • A secret could be lost if someone accidentally updated/deleted its value, or if the cluster itself was accidentally deleted

Since KES syncs secrets from major secret manager providers into the Kubernetes cluster, recovering a Kubernetes secret is as simple as re-applying the ExternalSecret CR to the cluster, for example https://github.com/kubernetes/test-infra/blob/d075174d2b9bcbe5aac9391ff306426963d2a37d/config/prow/cluster/kubernetes_external_secrets.yaml#L4

In short, SealedSecret is probably not the best replacement for KES.
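For context, a minimal KES ExternalSecret CR syncing from GCP Secret Manager looks roughly like this (a sketch; the secret, project, and key names are hypothetical, not the actual test-infra config):

```yaml
# Hypothetical KES custom resource that syncs a GCP Secret Manager secret
# into a native Kubernetes Secret of the same name.
apiVersion: kubernetes-client.io/v1
kind: ExternalSecret
metadata:
  name: my-prow-token        # name of the resulting Kubernetes Secret
  namespace: default
spec:
  backendType: gcpSecretsManager
  projectId: my-gcp-project  # GCP project hosting the secret
  data:
    - key: my-prow-token     # secret name in GCP Secret Manager
      name: token            # key inside the generated Kubernetes Secret
      version: latest
```

Because the CR itself contains no secret material, it can live in git; re-applying it after a cluster rebuild restores the Kubernetes Secret from Secret Manager.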

chaodaiG avatar Feb 22 '22 15:02 chaodaiG

My thinking was that an external secret is really only marginally different from a developer doing kubectl apply. Now they are just doing gcloud secrets create, which seems just as opaque as apply? With SealedSecret the entire state lives in git


howardjohn avatar Feb 22 '22 15:02 howardjohn

Agreed that both need a manual operation, either kubectl apply or gcloud secrets create, and the gitops side is pretty similar: one is a SealedSecret CR and the other is an ExternalSecret CR, and both can live in git. However, SealedSecret cannot solve the problem of a user accidentally modifying the secret at the source (the last-applied configuration in k8s somewhat helps here, but cannot recover from a kubectl delete of the SealedSecret, or from the cluster itself being accidentally deleted). Using KES reduces that risk because GCP Secret Manager version-controls secrets, so:

  • if someone accidentally changed the value in GCP the secrets values can still be recovered
  • if the cluster was accidentally deleted, secrets can still be recovered by applying git source controlled KES CR

chaodaiG avatar Feb 22 '22 15:02 chaodaiG

I feel like you could say the same about SealedSecret though...

  • if someone accidentally changed the value in ~GCP~K8s the secrets values can still be recovered (from git)
  • if the cluster was accidentally deleted, secrets can still be recovered by applying git source controlled ~KES~SealedSecret CR

Except for "cluster deleted" I guess you would need to keep the sealed secret keys somewhere (to decrypt if the cluster is deleted), so at some point you need to bootstrap...

Anyhow I have no strong agenda either way, just wanted to throw the idea out there

howardjohn avatar Feb 22 '22 16:02 howardjohn

Thank you @howardjohn , this is really great discussion!

I think I had misunderstood SealedSecret to a certain extent; with your explanation it's clearer now. So SealedSecret:

  • Stores the encrypted secret openly ("plain text") in the SealedSecret CR
  • The private key for decrypting the secrets is only available in the k8s cluster
  • A secret can only be created by a user running kubeseal, which uses the public key from the k8s cluster

This sounds pretty good, and other than the cluster-deletion scenario it seems pretty reliable. One thing not super clear from the documentation: when a user runs kubeseal <mysecret.json >mysealedsecret.json as in https://github.com/bitnami-labs/sealed-secrets#usage, does kubeseal need to fetch the public key from the cluster? Do you happen to know, @howardjohn?

chaodaiG avatar Feb 22 '22 16:02 chaodaiG

@chaodaiG I think you can fetch the pubkey once and store it in git. Then the dev experience to add or update a secret would be kubeseal --cert pubkey.crt < mysecret.json > mysealedsecret.json. Then a postsubmit job kubectl applys it to the cluster; the dev never needs access to the cluster.

But one concern is that it says the sealing key expires after 30d... so that may not work. I don't have much practical experience with sealed secrets, so not 100% sure
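The workflow described above would look roughly like this (a sketch; the file names are made up, and it assumes a sealed-secrets controller is already running in the cluster):

```shell
# One-time: fetch the controller's public sealing certificate and commit it to git.
kubeseal --fetch-cert > pubkey.crt

# Dev workflow: seal a secret offline using the committed cert (no cluster access needed).
kubeseal --cert pubkey.crt --format yaml < mysecret.yaml > mysealedsecret.yaml

# Postsubmit job: apply the sealed secret; the in-cluster controller decrypts it
# into a regular Kubernetes Secret.
kubectl apply -f mysealedsecret.yaml
```

The key-rotation caveat applies here: by default the controller renews its sealing key periodically, so a cert committed to git can go stale and would need refreshing.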

howardjohn avatar Feb 22 '22 16:02 howardjohn

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 23 '22 17:05 k8s-triage-robot

https://kubernetes.slack.com/archives/C09QZ4DQB/p1654433983124889 is one of the reasons why this should be prioritized. TLDR: syncing build cluster tokens into Prow is now a crucial piece of Prow working with build clusters; KES flakiness would break this and cause Prow to stop working with the build cluster

/remove-lifecycle stale

chaodaiG avatar Jun 06 '22 14:06 chaodaiG

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 04 '22 15:09 k8s-triage-robot

/remove-lifecycle stale

chaodaiG avatar Sep 04 '22 15:09 chaodaiG

The build cluster token sync failure happened again: https://kubernetes.slack.com/archives/C7J9RP96G/p1667877096344719. This is not good.

/assign

chaodaiG avatar Nov 08 '22 15:11 chaodaiG

uh-oh. thanks @chaodaiG

dims avatar Nov 08 '22 17:11 dims

I don't want to derail / delay efforts going on in #27932, but has something like https://secrets-store-csi-driver.sigs.k8s.io/ been considered? We could use that with the GCP provider today and there's support for AWS, Azure, and Vault providers if we need to change.

Using the CSI driver + Google Secret Manager (provider) would allow us to leverage Workload Identity for IAM secret access. I believe we'd also have better insight into access/auditing.
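The Workload Identity wiring would be along these lines (a sketch; the project, service account, namespace, and secret names are all made up for illustration):

```shell
# Allow the GCP SA to read one specific secret (least privilege).
gcloud secrets add-iam-policy-binding my-secret \
  --project=my-gcp-project \
  --member="serviceAccount:prow-secrets@my-gcp-project.iam.gserviceaccount.com" \
  --role="roles/secretmanager.secretAccessor"

# Bind the Kubernetes SA (namespace/name) to the GCP SA via Workload Identity,
# so pods running as that KSA can authenticate as the GCP SA.
gcloud iam service-accounts add-iam-policy-binding \
  prow-secrets@my-gcp-project.iam.gserviceaccount.com \
  --role="roles/iam.workloadIdentityUser" \
  --member="serviceAccount:my-gcp-project.svc.id.goog[test-pods/prow-job-sa]"
```

Access is then auditable per secret and per service account in Cloud Audit Logs.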

I know GCP costs are a concern; the pricing page indicates $0.06 per secret per location and $0.03 per 10,000 access operations (and $0.05 per rotation). I don't think the costs would be astronomical, but it would be worth a closer look if we decide to pivot to this solution.

I'm happy to demo / help move forward if we want to go that direction, however I understand the urgency and value the progress already made.

Edit: It looks like External Secrets Operator would also let us use Secret Manager + WID if we'd like: https://external-secrets.io/v0.6.1/provider/google-secrets-manager/. I think it comes down to whichever solution is easier to maintain and more active (future-proof-ish?).

jimangel avatar Nov 09 '22 17:11 jimangel

hi @jimangel, it's not a derail at all. IIRC the CSI driver for GCP was in its very early release cycle when we decided to adopt Kubernetes External Secrets. The proposal to transition from Kubernetes External Secrets to External Secrets Operator was pretty much a lazy action based on the recommendation from Kubernetes External Secrets.

In terms of cost, we don't have that many secrets or that many access operations, so I wouldn't be too worried about it.

I would be glad to take another look at the CSI driver for GCP since it's ready now. I will do a quick evaluation myself from an operational and maintenance perspective, and will get your thoughts if any questions come up.

chaodaiG avatar Nov 09 '22 17:11 chaodaiG

Had an extensive and wonderful offline discussion with @jimangel , and here is what we agreed on:

  • External Secrets Operator works as a central proxy service. It uses a dedicated k8s cluster SA that is bound via Workload Identity to a GCP SA; this GCP SA is given Secret Manager permissions for all secrets used in the k8s cluster. These secrets are synced one way into the k8s cluster, and all pods in the cluster can use any of these secrets as long as they are in the same namespace.
  • The CSI driver works by using the authentication of the pods that need to mount the secrets. For GCP this means the Workload Identity-bound cluster SA on the pod is used to authenticate with Secret Manager.
  • Technically speaking, the CSI driver is more secure than External Secrets Operator, since a prowjob pod can only use the secrets its SA is allowed to access (we don't yet have fine-grained separation of different teams using different SAs, so this is more of a future-proofing benefit).
  • Beyond security boundaries, one benefit of the CSI driver is that it avoids granting a GCP SA from the Prow service cluster Secret Manager permissions in other projects; as a result, migration or recovery is much easier (no IAM changes are required in user projects).
  • One "downside" is that for GCP it only supports Workload Identity for authentication, so jobs that are not using Workload Identity will not be able to use this feature.
  • The other "downside/WAI" is that a pod will fail to start when a secret is not available. This is expected for a prowjob, and imo is even better than failing due to stale secrets synced 7 days ago. For Prow services we will need to make sure that all kubeconfig secrets are stored in the GCP project where Prow lives, to avoid a user-provided kubeconfig secret being deleted in GCP and causing Prow downtime.
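To make the comparison concrete, the CSI-driver setup would involve a namespace-scoped SecretProviderClass describing which Secret Manager secrets a pod may mount, roughly like this (a sketch following the secrets-store CSI driver's GCP provider conventions; all names are hypothetical):

```yaml
# Hypothetical SecretProviderClass: maps GCP Secret Manager secrets to files
# that the CSI driver will mount into consuming pods in this namespace.
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: prow-secrets
  namespace: test-pods
spec:
  provider: gcp
  parameters:
    secrets: |
      - resourceName: "projects/my-gcp-project/secrets/my-prow-token/versions/latest"
        path: "token"   # file name under the volume's mount path in the pod
```

Access is enforced at mount time: the pod's Workload Identity-bound SA must hold secretAccessor on the referenced secret, or the pod fails to start.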

With all that said, I'm convinced the CSI driver is better suited for our use case. Kudos to @jimangel; thank you so much for the discussion, I feel I learned a lot!

@BenTheElder @spiffxp @dims @ameukam @cjwagner , WDYT?

chaodaiG avatar Nov 10 '22 20:11 chaodaiG

Awesome write up @chaodaiG! Agreed, it was fun chatting.

One "downside" is that for GCP it only supports Workload Identity for authentication, so jobs that are not using Workload Identity will not be able to use this feature

There are alternatives for authentication outlined here: https://github.com/GoogleCloudPlatform/secrets-store-csi-driver-provider-gcp/blob/main/docs/authentication.md but the general consensus is to use WI if at all possible.

jimangel avatar Nov 10 '22 20:11 jimangel

@chaodaiG @jimangel Nice! +100

dims avatar Nov 10 '22 20:11 dims

Sounds like a nice improvement to me!

cjwagner avatar Nov 10 '22 22:11 cjwagner

@chaodaiG @jimangel Nice idea!

Let's try it.

ameukam avatar Nov 11 '22 00:11 ameukam

@jimangel So if the secret is mounted as a volume in the pod, how is this isolated from other pods running on the same node?

ameukam avatar Nov 11 '22 00:11 ameukam

So if the secret is mounted as a volume in the pod, how this is isolated from the other pods running in the same node?

I believe the threat model is the same as before (or more secure). Access today is segmented by namespace (k8s "built-in" secrets). With the CSI driver, access is only permitted when all of these conditions are met:

  1. A namespace-scoped SecretProviderClass (CRD) defining access exists (this directs the mount to the appropriate GCP project/secret).
  2. GCP IAM bindings exist for a Service Account / Workload Identity in GCP to access the specific secret resource(s).

NOTE: Any workload/job/pod in a shared k8s namespace could use the same service account to access the permitted secret(s)/SecretProviderClass. That is no different from any pod in the same namespace accessing the same secret today.

As far as what access pods on the same node have (isolation): if any pod/actor can escape a pod to access node-level storage layers, you are as screwed as you'd be if you were using Kubernetes "built-in" secrets. 😅

Let me know if I misunderstood what you're asking @ameukam!

Edit: There are a couple "security considerations" called out in the repo itself.
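For completeness, a pod would consume the mount via its Workload Identity-bound service account and a CSI volume, roughly like this (a sketch; the names match the hypothetical examples above and are not real config):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo
  namespace: test-pods
spec:
  serviceAccountName: prow-job-sa   # WI-bound KSA with secretAccessor rights
  containers:
    - name: main
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: secrets
          mountPath: /etc/secrets   # secret appears as /etc/secrets/token
          readOnly: true
  volumes:
    - name: secrets
      csi:
        driver: secrets-store.csi.k8s.io
        readOnly: true
        volumeAttributes:
          secretProviderClass: prow-secrets
```

Both conditions above must hold (the SecretProviderClass exists in the namespace, and the KSA's GCP identity has IAM access), otherwise the volume mount, and therefore the pod, fails to start.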

jimangel avatar Nov 11 '22 02:11 jimangel

Hey all! Checking in here, what would be the next steps @chaodaiG? Should we try a small-scale demo or is there somewhere to test?

jimangel avatar Jan 30 '23 16:01 jimangel

@cjwagner could you please take a look

chaodaiG avatar Jan 30 '23 17:01 chaodaiG

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 30 '23 18:04 k8s-triage-robot

/help-wanted

michelle192837 avatar Sep 12 '23 23:09 michelle192837

This is open for contribution if anyone's willing to do so. (We do keep seeing infrequent errors or flakes that require KES to be restarted, so while it's not urgent, it'd be helpful!)

michelle192837 avatar Oct 03 '23 16:10 michelle192837

@michelle192837 Assuming this needs to be deployed on a Google-owned GKE cluster, one action would be to create an SA with Workload Identity so we can use it to retrieve secrets from Secret Manager. I think that can only be done by EngProd.

ameukam avatar Oct 03 '23 16:10 ameukam