
flakiness in various CI jobs - error "invalid_grant: Invalid JWT Signature" in gcloud CLI during auth login

Open • dims opened this issue on Aug 17, 2022 • 24 comments

Example log: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/111859/pull-kubernetes-e2e-gce-storage-slow/1559899720485179392/build-log.txt

This is happening a lot across a variety of CI jobs. See chatter on #testing-ops as well ( https://kubernetes.slack.com/archives/C7J9RP96G/p1660676173294389 )

I0817 13:47:34.328] Call:  gcloud auth activate-service-account --key-file=/etc/service-account/service-account.json
W0817 13:47:34.969] ERROR: (gcloud.auth.activate-service-account) There was a problem refreshing your current auth tokens: ('invalid_grant: Invalid JWT Signature.', {'error': 'invalid_grant', 'error_description': 'Invalid JWT Signature.'})
W0817 13:47:34.969] Please run:
W0817 13:47:34.969] 
W0817 13:47:34.969]   $ gcloud auth login
W0817 13:47:34.969] 
W0817 13:47:34.970] to obtain new credentials.
W0817 13:47:34.970] 
W0817 13:47:34.970] If you have already logged in with a different account:
W0817 13:47:34.970] 
W0817 13:47:34.970]     $ gcloud config set account ACCOUNT
W0817 13:47:34.970] 
W0817 13:47:34.970] to select an already authenticated account to use.
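
One way to triage a failure like this is to compare the private_key_id of the mounted key against the keys currently live on the service account; if the mounted key's ID is no longer listed, every token grant fails exactly like the log above. A minimal sketch, assuming jq, gcloud access, and SA_EMAIL as a placeholder for the account:

# ID of the key the job is mounting
jq -r .private_key_id /etc/service-account/service-account.json

# keys Google currently accepts for the account; if the ID above is absent, the grant fails
gcloud iam service-accounts keys list --iam-account="$SA_EMAIL"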

dims avatar Aug 17 '22 15:08 dims

@dims: There are no sig labels on this issue. Please add an appropriate label by using one of the following commands:

  • /sig <group-name>
  • /wg <group-name>
  • /committee <group-name>

Please see the group list for a listing of the SIGs, working groups, and committees available.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Aug 17 '22 15:08 k8s-ci-robot

cc @hakman @tobiasgiese @bobbypage @chaodaiG @BenTheElder

dims avatar Aug 17 '22 15:08 dims

Looks like when this happened way back (https://github.com/kubernetes/test-infra/issues/9373), @fejta had to replace the service account.

dims avatar Aug 17 '22 15:08 dims

W0817 13:47:34.325] **************************************************************************
bootstrap.py is deprecated!
test-infra oncall does not support any job still using bootstrap.py.
Please migrate your job to podutils!
https://github.com/kubernetes/test-infra/blob/master/prow/pod-utilities.md
**************************************************************************

bootstrap.py was not equipped to work with workload identity and has long been deprecated; these jobs should have been migrated to pod utilities + workload identity. Instructions (a rough sketch of the binding step follows the links):

  • https://github.com/kubernetes/test-infra/tree/master/workload-identity#migrate-prow-job-to-use-workload-identity
  • https://gist.github.com/dims/c1296f8ed42238baea0a5fcae45f4cf4 from @dims
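
For orientation, the heart of that migration is binding the in-cluster Kubernetes service account to the GCP service account. A minimal sketch, assuming placeholder names PROJECT and SA, and a prowjob-sa service account in the test-pods namespace:

# let the Kubernetes SA impersonate the GCP SA via workload identity
gcloud iam service-accounts add-iam-policy-binding "SA@PROJECT.iam.gserviceaccount.com" \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:PROJECT.svc.id.goog[test-pods/prowjob-sa]"

# annotate the Kubernetes SA so GKE injects credentials for the GCP SA
kubectl annotate serviceaccount prowjob-sa -n test-pods \
  iam.gke.io/gcp-service-account=SA@PROJECT.iam.gserviceaccount.com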

chaodaiG avatar Aug 17 '22 17:08 chaodaiG

@chaodaiG it happens in non-bootstrap.py jobs too, example: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/containerd_containerd/7304/pull-containerd-build/1559903840889737216/build-log.txt

dims avatar Aug 17 '22 17:08 dims

'[' -z /etc/service-account/service-account.json ']'
++ gcloud auth activate-service-account --key-file /etc/service-account/service-account.json --project=k8s-cri-containerd

removing the preset-service-account label from the job should fix this
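
To gauge how many jobs opt in, something like this against a test-infra checkout may work (assuming the preset is applied via the usual label form, which is an assumption here):

# hypothetical quick count of prowjobs carrying the preset label
grep -rl 'preset-service-account: "true"' config/jobs/ | wc -l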

chaodaiG avatar Aug 17 '22 17:08 chaodaiG

@chaodaiG looks like there are tons of these jobs with that preset - https://cs.k8s.io/?q=preset-service-account&i=nope&files=&excludeFiles=&repos=kubernetes/test-infra

Let me start with just the ones in k8s-cri-containerd project used by containerd.

dims avatar Aug 17 '22 17:08 dims

looks like there are tons of these jobs with that preset - https://cs.k8s.io/?q=preset-service-account&i=nope&files=&excludeFiles=&repos=kubernetes/test-infra

this is not surprising. Having someone remember to manually rotate this key every 80 days doesn't seem like a sustainable solution, so at this point I'm very curious whether there is any job that has no choice but to use this physical service account key file.

The second goal is to figure out whether all these jobs are still maintained.

chaodaiG avatar Aug 17 '22 18:08 chaodaiG

@chaodaiG #27161 didn't help :( made the problem worse - https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/containerd_containerd/7304/pull-containerd-build/1559968071509086208/build-log.txt

reverting now

dims avatar Aug 17 '22 18:08 dims

I don't think I have access to this infra anymore (different team now / different internal group memberships), and neither does fejta (different company).

The bootstrap => decorated migration should really happen, but it's a pretty large lift; it might be automatable, but I'm not sure anyone here has the bandwidth.

Thanks @chaodaiG @dims.

BenTheElder avatar Aug 17 '22 18:08 BenTheElder

/reopen

dims avatar Aug 17 '22 18:08 dims

@dims: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Aug 17 '22 18:08 k8s-ci-robot

@dims, I think the bash bug is fixed above; could you try re-applying the prowjob PR?

chaodaiG avatar Aug 17 '22 19:08 chaodaiG

Cross-posting here: [screenshot of the Slack thread, 2022-08-17 12:21 PM]

link: https://kubernetes.slack.com/archives/C7J9RP96G/p1660763919844109?thread_ts=1660758182.628529&cid=C7J9RP96G

chaodaiG avatar Aug 17 '22 19:08 chaodaiG

Documenting the GCP service account key info:

  • The key belongs to [email protected]; its private key ID starts with "529d" and ends with "8a47".
  • So far the key is known to be stored in the 'k8s-prow' and 'k8s-prow-builds' clusters, as a secret named service-account in the test-pods namespace, with data service-account.json: <BASE64-ENCODED-KEY>.

I'll rotate it in these two places for now. Please update this issue if the key is also used somewhere else.
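
To check whether some other cluster holds the same key, a sketch that reads the secret described above and prints its key ID (assumes jq):

kubectl --context=k8s-prow -ntest-pods get secret service-account \
  -o jsonpath='{.data.service-account\.json}' | base64 -d | jq -r .private_key_id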

chaodaiG avatar Aug 17 '22 19:08 chaodaiG

We are also experiencing this issue with the apidiff test on Cluster API Provider Azure as of yesterday. Here is the build log for reference.

willie-yao avatar Aug 17 '22 20:08 willie-yao

manually rotated the secret:

  1. Created a new key for [email protected] in UI
  2. Run k --context=k8s-prow-builds -ntest-pods create secret generic service-account --from-file=service-account.json=<DOWNLOADED_JSON_PATH> -oyaml --dry-run=client | k --context=k8s-prow-builds -ntest-pods apply -f -
  3. Run k --context=k8s-prow -ntest-pods create secret generic service-account --from-file=service-account.json=<DOWNLOADED_JSON_PATH> -oyaml --dry-run=client | k --context=k8s-prow -ntest-pods apply -f -
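
Once jobs are confirmed to be picking up the new key, the stale key can be deleted; a sketch with SA_EMAIL and KEY_ID as placeholders for the redacted account and the old key's ID:

# list keys, then delete the retired one
gcloud iam service-accounts keys list --iam-account="$SA_EMAIL"
gcloud iam service-accounts keys delete KEY_ID --iam-account="$SA_EMAIL"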

chaodaiG avatar Aug 17 '22 21:08 chaodaiG

thanks @chaodaiG, please see https://github.com/kubernetes/test-infra/pull/27169 for the reverts

dims avatar Aug 17 '22 21:08 dims

thank you @dims for playing with me all day long :)

chaodaiG avatar Aug 17 '22 21:08 chaodaiG

shadowing what you were doing was good experience @chaodaiG !! appreciate it.

dims avatar Aug 17 '22 21:08 dims

@chaodaiG

I'm very curious to understand whether there is any job that has no choice but use this physical service account key file.

https://cs.k8s.io/?q=E2E_GOOGLE_APPLICATION_CREDENTIALS&i=nope&files=&excludeFiles=&repos=

IIRC there are some number of e2e jobs that need to provide a service account key to a gce pd driver deployed to the cluster under test. The clusters these jobs stand up aren't guaranteed to be GKE clusters, so I'm not sure changing the gce pd driver deployment to use workload identity is an option.

From https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/docs/kubernetes/user-guides/driver-install.md#install-driver:

The driver requires a service account that has the following permissions and roles to function properly:

compute.instances.get
compute.instances.attachDisk
compute.instances.detachDisk
roles/compute.storageAdmin
roles/iam.serviceAccountUser

Replacing use of a shared service account key would involve jobs having to run something like the driver's setup-project.sh script prior to launching tests, which means permission to create a service account and service account keys in each project. I think it's possible to provide jobs with this privilege via workload identity, but I forget if the churn/noise of key creation is the reason a shared account key was used in the first place.
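
For a sense of scale, that per-project provisioning would amount to something like the sketch below; pd-csi-test and PROJECT are placeholders, and this is not a transcription of what setup-project.sh actually does:

# create a dedicated SA in the test project
gcloud iam service-accounts create pd-csi-test --project="$PROJECT"
# grant one of the roles the driver needs (repeat per role)
gcloud projects add-iam-policy-binding "$PROJECT" \
  --member "serviceAccount:pd-csi-test@${PROJECT}.iam.gserviceaccount.com" \
  --role roles/compute.storageAdmin
# mint a key for the driver to mount
gcloud iam service-accounts keys create key.json \
  --iam-account "pd-csi-test@${PROJECT}.iam.gserviceaccount.com"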

cc @msau42 who I think is more familiar with this than I am

spiffxp avatar Aug 19 '22 18:08 spiffxp

@spiffxp, that's good to know, thanks! My feeling is that we'll probably need to rotate the key for a while until the CSI driver team figures out a way to use something like workload identity.

Created https://github.com/kubernetes/test-infra/pull/27202 as a first step toward easier key rotation. Once it is merged, rotating the secret will become:

  • Create a new key
  • Upload it to GCP Secret Manager

chaodaiG avatar Aug 19 '22 23:08 chaodaiG

cc @mattcary

msau42 avatar Aug 22 '22 18:08 msau42

Sorry, I'm not following the suggested solution. These keys are for running tests in k8s-on-gce, so there is no workload identity.

Since this is testing, a workaround is to give all nodes in the cluster the cloud-platform scope and run them as a service account with the IAM roles @spiffxp mentioned above. We use this internally, as we've locked down key downloads for Google devs.

Would that be reasonable? Note this means that any pod running in such a cluster can create/delete disks, etc. Since it's a testing cluster it's probably ok.

I think this may be some amount of work, depending on whether kubetest2 has plumbing for node scopes & service accounts during cluster-up.
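
If the clusters come up via kube-up's GCE path, the change might be as small as exporting broader node scopes before cluster-up; an untested sketch, assuming NODE_SCOPES is still the relevant kube-up GCE config variable:

# untested: give all nodes the broad cloud-platform OAuth scope
export NODE_SCOPES="https://www.googleapis.com/auth/cloud-platform"
./cluster/kube-up.sh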

mattcary avatar Aug 30 '22 23:08 mattcary

This just started happening again on 2022-11-16 - https://storage.googleapis.com/k8s-triage/index.html?pr=1&text=error%20during%20gcloud%20auth%20activate-service-account

Looks like it is failing ~20% of https://testgrid.k8s.io/google-gce#gce-containerd&width=20 runs

liggitt avatar Nov 17 '22 16:11 liggitt

Are particular nodes hitting the issue? Looks like all the jobs in https://testgrid.k8s.io/google-gce#gce-containerd&width=20 are running on the gke-prow-e2-default-pool-bdc23de7 node pool ... did that node pool change configuration / version / etc.?

liggitt avatar Nov 17 '22 16:11 liggitt

oh, looks like the credential just expired and needs rotating (xref https://github.com/kubernetes/test-infra/issues/27157#issuecomment-1218338365)

liggitt avatar Nov 17 '22 17:11 liggitt

As mentioned last time, the secret rotation is a little less risky now. The steps (sketched as commands after the list):

  1. Create a new json key for [email protected]
  2. Create a new version of GCP secret default-k8s-build-cluster-service-account-key in k8s-prow-builds project, the value is the json key content from step 1
  3. Wait a few seconds and the key is rotated
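
For reference, steps 1 and 2 as commands; a sketch assuming gcloud access to both projects, with SA_EMAIL standing in for the redacted account:

# step 1: mint a new key
gcloud iam service-accounts keys create key.json --iam-account="$SA_EMAIL"
# step 2: push it as a new version of the Secret Manager secret
gcloud secrets versions add default-k8s-build-cluster-service-account-key \
  --project=k8s-prow-builds --data-file=key.json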

chaodaiG avatar Nov 17 '22 17:11 chaodaiG

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 15 '23 18:02 k8s-triage-robot

/remove-lifecycle stale

We're going to have this problem on a regular basis until we can migrate CI out of google.com, which won't be anytime this year given the kubernetes.io budget issues.

This appears to be happening again.

See: https://github.com/kubernetes/test-infra/issues/27157#issuecomment-1220982143 for why moving to podutils / workload identity isn't a workable answer.

[...] but I forget if the churn/noise of key creation is the reason a shared account key was used in the first place.

Yes, that's the driving reason. Creating a lot of keys was causing issues, e.g. it meant the driver tests were attempting to clean up keys, and a bug caused the main CI key to be deleted, which was a fun day 🙃

https://github.com/kubernetes/test-infra/issues/27157#issuecomment-1318950082 has the hotfix approach, for someone with access.

BenTheElder avatar Feb 17 '23 07:02 BenTheElder

Maybe we should just bring up clusters with properly scoped access on all nodes.

The issue is not adding some new special permission in order to get the tests to run. The test is already running with sufficient permissions to create disks; it's creating a cluster, after all.

The issue is just plumbing that permission through the k8s layer, which involves this sketchy SA key stuff.

Maybe we should remove the need for the SA key stuff and just give all nodes in the test cluster the permissive scope. Would that be easier in the long term?

mattcary avatar Feb 17 '23 18:02 mattcary