flakiness in various CI jobs - error "invalid_grant: Invalid JWT Signature" in gcloud CLI during auth login
Example log: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/111859/pull-kubernetes-e2e-gce-storage-slow/1559899720485179392/build-log.txt
This is happening a lot across a variety of CI jobs. See chatter on #testing-ops as well ( https://kubernetes.slack.com/archives/C7J9RP96G/p1660676173294389 )
I0817 13:47:34.328] Call: gcloud auth activate-service-account --key-file=/etc/service-account/service-account.json
W0817 13:47:34.969] ERROR: (gcloud.auth.activate-service-account) There was a problem refreshing your current auth tokens: ('invalid_grant: Invalid JWT Signature.', {'error': 'invalid_grant', 'error_description': 'Invalid JWT Signature.'})
W0817 13:47:34.969] Please run:
W0817 13:47:34.969]
W0817 13:47:34.969] $ gcloud auth login
W0817 13:47:34.969]
W0817 13:47:34.970] to obtain new credentials.
W0817 13:47:34.970]
W0817 13:47:34.970] If you have already logged in with a different account:
W0817 13:47:34.970]
W0817 13:47:34.970] $ gcloud config set account ACCOUNT
W0817 13:47:34.970]
W0817 13:47:34.970] to select an already authenticated account to use.
@dims: There are no sig labels on this issue. Please add an appropriate label by using one of the following commands:
- /sig <group-name>
- /wg <group-name>
- /committee <group-name>
Please see the group list for a listing of the SIGs, working groups, and committees available.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
cc @hakman @tobiasgiese @bobbypage @chaodaiG @BenTheElder
Looks like when this happened way back (https://github.com/kubernetes/test-infra/issues/9373), @fejta had to replace the service account.
W0817 13:47:34.325] **************************************************************************
bootstrap.py is deprecated!
test-infra oncall does not support any job still using bootstrap.py.
Please migrate your job to podutils!
https://github.com/kubernetes/test-infra/blob/master/prow/pod-utilities.md
**************************************************************************
bootstrap.py was not equipped to work with workload identity and has long been deprecated; jobs should have been migrated to use pod utilities + workload identity. Instructions:
- https://github.com/kubernetes/test-infra/tree/master/workload-identity#migrate-prow-job-to-use-workload-identity
- https://gist.github.com/dims/c1296f8ed42238baea0a5fcae45f4cf4 from @dims
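As a quick sanity check after migrating a job, something like the following inside the pod should confirm workload identity is working (a hedged sketch, not from the migration docs; with workload identity, gcloud obtains credentials from the metadata server and no key file is involved):

```bash
# Inside a migrated (pod-utilities + workload identity) job pod:
# gcloud should see the bound service account via the metadata server.
gcloud auth list
# If a token can be minted without a key file, workload identity is working.
gcloud auth print-access-token >/dev/null && echo "workload identity OK"
```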
@chaodaiG it happens in non-bootstrap.py jobs too, example: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/containerd_containerd/7304/pull-containerd-build/1559903840889737216/build-log.txt
'[' -z /etc/service-account/service-account.json ']'
++ gcloud auth activate-service-account --key-file /etc/service-account/service-account.json --project=k8s-cri-containerd
Removing the preset-service-account label from the job should fix this.
@chaodaiG looks like there are tons of these jobs with that preset - https://cs.k8s.io/?q=preset-service-account&i=nope&files=&excludeFiles=&repos=kubernetes/test-infra
Let me start with just the ones in the k8s-cri-containerd project used by containerd.
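For a rough local count, a grep equivalent of that cs.k8s.io query (a sketch; assumes a kubernetes/test-infra checkout and that the preset label is set to "true" in job configs):

```bash
# Count job config files that opt into the service-account preset.
grep -rl 'preset-service-account: "true"' config/jobs/ | wc -l
```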
> looks like there are tons of these jobs with that preset - https://cs.k8s.io/?q=preset-service-account&i=nope&files=&excludeFiles=&repos=kubernetes/test-infra
this is not surprising. Relying on someone to remember to manually rotate this every 80 days doesn't seem like a sustainable solution, so at this point I'm very curious to understand whether there is any job that has no choice but to use this physical service account key file.
The second goal is to figure out whether all these jobs are still maintained.
@chaodaiG #27161 didn't help :( made the problem worse - https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/containerd_containerd/7304/pull-containerd-build/1559968071509086208/build-log.txt
reverting now
I don't think I have access to this infra anymore (different team now / different internal group memberships), and neither does fejta (different company).
bootstrap => decorated should really happen, but it's a pretty large lift; it might be automatable, but I'm not sure anyone here has the bandwidth.
Thanks @chaodaiG @dims.
/reopen
@dims: Reopened this issue.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@dims, I think the bash bug is fixed above; could you try re-applying the prowjob PR?
Cross-posting here:

link: https://kubernetes.slack.com/archives/C7J9RP96G/p1660763919844109?thread_ts=1660758182.628529&cid=C7J9RP96G
Documenting the GCP service account key info:
- The key belongs to [email protected], and private key id starts with "529d" and ends with "8a47".
- So far the key is known to be stored in the `k8s-prow` and `k8s-prow-builds` clusters, as a secret named `service-account`; the data is `service-account.json: <BASE64-ENCODED-KEY>` under the `test-pods` namespace.
I'll rotate at these two places for now. Please update on this issue if the key is also used somewhere else.
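For anyone checking which key a cluster currently holds, a hedged sketch (assumes kubectl access to the contexts above; prints the key's private_key_id, which should match the "529d"..."8a47" fingerprint):

```bash
# Decode the mounted secret and print its private_key_id.
kubectl --context=k8s-prow -n test-pods get secret service-account \
  -o jsonpath='{.data.service-account\.json}' | base64 -d \
  | python3 -c 'import json,sys; print(json.load(sys.stdin)["private_key_id"])'
```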
We are also experiencing this issue with the apidiff test on Cluster API Provider Azure as of yesterday. Here is the build log for reference.
manually rotated the secret:
- Created a new key for [email protected] in UI
- Run `k --context=k8s-prow-builds -ntest-pods create secret generic service-account --from-file=service-account.json=<DOWNLOADED_JSON_PATH> -oyaml --dry-run=client | k --context=k8s-prow-builds -ntest-pods apply -f -`
- Run `k --context=k8s-prow -ntest-pods create secret generic service-account --from-file=service-account.json=<DOWNLOADED_JSON_PATH> -oyaml --dry-run=client | k --context=k8s-prow -ntest-pods apply -f -`
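The same rotation expressed as a loop over both contexts (a sketch; `kubectl` spelled out instead of the `k` alias, `<DOWNLOADED_JSON_PATH>` is the key file downloaded from the UI, as above):

```bash
# Apply the downloaded key to the service-account secret in both build clusters.
for ctx in k8s-prow-builds k8s-prow; do
  kubectl --context="${ctx}" -n test-pods create secret generic service-account \
    --from-file=service-account.json=<DOWNLOADED_JSON_PATH> -o yaml --dry-run=client \
    | kubectl --context="${ctx}" -n test-pods apply -f -
done
```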
thanks @chaodaiG, please see https://github.com/kubernetes/test-infra/pull/27169 for the reverts
thank you @dims for playing with me all day long :)
shadowing what you were doing was good experience @chaodaiG !! appreciate it.
@chaodaiG
> I'm very curious to understand whether there is any job that has no choice but use this physical service account key file.
https://cs.k8s.io/?q=E2E_GOOGLE_APPLICATION_CREDENTIALS&i=nope&files=&excludeFiles=&repos=
IIRC there are some number of e2e jobs that need to provide a service account key to a gce pd driver deployed to the cluster under test. The clusters these jobs stand up aren't guaranteed to be GKE clusters, so I'm not sure changing the gce pd driver deployment to use workload identity is an option.
From https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/docs/kubernetes/user-guides/driver-install.md#install-driver:
The driver requires a service account that has the following permissions and roles to function properly:
- compute.instances.get
- compute.instances.attachDisk
- compute.instances.detachDisk
- roles/compute.storageAdmin
- roles/iam.serviceAccountUser
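For illustration, granting the role-type entries above would look roughly like this (a hedged sketch; `PROJECT` and `SA_EMAIL` are hypothetical placeholders):

```bash
# Bind the storageAdmin and serviceAccountUser roles to the driver's SA.
gcloud projects add-iam-policy-binding "${PROJECT}" \
  --member="serviceAccount:${SA_EMAIL}" --role=roles/compute.storageAdmin
gcloud projects add-iam-policy-binding "${PROJECT}" \
  --member="serviceAccount:${SA_EMAIL}" --role=roles/iam.serviceAccountUser
```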
Replacing use of a shared service account key would involve jobs having to run something like the driver's setup-project.sh script prior to launching tests, which means permission to create a service account and service account keys in each project. I think it's possible to provide jobs with this privilege via workload identity, but I forget if the churn/noise of key creation is the reason a shared account key was used in the first place.
cc @msau42 who I think is more familiar with this than I am
@spiffxp, that's good to know, thanks! My feeling is that we'll probably need to keep rotating the key for a while until the CSI driver team figures out a way to use something like workload identity.
Created https://github.com/kubernetes/test-infra/pull/27202 as a first step toward easier key rotation. Once it is merged, rotating the secret will become:
- Create a new key
- Upload to GCS secret manager
cc @mattcary
Sorry, I'm not following the suggested solution. These keys are for running tests in k8s-on-gce, so there is no workload identity.
Since this is testing, a workaround is to give all nodes in the cluster the cloud-platform scope and run them as a service account with the roles @spiffxp mentioned above. We use this internally, as we've locked down key downloads for Google devs.
Would that be reasonable? Note this means that any pod running in such a cluster can create/delete disks, etc. Since it's a testing cluster it's probably ok.
I think this may be some amount of work, depending on whether kubetest2 has plumbing for node scopes & service accounts during cluster-up.
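At the GCE instance level the workaround amounts to something like this (a hedged sketch; the instance name and `SA_EMAIL` are placeholders, and the real plumbing would go through kubetest2 / cluster-up rather than direct gcloud calls):

```bash
# Create a node with the broad cloud-platform scope; IAM roles on the
# attached service account then control what the node can actually do.
gcloud compute instances create test-node-1 \
  --scopes=https://www.googleapis.com/auth/cloud-platform \
  --service-account="${SA_EMAIL}"
```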
This just started happening again on 2022-11-16 - https://storage.googleapis.com/k8s-triage/index.html?pr=1&text=error%20during%20gcloud%20auth%20activate-service-account
Looks like it is failing ~20% of https://testgrid.k8s.io/google-gce#gce-containerd&width=20 runs
Are particular nodes hitting the issue? Looks like all the jobs in https://testgrid.k8s.io/google-gce#gce-containerd&width=20 are running on the gke-prow-e2-default-pool-bdc23de7 node pool... did that node pool change configuration / version / etc.?
oh, looks like the credential just expired and needs rotating (xref https://github.com/kubernetes/test-infra/issues/27157#issuecomment-1218338365)
As mentioned last time, the secret rotation is a little less risky now. So steps:
- Create a new json key for [email protected]
- Create a new version of the GCP secret `default-k8s-build-cluster-service-account-key` in the `k8s-prow-builds` project; the value is the json key content from step 1
- Wait a few seconds and the key is rotated
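In gcloud terms the two steps look roughly like this (a sketch; `SA_EMAIL` stands in for the account named above):

```bash
# Step 1: mint a new json key for the service account.
gcloud iam service-accounts keys create /tmp/sa-key.json --iam-account="${SA_EMAIL}"
# Step 2: push it as a new version of the GCP secret; jobs pick it up shortly after.
gcloud secrets versions add default-k8s-build-cluster-service-account-key \
  --project=k8s-prow-builds --data-file=/tmp/sa-key.json
```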
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
We're going to have this problem on a regular basis until we can migrate CI out of google.com, which won't be anytime this year given the kubernetes.io budget issues.
This appears to be happening again.
See: https://github.com/kubernetes/test-infra/issues/27157#issuecomment-1220982143 for why moving to podutils / workload identity isn't a workable answer.
> [...] but I forget if the churn/noise of key creation is the reason a shared account key was used in the first place.
Yes, that's the driving reason. Creating a lot of keys was causing issues. E.g., it meant the driver tests were attempting to clean up keys, and a bug caused the main CI key to be deleted, which was a fun day 🙃
https://github.com/kubernetes/test-infra/issues/27157#issuecomment-1318950082 has the hotfix approach, for someone with access.
Maybe we should just bring up clusters with the proper scoped access on all nodes.
The issue is not adding some new special permission in order to get the tests to run. The test is already running with sufficient permissions to create disks; it's creating a cluster, after all.
The issue is just plumbing that permission through the k8s layer, which involves this sketchy SA key stuff.
Maybe we should remove the need for the SA key stuff and just give all nodes in the test cluster the permissive scope. Would that be easier in the long term?