google-auth-library-nodejs
google-auth-library-nodejs copied to clipboard
Sometimes failing to retrieve Application Default Credentials on GKE Autopilot in googleapis auth library
Mirror question here in Stack Overflow: https://stackoverflow.com/questions/68854640/sometimes-failing-to-retrieve-application-default-credentials-on-gke-autopilot-i
Some pods in my GKE Autopilot cluster aren't able to grab the Application Default Credentials to call other GCP services.
I will apply a new deployment, and 1 or 2 out of the 3 pods won't be able to authenticate using the googleapis (google-auth-library) npm package (tried with version v73.0.0 and the latest v84.0.0).
I get:
Error: Could not load the default credentials. Browse to https://cloud.google.com/docs/authentication/getting-started for more information. at GoogleAuth.getApplicationDefaultAsync (/node_modules/google-auth-library/build/src/auth/googleauth.js:173:19)
I am using this code and retrying on failure:
const {google} = require('googleapis');
const setGoogleAuth = async () => {
try {
const auth = new google.auth.GoogleAuth({
// Scopes can be specified either as an array or as a single, space-delimited string.
scopes: ['https://www.googleapis.com/auth/cloud-platform'],
});
// Acquire an auth client, and bind it to all future calls
const authClient = await auth.getClient();
google.options({auth: authClient});
} catch (e) {
console.error(e)
//retry
//sleep for 3 seconds
await sleep(3000)
await setGoogleAuth()
}
}
Calling the metadata server manually from a troubled pod via curl --location --request POST 'http://metadata/computeMetadata/v1/instance/service-accounts/default/identity?audience=my-gcp-endpoint' \ --header 'Metadata-Flavor: Google' \ --data-raw '{}'
returns a valid token
Sometimes killing the pod and having them recreated works (using minReplicas in Horizontal Pod Autoscaler) and sometimes no matter how many times I kill the troubled pods the issue persists. Other times, I'll redeploy and have no problems. The behaviour seems very non-deterministic.
I also tried running another node process inside the troubled pod, and did not get the error
Any help would be appreciated, thank you!
Possible related issues
https://github.com/googleapis/google-auth-library-nodejs/issues/798 https://github.com/googleapis/google-auth-library-nodejs/issues/786 https://github.com/googleapis/google-auth-library-nodejs/issues/526
Environment details
- GKE Version: 1.20.8-gke.900
- Node.js version: v10.24.1
- npm version: 6.14.12
-
google-auth-library
version: tried with v73.0.0 and v84.0.0 (latest)
Steps to reproduce
- Difficult to reproduce as behaviour is non deterministic
@mikegin could you your cluster with the environment variable DEBUG_AUTH
set to true
, and provide and provide any additional error information ?
@bcoe
Here is some more output with the DEBUG_AUTH
set to true
{ FetchError: network timeout at: http://metadata.google.internal./computeMetadata/v1/instance
at Timeout.<anonymous> (/node_modules/node-fetch/lib/index.js:1448:13)
at ontimeout (timers.js:436:11)
at tryOnTimeout (timers.js:300:5)
at listOnTimeout (timers.js:263:5)
at Timer.processTimers (timers.js:223:10)
message:
'network timeout at: http://metadata.google.internal./computeMetadata/v1/instance',
type: 'request-timeout',
config:
{ url:
'http://metadata.google.internal./computeMetadata/v1/instance',
headers: { 'Metadata-Flavor': 'Google' },
retryConfig:
{ noResponseRetries: 0,
currentRetryAttempt: 0,
retry: 3,
httpMethodsToRetry: [Array],
statusCodesToRetry: [Array] },
responseType: 'text',
timeout: 3000,
paramsSerializer: [Function: paramsSerializer],
validateStatus: [Function: validateStatus],
method: 'GET' } }
@mikegin could you try setting DETECT_GCP_RETRIES=3
, and see if this stops the behavior? This will hopefully give GKE Autopilot more time to initialize.
@bcoe for me it helped, thanks. 🎉 Our GKE is not in autopilot mode, I had the same issue in our staging environment, where everything is scaled down, on our production system fortunately I haven't had this issue. So my first thought was that it didn't get enough CPU, or something...
@bcoe I will try that. So far I found setting K_SERVICE=true (based on the code in the gcp-metadata library) has helped. I redeployed 10 times or so to test and haven't experienced the issue.
DETECT_GCP_RETRIES=3
also seems to work. Thanks @bcoe!
@mikegin @bvarga any thoughts on where we could document this setting, so it might have jumped out at you? Would you have noticed documentation calling out this setting in the README for this repo?
Adding a "Troubleshooting" section in the README with the issue and solution would be good. Even mentioning the DEBUG_AUTH
would be great, as it helped with further diagnosis.
We also hit this today and spent quite a while chasing shadows before eventually finding this thread.
While documentation is good and an improvement...
As a Google Cloud customer, I would like the Google Cloud SDKs to work on Google Cloud by default. It is not ideal to have to find somewhat obscure env vars hidden in the READMEs of internal SDK libraries in order to make things work reliably.
It seems like there is already handling to speed up the metadata checks for those who are running the SDK outside of GCP, e.g. the failFast
here. Perhaps defaulting to more tries would be OK?
Also, there are also some (apparently) GCE-specific checks that control the failFast
behaviour, here, but from a quick test, the env vars they use don't seem to be present for GKE workloads (i.e. inside a pod). Maybe those could be extended in some way to detect GKE as well as GCE?
@jdleesmiller the folks who maintain this auth library don't work directly with the folks who maintain GKE Autopilot (this is one of many GCP environments we make an effort to support, Cloud Run, Cloud Build, Cloud Functions, people running from their laptop, etc.). I'm going to reach out to someone on the GKE Autopilot team, and surface this issue with them.
I've seen these issues in the past in GKE environments; perhaps it relates to too many containers coming up at once and all hammering the metadata server? or, perhaps the metadata server isn't available for the first few seconds of a container being provisioned?
I'll see if we can figure out a workaround, with the team.
Many thanks @bcoe .
In case it helps, a bit more about our case:
- Like https://github.com/googleapis/google-auth-library-nodejs/issues/1236#issuecomment-915457074, we are not actually on GKE Autopilot (just GKE Standard), but I think the common thread may be workload identity, which is enabled by default on GKE Autopilot and also for the service in which we saw this problem on GKE Standard.
- The service in question was a redis-backed message queue worker (bull) --- with a non-empty queue, it starts up quite quickly, pulls its first message from redis (so the network must be up to at least the extent required for it to connect to redis) and then tries to connect to BigQuery using its service account, which fails with the above error. We think that is indeed racing somehow with the metadata server becoming available --- adding retries helped, as apparently did a 5s delay on startup.
- There were only ~2 of these worker pods (one container each), and it was just those pods starting up at the time, so I don't think anything was hammering the metadata server, but if there are any metrics we should look at for that we can see what they say.
Thanks again!
which GKE version were you using?
It looks like at the time we were on 1.18.20-gke.901.
Thanks @bcoe. I was facing the same issue with a deployment in a standard GKE cluster. Setting DETECT_GCP_RETRIES=3
did work, but it would be great to have this documented somewhere easier to find, maybe in the GKE Workload Identity troubleshooting section?
Want to give an update to this issue - this PR should resolve this issue, however we're syncing internally on its implementation and uniformity across languages: https://github.com/googleapis/gcp-metadata/pull/528
We've merged a fix! Happy to share this is fixed as of gcp-metadata
v5.1.0, which should be automatically picked up by future installations of this library.
- https://github.com/googleapis/gcp-metadata/pull/528