
Sometimes failing to retrieve Application Default Credentials on GKE Autopilot in googleapis auth library

Open mikegin opened this issue 3 years ago • 14 comments

Mirror question here in Stack Overflow: https://stackoverflow.com/questions/68854640/sometimes-failing-to-retrieve-application-default-credentials-on-gke-autopilot-i



Some pods in my GKE Autopilot cluster aren't able to grab the Application Default Credentials to call other GCP services.

When I apply a new deployment, 1 or 2 of the 3 pods won't be able to authenticate using the googleapis (google-auth-library) npm package (tried with v73.0.0 and the latest, v84.0.0).

I get:

Error: Could not load the default credentials. Browse to https://cloud.google.com/docs/authentication/getting-started for more information.
    at GoogleAuth.getApplicationDefaultAsync (/node_modules/google-auth-library/build/src/auth/googleauth.js:173:19)

I am using this code and retrying on failure:


const {google} = require('googleapis');

// Simple sleep helper (not shown in the original snippet)
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const setGoogleAuth = async () => {
    try {
        const auth = new google.auth.GoogleAuth({
            // Scopes can be specified either as an array or as a single, space-delimited string.
            scopes: ['https://www.googleapis.com/auth/cloud-platform'],
        });

        // Acquire an auth client, and bind it to all future calls
        const authClient = await auth.getClient();
        google.options({auth: authClient});
    } catch (e) {
        console.error(e);

        // Retry: sleep for 3 seconds, then try again
        await sleep(3000);
        await setGoogleAuth();
    }
};

Calling the metadata server manually from a troubled pod via

curl --location --request POST 'http://metadata/computeMetadata/v1/instance/service-accounts/default/identity?audience=my-gcp-endpoint' \
  --header 'Metadata-Flavor: Google' \
  --data-raw '{}'

returns a valid token.
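For reference, the same probe can be scripted from Node. This is a minimal sketch, assuming node-fetch (v2) is installed in the image; the audience value is a placeholder and probeMetadataServer is a hypothetical helper, not part of the deployment:

const fetch = require('node-fetch');

// Hypothetical helper: replicate the manual metadata-server probe from Node.
const probeMetadataServer = async () => {
    const res = await fetch(
        'http://metadata/computeMetadata/v1/instance/service-accounts/default/identity?audience=my-gcp-endpoint',
        {
            headers: {'Metadata-Flavor': 'Google'},
            timeout: 3000, // fail fast instead of hanging indefinitely
        }
    );
    console.log(res.status, await res.text());
};

probeMetadataServer().catch(console.error);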

Sometimes killing the troubled pods and letting them be recreated (via minReplicas in the Horizontal Pod Autoscaler) works, and sometimes the issue persists no matter how many times I kill them. Other times, I'll redeploy and have no problems. The behaviour seems very non-deterministic.

I also tried running another Node process inside the troubled pod, and did not get the error.



Any help would be appreciated, thank you!

Possible related issues

  • https://github.com/googleapis/google-auth-library-nodejs/issues/798
  • https://github.com/googleapis/google-auth-library-nodejs/issues/786
  • https://github.com/googleapis/google-auth-library-nodejs/issues/526

Environment details

  • GKE Version: 1.20.8-gke.900
  • Node.js version: v10.24.1
  • npm version: 6.14.12
  • google-auth-library version: tried with v73.0.0 and v84.0.0 (latest)

Steps to reproduce

  1. Difficult to reproduce, as the behaviour is non-deterministic

mikegin avatar Aug 20 '21 18:08 mikegin

@mikegin could you run your cluster with the environment variable DEBUG_AUTH set to true, and provide any additional error information?

bcoe avatar Aug 23 '21 18:08 bcoe

@bcoe

Here is some more output with DEBUG_AUTH set to true:

{ FetchError: network timeout at: http://metadata.google.internal./computeMetadata/v1/instance
    at Timeout.<anonymous> (/node_modules/node-fetch/lib/index.js:1448:13)
    at ontimeout (timers.js:436:11)
    at tryOnTimeout (timers.js:300:5)
    at listOnTimeout (timers.js:263:5)
    at Timer.processTimers (timers.js:223:10)
  message:
   'network timeout at: http://metadata.google.internal./computeMetadata/v1/instance',
  type: 'request-timeout',
  config:
   { url:
      'http://metadata.google.internal./computeMetadata/v1/instance',
     headers: { 'Metadata-Flavor': 'Google' },
     retryConfig:
      { noResponseRetries: 0,
        currentRetryAttempt: 0,
        retry: 3,
        httpMethodsToRetry: [Array],
        statusCodesToRetry: [Array] },
     responseType: 'text',
     timeout: 3000,
     paramsSerializer: [Function: paramsSerializer],
     validateStatus: [Function: validateStatus],
     method: 'GET' } }

mikegin avatar Aug 24 '21 13:08 mikegin

@mikegin could you try setting DETECT_GCP_RETRIES=3, and see if this stops the behavior? This will hopefully give GKE Autopilot more time to initialize.
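As a sketch of one way to wire this in (assuming the variable is read when detection runs rather than at import time; setting it in the container's env in the Deployment manifest is the more usual route, and the in-process fallback below is only illustrative):

// Illustrative startup guard, an assumption rather than documented usage:
// default DETECT_GCP_RETRIES to 3 unless it's already set on the container.
process.env.DETECT_GCP_RETRIES = process.env.DETECT_GCP_RETRIES || '3';

const {google} = require('googleapis');
// ...then proceed with the GoogleAuth setup from the issue description.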

bcoe avatar Sep 08 '21 15:09 bcoe

@bcoe for me it helped, thanks. 🎉 Our GKE is not in Autopilot mode. I had the same issue in our staging environment, where everything is scaled down; on our production system, fortunately, I haven't had this issue. So my first thought was that it didn't get enough CPU, or something...

bvarga avatar Sep 08 '21 18:09 bvarga

@bcoe I will try that. So far I found setting K_SERVICE=true (based on the code in the gcp-metadata library) has helped. I redeployed 10 times or so to test and haven't experienced the issue.

mikegin avatar Sep 08 '21 19:09 mikegin

DETECT_GCP_RETRIES=3 also seems to work. Thanks @bcoe!

mikegin avatar Sep 09 '21 15:09 mikegin

@mikegin @bvarga any thoughts on where we could document this setting, so it might have jumped out at you? Would you have noticed documentation calling out this setting in the README for this repo?

bcoe avatar Sep 09 '21 22:09 bcoe

Adding a "Troubleshooting" section in the README with the issue and solution would be good. Even mentioning DEBUG_AUTH would be great, as it helped with further diagnosis.

mikegin avatar Sep 09 '21 22:09 mikegin

We also hit this today and spent quite a while chasing shadows before eventually finding this thread.

While documentation is good and an improvement...

As a Google Cloud customer, I would like the Google Cloud SDKs to work on Google Cloud by default. It is not ideal to have to find somewhat obscure env vars hidden in the READMEs of internal SDK libraries in order to make things work reliably.

It seems like there is already handling to speed up the metadata checks for those who are running the SDK outside of GCP, e.g. the failFast here. Perhaps defaulting to more tries would be OK?

Also, there are some (apparently) GCE-specific checks that control the failFast behaviour, here, but from a quick test, the env vars they use don't seem to be present for GKE workloads (i.e. inside a pod). Maybe those could be extended in some way to detect GKE as well as GCE?
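To illustrate the kind of detection being described, here is a rough, hypothetical sketch of the idea (not the actual gcp-metadata implementation): combine environment hints with a retried metadata probe.

const gcpMetadata = require('gcp-metadata');

// Hypothetical detection helper, not library code: trust well-known
// serverless env vars first, otherwise probe the metadata server (which
// gcp-metadata retries according to DETECT_GCP_RETRIES).
const looksLikeGcp = async () => {
    if (process.env.K_SERVICE || process.env.FUNCTION_NAME) {
        return true; // Cloud Run / Cloud Functions style environments
    }
    return gcpMetadata.isAvailable(); // resolves true if the server answers
};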

jdleesmiller avatar Sep 10 '21 16:09 jdleesmiller

@jdleesmiller the folks who maintain this auth library don't work directly with the folks who maintain GKE Autopilot (this is one of many GCP environments we make an effort to support: Cloud Run, Cloud Build, Cloud Functions, people running from their laptop, etc.). I'm going to reach out to someone on the GKE Autopilot team and surface this issue with them.

I've seen these issues in the past in GKE environments; perhaps it relates to too many containers coming up at once and all hammering the metadata server? Or perhaps the metadata server isn't available for the first few seconds of a container being provisioned?

I'll see if we can figure out a workaround, with the team.

bcoe avatar Sep 13 '21 22:09 bcoe

Many thanks @bcoe .

In case it helps, a bit more about our case:

  • Like https://github.com/googleapis/google-auth-library-nodejs/issues/1236#issuecomment-915457074, we are not actually on GKE Autopilot (just GKE Standard), but I think the common thread may be workload identity, which is enabled by default on GKE Autopilot and also for the service in which we saw this problem on GKE Standard.
  • The service in question was a Redis-backed message queue worker (bull). With a non-empty queue it starts up quite quickly, pulls its first message from Redis (so the network must be up at least to the extent required for it to connect to Redis), and then tries to connect to BigQuery using its service account, which fails with the above error. We think that is indeed racing somehow with the metadata server becoming available: adding retries helped, as apparently did a 5s delay on startup (see the sketch after this list).
  • There were only ~2 of these worker pods (one container each), and it was just those pods starting up at the time, so I don't think anything was hammering the metadata server, but if there are any metrics we should look at for that we can see what they say.
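A minimal sketch of that startup delay, written as a wait-for-metadata gate rather than a fixed sleep (waitForMetadataServer is a hypothetical helper built on gcp-metadata's isAvailable(); the attempt count and delay are arbitrary):

const gcpMetadata = require('gcp-metadata');

// Hypothetical readiness gate: poll the metadata server before the worker
// starts pulling messages, instead of sleeping a fixed 5 seconds.
const waitForMetadataServer = async (attempts = 10, delayMs = 1000) => {
    for (let i = 0; i < attempts; i++) {
        if (await gcpMetadata.isAvailable()) return;
        await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
    throw new Error('metadata server not reachable after retries');
};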

Thanks again!

jdleesmiller avatar Sep 14 '21 07:09 jdleesmiller

Which GKE version were you using?

zshihang avatar Oct 04 '21 17:10 zshihang

It looks like at the time we were on 1.18.20-gke.901.

jdleesmiller avatar Oct 05 '21 13:10 jdleesmiller

Thanks @bcoe. I was facing the same issue with a deployment in a standard GKE cluster. Setting DETECT_GCP_RETRIES=3 did work, but it would be great to have this documented somewhere easier to find, maybe in the GKE Workload Identity troubleshooting section?

3sne avatar Mar 16 '22 07:03 3sne

Want to give an update on this issue: https://github.com/googleapis/gcp-metadata/pull/528 should resolve it, however we're syncing internally on its implementation and uniformity across languages.

danielbankhead avatar Oct 25 '22 19:10 danielbankhead

We've merged a fix! Happy to share this is fixed as of gcp-metadata v5.1.0, which should be automatically picked up by future installations of this library.

  • https://github.com/googleapis/gcp-metadata/pull/528

danielbankhead avatar Dec 08 '22 01:12 danielbankhead