auth: Sporadic errors refreshing token with workload identity federation
Client
aiplatform.NewPredictionClient
Environment
Linux on AWS EKS, Go 1.24
Code and Dependencies
package main

import (
	"context"

	aiplatform "cloud.google.com/go/aiplatform/apiv1"
	"google.golang.org/api/option"
)

// Abridged: how we construct the client from the workload identity federation credential file.
func newPredictionClient(ctx context.Context, clientConfigPath string) (*aiplatform.PredictionClient, error) {
	return aiplatform.NewPredictionClient(ctx,
		option.WithCredentialsFile(clientConfigPath))
}
go.mod
module modname
go 1.24.0
require (
cloud.google.com/go/aiplatform v1.69.0
cloud.google.com/go/vertexai v0.13.3
...
cloud.google.com/go v0.116.0 // indirect
cloud.google.com/go/auth v0.12.1 // indirect
...
google.golang.org/api v0.211.0
google.golang.org/grpc v1.70.0
)
Expected behavior
We are using the Vertex AI client library from AWS, authenticating with workload identity federation via option.WithCredentialsFile. This works great 99% of the time.
Actual behavior
Sporadically, we get errors like:
Google Vertex Error, reason=: transport: per-RPC creds failed due to error: credentials: status code 400: {"error":"invalid_grant","error_description":"ID Token issued at 1749625771 is stale to sign-in."}
We'll see one or more such errors, all referring to the same timestamp, starting a little more than an hour after that issued-at time and sometimes as much as 80 minutes after it.
We understand there is a refresh mechanism in place, so it shouldn't be an issue to re-use a given client over a long interval, but it seems like there are some scenarios where this refresh mechanism doesn't work.
Additional context
I found a few related issues:
One, Two
Searching for this error message turns up a lot of results about Firebase auth. Given that there are two auth libraries in play (cloud.google.com/go/auth and golang.org/x/oauth2), it's pretty hard to trace the code and figure out what might be going on. It doesn't seem related to workload identity federation in particular, but rather to the general token exchange/refresh mechanism in the client library.
On the GCP side, it's also hard to find any log info about these errors (if the requests are even making it to GCP?). It's unclear whether to look in the project that houses the Vertex API or the project that houses the workload identity pool.
Thank you for the report. I have a couple of clarifying questions:
- Can you share more about what type of credential file this is? It sounds like you are using an external account credential; if so, what is the configured token lifetime in that file?
- I agree that it sounds like our refreshing logic is not working for your use case for some reason. When you see these errors, do they eventually heal by themselves, or do you need to restart jobs to get the credentials to refresh?
Given that there are two auth libraries in play
Given your error message and the state of the repo, cloud.google.com/go/auth should be the auth library in play. I can see the error references the credentials package.
It doesn't seem related to workload identity federation in particular, more so the general token exchange/refresh mechanism in the client library.
I suspect this is somehow related to workload identity federation, or else there would be a lot more reports like this. Also, this auth flow has some extra network hops and tokens in play, so I am guessing there is some complexity there.
On the GCP side, it's also hard to find any log info about these errors (if the requests are even making it to GCP?) Unclear whether to look in the project that houses the Vertex API or the project that houses the workload identity pool.
If you update your dependencies to a more recent release, you can use the logging support we have added since the versions you are on. Make sure to check out the warning about turning on this logging, though, to make sure it would be okay in your environment: https://github.com/googleapis/google-cloud-go/blob/main/debug.md#requestresponse-logging
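Roughly, something like this (a sketch; double-check the exact options against debug.md for whatever versions you land on, and note that debug logs can include request/response payloads):

import (
	"context"
	"log/slog"
	"os"

	aiplatform "cloud.google.com/go/aiplatform/apiv1"
	"google.golang.org/api/option"
)

// Sketch: wire a debug-level slog logger into the client via option.WithLogger.
// Alternatively, the GOOGLE_SDK_GO_LOGGING_LEVEL=debug environment variable
// enables the same logging without code changes (again, verify in debug.md).
func newLoggingPredictionClient(ctx context.Context, clientConfigPath string) (*aiplatform.PredictionClient, error) {
	logger := slog.New(slog.NewJSONHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelDebug}))
	return aiplatform.NewPredictionClient(ctx,
		option.WithCredentialsFile(clientConfigPath),
		option.WithLogger(logger),
	)
}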
Correct, it's an external_account credential file, from an AWS EKS workload, using the recommended configuration. We don't set any token lifetime explicitly, which I understand to mean it is using the default of one hour.
It's a little hard to say whether it is self-healing or not; I'm still digging through logs to figure that out, correlating with when we redeploy or when new pods spin up. During the workday, our continuous deployment means we rarely have a given deployment persist for more than an hour, so we tend to see this for async Vertex jobs that we run overnight, when we are not re-deploying.
Thanks for the pointer on the logging, I'll see if we can update.
Just for context there was a similar issue to this reported in the past, but it was thought to be fixed by: https://github.com/googleapis/google-cloud-go/pull/10920
Could you share a redacted version of the file you are authenticating with? The external account flows can go down many different routes based on which fields are present in that file. Or, if you have a service contract, feel free to open an issue with support to share more details, and link to this issue for context.
Also cc @nbayati
I will open a support ticket and link to this issue. Thanks for the quick follow-up/feedback!
I was able to add time-of-request logging to all our errored Vertex API requests, and late Friday/early Saturday a batch of asynchronous jobs again tried to use an expired token. In this instance, there were 111 failed requests, all associated with the same ID token "issued at" timestamp.
Google Vertex API Error: rpc error: code = Unauthenticated desc = transport: per-RPC creds failed due to error: credentials: status code 400: {"error":"invalid_grant","error_description":"ID Token issued at 1749859528 is stale to sign-in."}
1749859528 = Sat Jun 14 00:05:28 UTC 2025 or Fri Jun 13 17:05:28 PDT 2025 (subsequent logs in PDT)
The first of these errors happened at 18:07:21 PDT (1 hour and 2 minutes after issuance). The last happened at 23:07:24, over 5 hours later!
After that they stop. There was no deployment at that point; the pod kept running the rest of the day. But I'm still figuring out whether the errors stopped because the auth component self-healed, or because the batch of async jobs completed, i.e. there were no more requests.
@patrickvinograd
Just looking at the JSON config file you provided offline and inspecting the cloud.google.com/go/auth source visually, it seems that the following block in credentials/internal/externalaccount.tokenProvider.Token could be a potential source of the problem.
If expires_in is 0, this block returns an error that is later discarded.
// The RFC8693 doesn't define the explicit 0 of "expires_in" field behavior.
if stsResp.ExpiresIn <= 0 {
return nil, fmt.Errorf("credentials: got invalid expiry from security token service")
}
The stsResp var holds a credentials/internal/stsexchange.TokenResponse struct:
// TokenResponse is used to decode the remote server response during
// an oauth2 token exchange.
type TokenResponse struct {
AccessToken string `json:"access_token"`
IssuedTokenType string `json:"issued_token_type"`
TokenType string `json:"token_type"`
ExpiresIn int `json:"expires_in"`
Scope string `json:"scope"`
RefreshToken string `json:"refresh_token"`
}
So, externalaccount.tokenProvider.Token would return that error in either of these cases:
- the token service returns a token with expires_in: 0, or
- the token service returns a token without expires_in, in which case ExpiresIn has the Go int default value of 0 (see the standalone snippet below).
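To make the second case concrete, here is a small standalone sketch (not the library code): decoding a response body that omits expires_in leaves the int field at its zero value, which would trip the check above.

package main

import (
	"encoding/json"
	"fmt"
)

// tokenResponse is a trimmed-down copy of the TokenResponse fields relevant here.
type tokenResponse struct {
	AccessToken string `json:"access_token"`
	ExpiresIn   int    `json:"expires_in"`
}

func main() {
	// A hypothetical STS response body with no "expires_in" field.
	body := []byte(`{"access_token":"abc","token_type":"Bearer"}`)
	var resp tokenResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		panic(err)
	}
	fmt.Println(resp.ExpiresIn) // prints 0, so the ExpiresIn <= 0 check fails the refresh
}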
That error would surface, and be discarded, when the enclosing auth.cachedTokenProvider.tokenAsync attempts its async refresh:
func (c *cachedTokenProvider) tokenAsync(ctx context.Context) {
fn := func() {
c.mu.Lock()
c.isRefreshRunning = true
c.mu.Unlock()
t, err := c.tp.Token(ctx)
c.mu.Lock()
defer c.mu.Unlock()
c.isRefreshRunning = false
if err != nil {
// Discard errors from the non-blocking refresh, but prevent further
// attempts.
c.isRefreshErr = true
return
}
I'd have to look more closely at the logic, but I believe this failing async refresh would allow an expired token to remain in use with no other error raised?
Perhaps the discarded error could be surfaced by disabling the async refresh with the auth/credentials.DetectOptions.DisableAsyncRefresh flag?
@quartzmo That logic jumped out at me as well. I agree that we could be experiencing silent failures during the attempt to proactively refresh the token. But the errors we are logging are all happening more than 1 hour after the "issued at" time for the token, so shouldn't we be firmly in the tokenBlocking code path? I.e., even if we couldn't do an async refresh, shouldn't it do a blocking refresh on the next request?
The other thing I'm confused about is that it turns out we are instantiating a new aiplatform.NewPredictionClient(ctx, option.WithCredentialsFile(clientConfigPath)) for each of the asynchronous jobs that we run. I realize we don't have to; the PredictionClient is listed as being thread-safe and reusable. But for the moment, that's how it's implemented.
And yet, we see 100+ errors all mentioning the same "issued at" timestamp, across many seemingly independent jobs/clients that started at various times. They are running in the same Go monoservice, so it's possible there's something shared across clients/transports; I just can't see it from clicking around the code on GitHub. Is this to be expected?
It's also a little hard to follow how many separate tokens/refreshes are in play. It seems like internally it's pulling the EKS service account token from the projected volume (I assume there's no caching there), then there's the GCP STS token endpoint, and then is there some additional token exchange on top of that when workload identity federation is used?
@patrickvinograd Thank you for the extra details and the good questions. What do you think about disabling the async refresh with the auth/credentials.DetectOptions.DisableAsyncRefresh flag, just to see if it does reveal any suppressed error?
Open to trying that, but I'm not clear on how I would supply that option. We're using
aiplatform.NewPredictionClient(ctx, option.WithCredentialsFile(clientConfigPath))
i.e. not relying on ADC, and I don't see a way to plumb DetectOptions through any of the option.ClientOption variations.
// Use credentials.DetectDefault, but pass DetectOptions that point at the exact
// credential file instead of letting ADC search the environment. This keeps the
// same authentication flow while exposing options like DisableAsyncRefresh.
creds, err := credentials.DetectDefault(&credentials.DetectOptions{
	CredentialsFile:     clientConfigPath,
	DisableAsyncRefresh: true,
})
if err != nil {
	log.Fatalf("Failed to detect credentials from file %q: %v", clientConfigPath, err)
}

// Pass the loaded credentials to the client with option.WithAuthCredentials.
// option.WithCredentialsFile(clientConfigPath) would also work, but going through
// DetectDefault is what lets you set DisableAsyncRefresh.
client, err := aiplatform.NewPredictionClient(ctx, option.WithAuthCredentials(creds))
Thanks, I'll try that out and see how it performs.
My team had one other finding which is that we are not calling Close on the aiplatform.PredictionClient. Do you think this could be causing/exacerbating the problem? Even though we instantiate separate PredictionClients for each job, there's clearly some sharing going on at the gtransport level since all the errors refer to the same token timestamp. Wondering if calling Close would at least narrow the scope of the expired tokens.
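For reference, the change we have in mind is roughly this (a sketch of our per-job code, keeping the per-job client creation for now):

// Inside the job's run function (which returns an error):
client, err := aiplatform.NewPredictionClient(ctx,
	option.WithCredentialsFile(clientConfigPath))
if err != nil {
	return fmt.Errorf("create prediction client: %w", err)
}
// Close the per-job client when the job finishes so its underlying connection
// (and whatever credential state hangs off it) is released.
defer client.Close()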
I can't picture how two clients initialized as shown would share a Token, but there could be something I'm missing. I believe this is where the grpc credentials ultimately get set up.
https://github.com/googleapis/google-cloud-go/blob/auth/v0.16.2/auth/internal/transport/cba.go#L142
I can't either, and yet we clearly saw 100+ requests with the identical issued at <timestamp> in the error. I've been up and down our own code and the Google auth/transport code looking for anything shared, but I don't see anything.
My only flail at explaining that at this point is: is there any chance the error message from my original issue report is referring to the EKS service account token? We have this set up using a projected service account token. All the instances of the client on a given host would be fetching the same token from the same projected volume, and if it somehow was not refreshing as expected...
But that only makes sense if that token would be presented to a Google endpoint that would return the "ID Token issued at 1749625771 is stale to sign-in." error, which I have no real visibility into. There are clearly multiple tokens in play, but I don't know which ones flow to which service, and which errors those services might generate.
I know this idea would absolve y'all of responsibility (unless there's a bug/caching in the lookup of the projected token, vs. its contents), so while it's tempting to point to that and peace out, I'd really appreciate a close analysis of whether this makes sense or not. You have been super helpful with this issue!
And I see fileSubjectProvider is ultimately just doing an io.ReadAll, so it doesn't seem like there's anything stateful happening there.
I'm adding logging of the projected token payload when we run into this error, so we'll hopefully be able to isolate it to that part of the system.
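Roughly what that extra logging looks like (a sketch of our code; tokenPath is the projected volume path from our pod spec):

import (
	"encoding/base64"
	"encoding/json"
	"log"
	"os"
	"strings"
	"time"
)

// logProjectedTokenClaims decodes the claims segment of the projected k8s service
// account token (a JWT) and logs its iat/exp, so we can compare them against the
// "issued at" timestamp in the Vertex error.
func logProjectedTokenClaims(tokenPath string) {
	raw, err := os.ReadFile(tokenPath)
	if err != nil {
		log.Printf("read projected token: %v", err)
		return
	}
	parts := strings.Split(strings.TrimSpace(string(raw)), ".")
	if len(parts) != 3 {
		log.Printf("unexpected projected token format")
		return
	}
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		log.Printf("decode token claims: %v", err)
		return
	}
	var claims struct {
		Iat int64 `json:"iat"`
		Exp int64 `json:"exp"`
	}
	if err := json.Unmarshal(payload, &claims); err != nil {
		log.Printf("unmarshal token claims: %v", err)
		return
	}
	log.Printf("projected token iat=%s exp=%s",
		time.Unix(claims.Iat, 0).UTC().Format(time.RFC3339),
		time.Unix(claims.Exp, 0).UTC().Format(time.RFC3339))
}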
I know this idea would absolve y'all of responsibility (unless there's a bug/caching in the lookup of the projected token, vs. its contents), so while it's tempting to point to that and peace out I'd really appreciate a close analysis of whether this makes sense or not.
I'll try to route your questions to someone who might know!
I was able to simulate a stale k8s service account token. I saved off a k8s token and waited for it to expire. Then I pointed the workload identity client configuration at the saved token instead of the projected volume. When I invoked a request, I indeed got {"error":"invalid_grant","error_description":"ID Token issued at 1750790933 is stale to sign-in."}
I feel fairly satisfied that this is the failure mode - it's consistent with the observed error, and it explains the identical issued-at timestamp across otherwise independent clients. Now of course, I have to figure out why k8s is occasionally handing out expired tokens. 😓
I have to figure out why k8s is occasionally handing out expired tokens.
To clarify:
- Is this k8s on AWS?
- Is the k8s token the base token in this workflow?
- Is the logic in this Auth library working correctly?
- Yes, AWS EKS.
- Correct, it's a k8s service account volume token that is used as the basis of the token exchange, per the GCP workload identity federation configuration guide.
- As far as I can tell, yes.
So I think we can close this issue.
But, leaving this as a breadcrumb for anybody who comes along later: here's one possible explanation for why k8s would be handing out expired tokens: https://github.com/kubernetes/kubernetes/issues/116481 - after a pod goes into terminating state, kubelet stops refreshing tokens. If you have a long terminationGracePeriod, say due to processing long-running data science jobs, then during that interval tokens will no longer be refreshed.