
Increase GCE metadata timeout for GKE workload identity

Open jmcarp opened this issue 3 years ago • 3 comments

When authenticating via GKE workload identity, the GCE metadata server isn't available for the first few seconds of the pod's existence. Since gargle uses a default timeout of 0.8s for the GCE metadata server, authentication often fails with workload identity enabled if it's the first step of a command. I think it would be useful to either increase the default timeout in line with the GKE docs, or if that would cause other problems, include a note in the docs. Happy to submit a patch for either option.

jmcarp avatar Aug 10 '21 23:08 jmcarp

In principle, I am happy to use a "better" timeout. I went to

https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#troubleshoot-timeout

but I don't actually see any concrete proposal re: a new timeout. What did you have in mind when you said "increase the default timeout in line with the GKE docs"?

Is the other proposed workaround available to you? I mean this:

Alternatively, you can deploy an initContainer that waits until the GKE metadata server is ready before running the Pod's main container.

jennybc avatar Aug 17 '21 17:08 jennybc

What did you have in mind when you said "increase the default timeout in line with the GKE docs"?

Hm, I took a closer look at the docs and some of the first-party SDKs, and I don't see a specific recommendation for a timeout. I assume the initContainer workaround would work, but it's been much easier for me to override the gargle.gce.timeout option.
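For anyone landing here with the same problem, the override can be sketched roughly as below. The option name gargle.gce.timeout is the one mentioned above; the value of 10 seconds is only an illustrative assumption, not a number recommended by gargle or the GKE docs.

```r
# Raise the GCE metadata server timeout before anything triggers auth.
# Assumption: 10 (seconds) is an arbitrary illustrative value.
old <- options(gargle.gce.timeout = 10)

# Read the option back to confirm what later auth calls would see.
timeout <- getOption("gargle.gce.timeout")

# ... code that authenticates (e.g. a bigrquery call) would go here ...

# To restore the previous setting afterwards: options(old)
```

Setting the option at the top of the script, before the first call that needs credentials, is what matters, since the failure shows up when auth is the first step of a command.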

Do you think a note about workload identity and timeouts in https://github.com/r-lib/gargle/blob/main/R/credentials_gce.R would be helpful? I think that would have saved me some time debugging the issue, since the error I got from bigrquery wasn't very helpful: "Can't get Google credentials."

jmcarp avatar Nov 05 '21 03:11 jmcarp

@jmcarp actually, we ended up cooking up an idea in #195 that I think may help here: if the initial detect_gce call determines we're on GCE, then automatically bump up the GCE timeout (since users should be happy to wait).
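To make the idea concrete, here is a rough sketch of the selection logic only. choose_gce_timeout() is a hypothetical helper, not gargle's actual implementation or API; the 0.8s default comes from the issue description, and the extended value of 10s is an assumption for illustration.

```r
# Hypothetical helper (not part of gargle): pick a metadata timeout
# depending on whether an initial detection step concluded we are on GCE.
choose_gce_timeout <- function(on_gce, default = 0.8, extended = 10) {
  # An explicit user setting always wins over the automatic choice.
  user <- getOption("gargle.gce.timeout")
  if (!is.null(user)) {
    return(user)
  }
  # Assumption: once we know we are on GCE, waiting longer is acceptable.
  if (isTRUE(on_gce)) extended else default
}
```

With no option set, choose_gce_timeout(TRUE) would return the extended value and choose_gce_timeout(FALSE) the short default, so users off GCE would not pay for the longer wait.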

Would that do the trick for you here?

craigcitro avatar Nov 09 '21 03:11 craigcitro

I'm working on GCE things and a gargle release, so this is an open invitation if anyone wants to post an update or even just indicate that they still care about this.

jennybc avatar Oct 24 '22 17:10 jennybc