grpc-go icon indicating copy to clipboard operation
grpc-go copied to clipboard

"GCE" metadata check is fragile

Open Capstan opened this issue 2 years ago • 11 comments

See also #3666

Google Container Engine, Google Cloud Run, Google App Engine, and possibly also Google Cloud Functions all provide the "GCE" metadata server. However, gRPC checks explicitly for exact string matches of Google or Google Compute Engine when running on linux by inspecting the contents of the /sys/class/dmi/id/product_name file.

The runtime contracts of none of the non-GCE execution platforms state that they will provide either of these values in /sys/class/dmi/id/product_name. See https://issuetracker.google.com/issues/158533357 for a specific case where it was intentionally changed by product decision, and then causes breakage in gRPC clients.

This issue is not specific to Go. There are equivalent issues in Java and probably every other auth library dealing with application default credentials which on GCP rely on the metadata servers. Given that the official API guidance is to rely on our client libraries, that the client libraries have baked in a narrow check makes it incredibly difficult to make these changes without pinning the product to legacy behavior, even if a PR ends up relaxing the check.

/cc @krishna-seerapu @steren

Capstan avatar May 10 '22 17:05 Capstan

Would using https://pkg.go.dev/cloud.google.com/go/compute/metadata#OnGCE solve the problem?

menghanl avatar May 10 '22 18:05 menghanl

@menghanl systemInfoSuggestsGCE uses the exact same logic. On the other hand if that were used, then a service could explicitly set then environment variable GCE_METADATA_HOST to the normal value 169.254.169.254, which would escape out of the testing logic. That said, the comments look like that environment variable is specific to the Go compute client, and might not be generally available across other language clients (or clearly, across all Go clients).

Capstan avatar May 10 '22 18:05 Capstan

Even if we did change the way that GCE detection is done, there are already a bunch of clients in the wild (e.g. from older gRPC releases) that depend on this.

So if an existing platform changes behavior and breaks it, we would still have problems - @Capstan has that been considered?. I.e. I'm wondering how changing this in new client releases would solve anything.

apolcyn avatar May 10 '22 19:05 apolcyn

@apolcyn It would not solve the breaking of extant services. It would however allow services to selectively change their behavior to an improved reporting model of what actual service is running. It remains, AFAIK, an open question about how to roll that out safely.

Capstan avatar May 10 '22 20:05 Capstan

In order to help client libraries know the runtime they are running on, Google wanted Google Cloud Run, Google App Engine and Google Cloud Functions to return more accurate product names at /sys/class/dmi/id/product_name. This change is now on hold due to downstream impact on grpc-go and gax-java.

My suggestion would be for grpc-go to relax this check and instead of looking for a hardcoded list of values, to only check that the product_name string starts with Google. After some time (months? years?), Google Cloud runtimes might be able to switch (carefully) to more accurate product names.

steren avatar May 11 '22 23:05 steren

#5348 changes OnGCE to check for the prefix. A separate issue should be filed to update pkg.go.dev/cloud.google.com/go/compute/metadata#OnGCE

We can switch to metadata.OnGCE() if necessary. The concern is that it pulls in a big google cloud dependency to gRPC's main module.

menghanl avatar May 12 '22 21:05 menghanl

Assigning to @menghanl since you have a PR open for this.

easwars avatar May 16 '22 17:05 easwars

A separate issue should be filed to update pkg.go.dev/cloud.google.com/go/compute/metadata#OnGCE

Filed https://github.com/googleapis/google-cloud-go/issues/6037

easwars avatar May 16 '22 18:05 easwars

I am not sure any changes should be made here without going to an update to aip.dev. This kind of change should be driven from a design that all of the languages we support can implement.

Update: I will bring this up at our auth meeting later this week.

codyoss avatar May 16 '22 18:05 codyoss

In that case, we will keep #5348 open until a decision is made.

menghanl avatar May 16 '22 18:05 menghanl

I did bring this up with relevant parties today, still more discussion to be had. I think we should wait for changes to official docs or until we have a plan forward before we update what BIOS values to check.

codyoss avatar May 17 '22 18:05 codyoss

Based on the updates on https://github.com/googleapis/google-cloud-go/issues/6037, I don't think we should do anything here until there is broader agreement.

dfawley avatar Nov 28 '22 18:11 dfawley