grpc-java icon indicating copy to clipboard operation
grpc-java copied to clipboard

GRPC will keep retrying (until RPC timeout) if a user uses a bad service account key

Open dzou opened this issue 4 years ago • 31 comments

What version of gRPC-Java are you using?

1.27.2

What is your environment?

Linux, JDK8

What do you see?

When a user uses an invalid service account key (like one that was deleted), it is treated as an UNAVAILABLE error. Unavailable errors are interpreted as ones that should be retried by client libraries; consequently the application will attempt to retry this operation with invalid credentials until the RPC timeout is reached (usually 10 minutes).

What do you expect to see instead?

When a user uses an invalid service account key to authenticate with GRPC, it should yield an UNAUTHENTICATED error which indicates the operation should not be retried.

Steps to reproduce the bug

This is most easily reproduced through experimenting with the client libraries.

  1. Create a service account and service account key, download it.
  2. Delete the key entry in Cloud Console (so the downloaded key is no longer valid).
  3. Authenticate with some google service using that key. Here is an example using Pub/Sub like this.

Additional details:

  • Many users have reported being confused by this behavior since 2018: https://github.com/googleapis/google-cloud-java/issues/3573

Suggested fix:

The code that controls this behavior is in GoogleAuthLibraryCallCredentials. I don't think all IOExceptions should be retried though.

In this case, one can see that the exception has a .getCause() which is HttpResponseException and it has .getStatusCode() == 400 which indicates a bad request. This is the error thrown if the user provides an invalid service account key.

Would it be possible to modify it so that if it is an IOException, it will examine the getCause() of the exception and throw UNAUTHENTICATED if the cause is HttpResponseException with status code 400?

Example of exception that you see:

th [] threw exception [Request processing failed; nested exception is com.google.api.gax.rpc.UnavailableException: io.grpc.StatusRuntimeException: UNAVAILABLE: Credentials failed to obtain metadata] with root cause

com.google.api.client.http.HttpResponseException: 400 Bad Request
{"error":"invalid_grant","error_description":"Invalid JWT Signature."}
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1113) ~[google-http-client-1.34.2.jar:1.34.2]
	at com.google.auth.oauth2.ServiceAccountCredentials.refreshAccessToken(ServiceAccountCredentials.java:441) ~[google-auth-library-oauth2-http-0.20.0.jar:na]
	at com.google.auth.oauth2.OAuth2Credentials.refresh(OAuth2Credentials.java:157) ~[google-auth-library-oauth2-http-0.20.0.jar:na]
	at com.google.auth.oauth2.OAuth2Credentials.getRequestMetadata(OAuth2Credentials.java:145) ~[google-auth-library-oauth2-http-0.20.0.jar:na]
	at com.google.auth.oauth2.ServiceAccountCredentials.getRequestMetadata(ServiceAccountCredentials.java:603) ~[google-auth-library-oauth2-http-0.20.0.jar:na]
	at com.google.auth.Credentials.blockingGetToCallback(Credentials.java:112) ~[google-auth-library-credentials-0.20.0.jar:na]
	at com.google.auth.Credentials$1.run(Credentials.java:98) ~[google-auth-library-credentials-0.20.0.jar:na]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_181-google-v7]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_181-google-v7]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[na:1.8.0_181-google-v7]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:295) ~[na:1.8.0_181-google-v7]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_181-google-v7]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_181-google-v7]
	at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181-google-v7]

dzou avatar Mar 05 '20 19:03 dzou

@dzou Based on getCause() being 400/Bad Request I think it is a stretch to infer UNAUTHENTICATED as the root cause. However the JSON snippet in the status text does indicate an authentication issue and that could be potentially used to infer UNAUTHENTICATED if that is what we get in all such cases.

sanjaypujare avatar Mar 05 '20 21:03 sanjaypujare

I see, then forget about UNAUTHENTICATED; However I think 400/Bad Request should be an indication that the request should not be retried, no?

dzou avatar Mar 05 '20 21:03 dzou

That's debatable. Without more information I can see a "bad request" succeeding after retrying (some kind of a race condition or simply 400/Bad Request being the incorrect code returned from the server). So one could argue that it is safe to retry a bad request.

sanjaypujare avatar Mar 05 '20 21:03 sanjaypujare

Ah, so I am interpretting the 400 error-codes as an error being made on the caller's side - see this for what I had in mind: https://stackoverflow.com/questions/47680711/which-http-errors-should-never-trigger-an-automatic-retry

I was thinking basically that a 400-code would mean the user's request is incorrect (i.e. in this case making a request with a service account key that does not exist), and if we know the user's request is invalid then we need not retry it.

dzou avatar Mar 05 '20 21:03 dzou

Consider gRPC's retry feature as part of an advanced framework wrapping an existing protocol like HTTP (HTTP2 to be more precise). Some users would expect the framework to retry even in cases of 400/Bad Request just because the error code sounds like a catch-all generic error code. But I do see where you are coming from and if many others feel the same way I guess the change you want can be made. But it will have to be more than just mapping 400/Bad Request to UNAUTHENTICATED since UNAUTHENTICATED doesn't sound right.

sanjaypujare avatar Mar 05 '20 23:03 sanjaypujare

I agree that logic of onFailure inside GoogleAuthLibraryCallCredentials can be tweaked as follows: in case of IOException check if it is an com.google.api.client.http.HttpResponseException and if the status code says 401 (UNAUTHORIZED) then call applier.fail with Status.UNAUTHENTICATED similar to the current else clause. However that still requires modification in the google-auth-library to generate 401 instead of 400 in this case. Once that is done we can reopen this issue and someone can open a PR for GoogleAuthLibraryCallCredentials.

sanjaypujare avatar Mar 11 '20 18:03 sanjaypujare

@jiangtaoli2016, how do you think this should be fixed? It probably impacts all languages.

I think there are two things to discuss here

  1. Normally we auto-promote ServiceAccountCreds to ServiceAccountJwtAccessCredentials, which avoids exchanging a JWT for an OAuth token (via plain HTTP). It appears your system was not doing this exchange, probably because you specified scopes. If grpc-java ignored the scopes and auto-promoted in this case as well, it would have avoided your issue because the server would have responded with UNAUTHENTICATED. C core made this change earlier, but it was not made to grpc-java

  2. The status code to use when there is a failure getting an OAuth token. That used to be UNAUTHORIZED but was changed to UNAVAILABLE because the OAuth exchange can fail due to normal I/O problems. The HttpResponseException exception is very awkward for gRPC to dig into, as it is part of google-http-java-client which is an implementation detail of google-auth-library-java. It is unclear how this would be resolved; it would probably need enhancements to the google-auth-library-java API

A hacky "fix" for (2) is easy, but I don't think it puts us on solid ground and would just bite us in the future as an implementation change in google-auth-library-java could change the status code. And we'd need to figure out an equivalent solution in each language. Spending our time on (1) seems better, and while it does side-step the issue it also provides other benefits.

ejona86 avatar Mar 26 '20 04:03 ejona86

The token change from JWT server account creds to OAuth token is via https://accounts.google.com/o/oauth2/token.

@ejona86 Could you point the grpc core change that you refer? I think it makes sense to have all languages behave consistently.

jiangtaoli2016 avatar Mar 26 '20 06:03 jiangtaoli2016

@jiangtaoli2016, no, I can't point to the grpc core change. It was a long time ago and just mentioned to me "in person." I can't even point to the current behavior in c core.

ejona86 avatar Mar 26 '20 15:03 ejona86

Sorry for delay. I took some time to research cloud auth APIs and related gRPC codes.

When using service account key, one can either obtain an oauth2 access token from google token service (https://accounts.google.com/o/oauth2/token) and use the access token for Cloud APIs access or use the service account to create a JWT token and directly access Cloud APIs. The first method is ServiceAccountCredentials in Java or GoogleRefreshTokenCredentials in C++. The second method is ServiceAccountJwtAccessCredentials in Java or ServiceAccountJWTAccessCredentials. Unlike Java, gRPC core and wrapped language does not auto promote ServiceAccountCredentials into ServiceAccountJwtAccessCredentials. Both methods are valid and supported as of today. I don't think we could deprecate or should ServiceAccountCredentials.

In gRPC C++ GoogleRefreshTokenCredentials, AccessTokenCredentials, and StsCredentials, an access token is fetched remotely and attached to call metadata. In other words, it is a quite common scenario we have to deal with. @ejona86: it is better to deal with it rather than dodge this issue, as you suggested in (1) above.

In gRPC core implementation, to fetch an access token remotely, it uses grpc_httpcli_post to fetch token. By looking at grpc httpcli implementation, it does not seems retry. Basically, the fetching token is once. gRPC core does not parse response from HTTP request. Here is how it responds to http response. Any status code that is not 200 is treated as error. Unless success, IO error, auth error, or whatever error from token fetch will result in the same error message "Error occurred when fetching oauth2 token."

Back to (2) on @ejona86 suggestion, I agree it is hacky for gRPC code to parse into HttpResponseException exception from google-auth-library-java.

My suggestion would be doing something similar to c core: if fetching oauth2 token fails (due to various reasons), throw a generic error message such as "failed to fetch oauth2 token" and do not retry. Let application deal with it.

jiangtaoli2016 avatar Apr 19 '20 22:04 jiangtaoli2016

@jiangtaoli2016, grpc-java is not retrying. It is delivering an UNAVAILABLE error to the client. But the client (properly) has retry logic that retries UNAVAILABLE errors.

@jiangtaoli2016, how do you provide scopes for the oauth exchange in the C++ API? I thought scopes are required to obtain an oauth token.

ejona86 avatar Apr 21 '20 00:04 ejona86

@ejona86 I check all c core code. The only place that uses scopes is StsTokenFetcherCredentials.

jiangtaoli2016 avatar Apr 21 '20 15:04 jiangtaoli2016

@jiangtaoli2016, given that gRPC is not retrying and instead the application is retrying because gRPC returned UNAVAILABLE, are you suggesting gRPC continue returning UNAVAILABLE and just accept that the application will retry? The application will not have a good way of determining the HTTP failure was 400 Bad Request.

ejona86 avatar Jul 21 '20 16:07 ejona86

gRPC returning UNAVAILABLE is a fine choice for this particular use case. Downloading service account key is a bad choice, posing security risks to customers and to Google. Customers shall try to use alternative approaches, such as Application Default Credentials (ADC) or service account impersonation.

jiangtaoli2016 avatar Jul 21 '20 19:07 jiangtaoli2016

Hello, so to clarify, the position here is that this is working as intended?

I can accept this outcome because I don't feel strongly, but I would only say there are many legitimate use-cases for downloading service account keys still, such as using it in CI for integration tests against GCP services. From my experience based on user reports, the occurrence of this issue is still relatively often (though not very common).

dzou avatar Aug 31 '20 19:08 dzou

@jiangtaoli2016 @ejona86 My understanding is that the client libraries (GAPIC-based) retry automatically based on whether the code the gRPC returns is "retryable". UNAVAILABLE is clearly an error that should be retried. However, invalid credentials should not be retried.

Regarding the use of service account keys, agreed, it's not the best choice from a security point of view when other options like ADC or impersonation are available. However, there are certain APIs like Cloud Translation that do not work with ADC and explicitly recommend downloading and using service account keys.

meltsufin avatar Sep 21 '20 15:09 meltsufin

@jiangtaoli2016, what was the verdict here? Is there something to be changed, or is this working as intended and appropriate?

ejona86 avatar Jan 13 '21 00:01 ejona86

I would treat as working as intended.

@meltsufin Have you try to use Secure Token Service? That is a better alternative than service account key.

jiangtaoli2016 avatar Jan 13 '21 00:01 jiangtaoli2016

@jiangtaoli2016 I haven't used Secure Token Service, and I haven't seen it documented on product pages of any of the libraries. The issue is really not for any particular use case I might have, but for the users of our client libraries. The use of service account keys is recommended all over our documentation. Is this changing?

meltsufin avatar Jan 13 '21 02:01 meltsufin

@jiangtaoli2016, STS appears to be beta.

ejona86 avatar Jan 14 '21 00:01 ejona86

I just marked https://github.com/grpc/grpc-java/issues/8003 as a duplicate, but it did nicely paint out how the components fit together. It also referenced https://github.com/googleapis/gax-java/issues/965

ejona86 avatar Mar 25 '21 21:03 ejona86

Can we change the error code for this scenario to UNAUTHENTICATED?

In Google Ads we authenticate arbitrary users and the product does not support service account impersonation. When access tokens expire or are revoked our developers are getting stuck in a nasty retry loop (due to GAPIC retry policy as mentioned in googleapis/gax-java#965).

nwbirnie avatar Mar 31 '21 12:03 nwbirnie

As @ejona86 pointed out in https://github.com/grpc/grpc-java/issues/6808#issuecomment-604222507, my biggest concern is to have gRPC look into the HttpResponseException, which is an implementation detail of google-auth-library-java. Any chance to make an enhancement in upstream?

voidzcy avatar Mar 31 '21 18:03 voidzcy

Fixing the interface in upstream SGTM. Is there a bug opened already for this?

nwbirnie avatar Apr 06 '21 10:04 nwbirnie

Fixing the interface in upstream SGTM. Is there a bug opened already for this?

Nope, could you open it and link to this issue?

voidzcy avatar Apr 06 '21 17:04 voidzcy

A design is in-progress in goolge-auth-library-java to give gRPC the information it needs to choose the Status code appropriately (retriable vs non-retriable). The grpc-java changes are expected to be small. I'm very glad there's been recent movement here.

ejona86 avatar Sep 07 '21 20:09 ejona86

Is there any information about progress?

Gwen2 avatar Mar 31 '22 11:03 Gwen2

https://github.com/googleapis/google-auth-library-java/pull/750 provides a method to distinguish between the two cases. gRPC will need to be updated to consume the new Retryable interface:

  • Handle response as Unavailable when Auth library returns exception that implements the Retryable interface and Retryable.isRetryable() returns true
  • Handle response as Unauthenticated otherwise.

ejona86 avatar Mar 31 '22 14:03 ejona86

#9102 was reverted in #9118

ejona86 avatar May 16 '22 18:05 ejona86

@ejona86, the commit 372a535 from https://github.com/grpc/grpc-java/pull/10909 on master is tagged 1.63 and this issue is closed in the 1.63 milestone.

But looking at the 1.63 tarball or the released jar, the code changes from that commit aren't there.

I see now that this commit was partially reverted on the 1.63.x branch: https://github.com/grpc/grpc-java/pull/11056

Will this issue be reopened?

jacob-tolar-bridg avatar Apr 16 '24 21:04 jacob-tolar-bridg