grpc-java
GRPC+ALTS client connection exception with Workload Identity
What version of gRPC-Java are you using?
I am using the Google BOM in Maven.
<dependency>
  <groupId>com.google.cloud</groupId>
  <artifactId>libraries-bom</artifactId>
  <version>25.1.0</version>
  <type>pom</type>
  <scope>import</scope>
</dependency>
This contains gRPC 1.45.0.
What is your environment?
Docker container running on GKE in GCP. Image from google/cloud-sdk:latest. Java 17. Google Endpoints is on the classpath running in Jetty, FYI. Worth mentioning that GCP is using Workload Identity, so the service account is annotated so that it is managed by Google.
What did you expect to see?
I am testing gRPC+ALTS for connecting an old service to another service. I would expect the call to hit the server and return a result. I will add that if I use an unsecured ManagedChannelBuilder, the call works fine.
client (example):
var channel = AltsChannelBuilder.forAddress("my-service", 6565)
        .addTargetServiceAccount("expected-service-account")
        .build();
var myService = MyServiceGrpc.newBlockingStub(channel);
var result = myService.getValue(MyServiceInput.newBuilder().setKey("key").build());
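For comparison, here is roughly the unsecured variant that works (same hypothetical service name and port, just a plaintext channel built with io.grpc.ManagedChannelBuilder):

// Plaintext channel for comparison; this call succeeds, while the
// AltsChannelBuilder version above fails during the ALTS handshake.
var plaintextChannel = ManagedChannelBuilder.forAddress("my-service", 6565)
        .usePlaintext()
        .build();
var plaintextStub = MyServiceGrpc.newBlockingStub(plaintextChannel);
var plaintextResult = plaintextStub.getValue(MyServiceInput.newBuilder().setKey("key").build());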
What did you see instead?
I see an "unknown" exception thrown by the client, "io.grpc.StatusRuntimeException: UNKNOWN: Channel Pipeline: [TsiHandshakeHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]"
I spoke with "Eric Anderson" on Gitter. He suggested I file this as an issue.
Steps to reproduce the bug
Error: "io.grpc.StatusRuntimeException: UNKNOWN: Channel Pipeline: [TsiHandshakeHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]"
It seems the TsiHandshaker isn't handling an exception properly, as otherwise we wouldn't get the ugly UNKNOWN on the client side. In the server logs we see "StatusRuntimeException: UNAVAILABLE: io exception", which seems likely to mean the connection to the metadata server was severed, although why that would happen isn't immediately clear to me.
Seems like it is consistent, so it is probably worthwhile to set up a reproduction environment.
Oh, we see "Received a terminating error" in the log, which means this code is being triggered, and that confirms the failure is with the metadata server: https://github.com/grpc/grpc-java/blob/d4fa0ecc07495097453b0a2848765f076b9e714c/alts/src/main/java/io/grpc/alts/internal/AltsHandshakerStub.java#L112-L114
It also explains why I don't see the actual IOException that caused the failure: the code is written to only propagate strings, so it collected the exception's message and threw the exception itself away.
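To illustrate the pattern (a hypothetical sketch, not the actual handshaker code): once only the message string is kept, the original IOException and its cause chain are gone by the time the status reaches the client.

import java.io.IOException;

final class HandshakeErrorPatternSketch {
    // Hypothetical stand-in for the handshaker's I/O call that fails.
    static void doHandshakeIo() throws IOException {
        throw new IOException("connection to metadata server severed");
    }

    static void handshake() {
        try {
            doHandshakeIo();
        } catch (IOException e) {
            // Only the message string is propagated; the exception object,
            // its stack trace, and its cause are discarded here.
            throw new RuntimeException("Received a terminating error: " + e.getMessage());
        }
    }
}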
Any news on this? I would be happy to assist, since I'm the one with the broken environment.
Hi @wjohnst3
I think our first step is to fix the error message. I will help with that, and let's see what happens after that. In the meantime, I might be able to find some time to reproduce your case in GKE. I will follow up here if I need more details on your setup later.
Thank you!
Marking as a bug because the error message is uselessly poor. Once that is cleared up we can see where it goes.
And I think there are two bugs with the error details:
1. "io exception" always includes a cause in grpc-java; it seems a status description was copied from a stub error but not the cause.
2. UNKNOWN with the "Channel Pipeline:" error; this means the error wasn't propagated cleanly.
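For reference, a rough sketch of what keeping the cause attached could look like (hypothetical helper, not the actual fix; it only shows the io.grpc.Status API involved):

import java.io.IOException;
import io.grpc.Status;
import io.grpc.StatusRuntimeException;

final class StatusCauseSketch {
    // Sketch: attach the original exception as the Status cause so the client
    // stack trace shows what actually failed instead of a bare "io exception".
    static StatusRuntimeException toStatus(IOException e) {
        return Status.UNAVAILABLE
                .withDescription("ALTS handshake I/O failure: " + e.getMessage())
                .withCause(e)
                .asRuntimeException();
    }
}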
Hey, has anyone perhaps had a chance to look at this yet?
Update: the current ownership of the ALTS-related stuff is a bit unclear, and we (@matthewstevenson88 and I) raised our questions a couple of months ago, but unfortunately still haven't received a reply. As for this particular bug, I think I will take it over from @matthewstevenson88, and hopefully will have some time to work on it this quarter.
Hi Eric, can we add @erm-g to this issue as well? For some reason I couldn't add him to the assignee list. Thank you so much!
Hey, has anyone perhaps had a chance to look at this yet?
Hi @wjohnst3, we just merged a fix for the unclear error propagation. It'll be included with the next (1.60) release. Would you be able to get the latest and share the updated stack trace?
Closing since there is no response and we can't reproduce it without more info. Please reopen if needed.