grpc-java

GRPC+ALTS client connection exception with Workload Identity

Open wjohnst3 opened this issue 3 years ago • 8 comments

What version of gRPC-Java are you using?

I am using the Google BOM in Maven.

            <dependency>
                <groupId>com.google.cloud</groupId>
                <artifactId>libraries-bom</artifactId>
                <version>25.1.0</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>

This contains gRPC 1.45.0.

What is your environment?

Docker container running on GKE in GCP. Image from google/cloud-sdk:latest. Java 17. Google Endpoints is on the classpath, running in Jetty, FYI. Worth mentioning that GCP is using Workload Identity, so the service account is annotated so as to be managed by Google.

What did you expect to see?

I am testing the use of GRPC+ALTS for use in connecting an old service to another service. I would expect to see the call hit the server and return a result. I will add that if I use an unsecured ManagedChannelBuilder then the call works fine.

client (example):

 var channel = AltsChannelBuilder.forAddress("my-service", 6565)
     .addTargetServiceAccount("expected-service-account")
     .build();
 var myService = MyServiceGrpc.newBlockingStub(channel);
 var result = myService.getValue(MyServiceInput.newBuilder().setKey("key").build());

What did you see instead?

I see an "unknown" exception thrown by the client, "io.grpc.StatusRuntimeException: UNKNOWN: Channel Pipeline: [TsiHandshakeHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]"

I spoke with "Eric Anderson" on Gitter. He suggested I file this as an issue.

Steps to reproduce the bug

Error: "io.grpc.StatusRuntimeException: UNKNOWN: Channel Pipeline: [TsiHandshakeHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]"

wjohnst3 avatar Apr 05 '22 10:04 wjohnst3

It seems the TsiHandshaker isn't handling an exception properly, as otherwise we wouldn't get the ugly UNKNOWN on client-side. In the server logs we see "StatusRuntimeException: UNAVAILABLE: io exception", which seems likely to mean the connection to the metadata server was severed. Although why that would happen isn't immediately clear to me.

Seems like it is consistent, so it is probably worthwhile to set up a reproduction environment.

ejona86 avatar Apr 05 '22 15:04 ejona86
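Editorial sketch: the suspicion above, that the connection to the metadata server was severed, can be checked from inside the affected pod with a plain TCP connect. The snippet below is a minimal diagnostic, not part of grpc-java. The default endpoint `metadata.google.internal:8080` is an assumption about where the ALTS handshaker service listens, and the class and method names are hypothetical. A successful connect only shows that the port is reachable; it does not prove that the ALTS handshake itself would succeed.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class HandshakerReachabilityCheck {

    /** Returns true if a TCP connection to host:port opens within timeoutMillis. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers refused connections, timeouts, and unresolvable host names.
            System.err.println("Connect to " + host + ":" + port + " failed: " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed endpoint; pass host and port as arguments if your environment differs.
        String host = args.length > 0 ? args[0] : "metadata.google.internal";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 8080;
        System.out.println(host + ":" + port + " reachable: " + canConnect(host, port, 2000));
    }
}
```

Running this in the same container as the failing client, with the same Workload Identity configuration, would help separate a network-level problem from a handshake-level one.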

Oh, we see "Received a terminating error" in the log, which means this code is being triggered, and that confirms the failure is with the metadata server: https://github.com/grpc/grpc-java/blob/d4fa0ecc07495097453b0a2848765f076b9e714c/alts/src/main/java/io/grpc/alts/internal/AltsHandshakerStub.java#L112-L114

Also, it explains why I don't see the actual IOException that caused the failure, because the code is written to only propagate strings and it only collected the message of the exception before throwing it away.

ejona86 avatar Apr 05 '22 15:04 ejona86
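Editorial sketch: the effect described above, where only the exception's message is collected and the exception itself is thrown away, can be illustrated with plain Java exceptions. This is a simplified analogue, not the actual `AltsHandshakerStub` code; the class name, method names, and the sample `IOException` are hypothetical. In grpc-java itself, the comparable way to keep the original exception attached locally is `Status.withCause(Throwable)`.

```java
import java.io.IOException;

public class CausePropagationSketch {

    /** Lossy pattern: only the message string survives, the original exception is dropped. */
    static RuntimeException messageOnly(Exception original) {
        return new RuntimeException(original.getMessage());
    }

    /** Preserving pattern: the original exception stays attached as the cause. */
    static RuntimeException withCause(Exception original) {
        return new RuntimeException(original.getMessage(), original);
    }

    public static void main(String[] args) {
        // Hypothetical failure standing in for the real I/O error.
        Exception original = new IOException("Connection reset by peer");

        RuntimeException lossy = messageOnly(original);
        RuntimeException kept = withCause(original);

        // With the lossy pattern there is no cause left to inspect or log.
        System.out.println("message-only cause: " + lossy.getCause());
        // With the preserving pattern the original type and stack trace remain available.
        System.out.println("with-cause cause:   " + kept.getCause());
    }
}
```

With the message-only pattern, the exception type and stack trace of the underlying failure are gone by the time anything is logged, which matches the missing `IOException` described in the comment above.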

Any news on this? I would be happy to assist, since I'm the one with the broken environment.

wjohnst3 avatar Apr 22 '22 12:04 wjohnst3

Hi @wjohnst3

I think our first step is to fix the error message. I will help with that, and we can see what happens afterwards. In the meantime, I might be able to find some time to reproduce your case in GKE. I will follow up here if I need more details on your setup later.

Thank you!

ZhenLian avatar Apr 22 '22 18:04 ZhenLian

Marking as a bug because the error message is uselessly poor. Once that is cleared up we can see where it goes.

ejona86 avatar Apr 22 '22 18:04 ejona86

And I think there are two bugs with the error details. 1) "io exception" always includes a cause in grpc-java; it seems a status description was copied from a stub error but not the cause. 2) UNKNOWN with the "Channel Pipeline:" error; this means the error wasn't propagated cleanly.

ejona86 avatar Apr 22 '22 19:04 ejona86

Hey, has anyone perhaps had a chance to look at this yet?

wjohnst3 avatar Jun 27 '22 10:06 wjohnst3

Update: the current ownership of the ALTS-related code is a bit unclear. @matthewstevenson88 and I raised our questions a couple of months ago, but unfortunately we still haven't received a reply. As for this particular bug, I think I will take it over from @matthewstevenson88, and hopefully I will have some time to work on it this quarter.

ZhenLian avatar Jul 06 '22 22:07 ZhenLian

Hi Eric, can we add @erm-g to this issue as well? For some reason I couldn't add him to the assignee list. Thank you so much!

ZhenLian avatar Dec 01 '22 00:12 ZhenLian

> Hey, has anyone perhaps had a chance to look at this yet?

Hi @wjohnst3, we just merged a fix for the unclear error propagation. It'll be included in the next (1.60) release. Would you be able to get the latest version and share the updated stack trace?

erm-g avatar Nov 03 '23 15:11 erm-g

Closing since there is no response and we can't reproduce it without more info. Please reopen if needed.

erm-g avatar Feb 08 '24 17:02 erm-g