akka-projection icon indicating copy to clipboard operation
akka-projection copied to clipboard

GrpcServiceException: INTERNAL: RST_STREAM closed stream

Open patriknw opened this issue 2 years ago • 1 comments

When testing Projection over gRPC samples in AWS with application layer load balancer (alb ingress controller)

akka.grpc.GrpcServiceException: INTERNAL: RST_STREAM closed stream. HTTP/2 error code: INTERNAL_ERROR

client logs:

[2023-05-04 11:37:42,136] [DEBUG] [io.grpc.netty.shaded.io.grpc.netty.NettyClientHandler] [] [] [grpc-default-worker-ELG-1-2] - [id: 0x5670e347, L:/192.168.54.227:34182 - R:k8s-shopping-shopping-604179632a-148180922.us-east-2.elb.amazonaws.com/3.14.190.250:443] INBOUND RST_STREAM: streamId=5 errorCode=1

[2023-05-04 11:39:29,467] [DEBUG] [io.grpc.netty.shaded.io.grpc.netty.NettyClientHandler] [] [] [grpc-default-worker-ELG-1-2] - [id: 0x5670e347, L:/192.168.54.227:34182 - R:k8s-shopping-shopping-604179632a-148180922.us-east-2.elb.amazonaws.com/3.14.190.250:443] OUTBOUND PING: ack=false bytes=1111
[2023-05-04 11:39:29,569] [DEBUG] [io.grpc.netty.shaded.io.grpc.netty.NettyClientHandler] [] [] [grpc-default-worker-ELG-1-2] - [id: 0x5670e347, L:/192.168.54.227:34182 - R:k8s-shopping-shopping-604179632a-148180922.us-east-2.elb.amazonaws.com/3.14.190.250:443] INBOUND PING: ack=true bytes=1111

This tear down the connections and the projections are restarted.

nothing that can explain it in the server logs

Happens once per minute and that is the idle timeout for the load balancer.

Tried on server without success:

akka.http.server.http2.ping-interval=10s

In client we have channelBuilderOverrides

    _.keepAliveWithoutCalls(true)
      .keepAliveTime(10, TimeUnit.SECONDS)
      .keepAliveTimeout(5, TimeUnit.SECONDS)

One way to work around the problem is to periodically update the consumer filter, i.e. request from client. If this is a problem that we need to solve we could automatically emit keep alive messages from the GrpcReadJournal.

Maybe related https://stackoverflow.com/questions/66818645/http2-ping-frames-over-aws-alb-grpc-keepalive-ping

patriknw avatar May 04 '23 13:05 patriknw

I highly suspect the issue is specific to ALB which does not pass through HTTP/2 pings, and does not seem to care about them from the server to ALB, and does signal timeouts in a weird way by RST_STREAM with protocol_error to the client (which is supposed to be about client/server talking invalid HTTP/2 as far as I understand it).

Looks like a possible workaround could be possible to tune the ALB config to a much longer idle timeout.

May still be worth working around with gRPC message level keepalives, but we have not seen this from any other load balancers/proxies ALB as far as I know, so an upstream fix of some sort also sounds like it would make sense. I couldn't find any public issue tracker for ALB/ELB to look for existing reports.

johanandren avatar May 04 '23 14:05 johanandren