akka-projection
GrpcServiceException: INTERNAL: RST_STREAM closed stream
When testing the Projection over gRPC samples in AWS behind an application load balancer (ALB ingress controller), the client fails with:
akka.grpc.GrpcServiceException: INTERNAL: RST_STREAM closed stream. HTTP/2 error code: INTERNAL_ERROR
client logs:
[2023-05-04 11:37:42,136] [DEBUG] [io.grpc.netty.shaded.io.grpc.netty.NettyClientHandler] [] [] [grpc-default-worker-ELG-1-2] - [id: 0x5670e347, L:/192.168.54.227:34182 - R:k8s-shopping-shopping-604179632a-148180922.us-east-2.elb.amazonaws.com/3.14.190.250:443] INBOUND RST_STREAM: streamId=5 errorCode=1
[2023-05-04 11:39:29,467] [DEBUG] [io.grpc.netty.shaded.io.grpc.netty.NettyClientHandler] [] [] [grpc-default-worker-ELG-1-2] - [id: 0x5670e347, L:/192.168.54.227:34182 - R:k8s-shopping-shopping-604179632a-148180922.us-east-2.elb.amazonaws.com/3.14.190.250:443] OUTBOUND PING: ack=false bytes=1111
[2023-05-04 11:39:29,569] [DEBUG] [io.grpc.netty.shaded.io.grpc.netty.NettyClientHandler] [] [] [grpc-default-worker-ELG-1-2] - [id: 0x5670e347, L:/192.168.54.227:34182 - R:k8s-shopping-shopping-604179632a-148180922.us-east-2.elb.amazonaws.com/3.14.190.250:443] INBOUND PING: ack=true bytes=1111
This tears down the connections and the projections are restarted.
There is nothing in the server logs that explains it.
It happens once per minute, which matches the idle timeout of the load balancer.
Tried on the server, without success:
akka.http.server.http2.ping-interval=10s
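For context, the server-side setting tried above in full HOCON form, together with the related `ping-timeout` setting (a sketch; the 10s value is from this report, the `ping-timeout` value is an assumption, not something that was tried):

```hocon
akka.http.server.http2 {
  # Emit HTTP/2 PING frames on idle connections. Tried here without effect,
  # presumably because the ALB does not forward pings to the client side.
  ping-interval = 10s
  # How long to wait for a PING ack before tearing down the connection
  # (illustrative value, not from the original report).
  ping-timeout = 5s
}
```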
In the client we have these channelBuilderOverrides:
_.keepAliveWithoutCalls(true)
.keepAliveTime(10, TimeUnit.SECONDS)
.keepAliveTimeout(5, TimeUnit.SECONDS)
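For reference, those overrides in context, as a minimal sketch of the client settings wiring. The host is a placeholder and the surrounding method is illustrative; only the three `keepAlive*` calls are what was actually tried:

```scala
import java.util.concurrent.TimeUnit

import akka.actor.typed.ActorSystem
import akka.grpc.GrpcClientSettings

// Sketch: GrpcClientSettings with the io.grpc NettyChannelBuilder keepalive
// options from this report. "my-alb.example.com" is a placeholder for the
// ALB hostname.
def clientSettings(implicit system: ActorSystem[_]): GrpcClientSettings =
  GrpcClientSettings
    .connectToServiceAt("my-alb.example.com", 443)
    .withTls(true)
    .withChannelBuilderOverrides(
      _.keepAliveWithoutCalls(true)            // ping even with no active RPC
        .keepAliveTime(10, TimeUnit.SECONDS)   // send a keepalive ping every 10s
        .keepAliveTimeout(5, TimeUnit.SECONDS) // fail if no ack within 5s
    )
```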
One way to work around the problem is to periodically update the consumer filter, i.e. send a request from the client. If this is a problem that we need to solve, we could automatically emit keep-alive messages from the GrpcReadJournal.
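The periodic-filter-update workaround could be sketched like this, assuming the `ConsumerFilter` extension from akka-projection-grpc; the `streamId`, the 30-second interval, and re-sending an empty criteria list are all illustrative assumptions, not a vetted approach:

```scala
import scala.concurrent.duration._

import akka.actor.typed.ActorSystem
import akka.projection.grpc.consumer.ConsumerFilter

// Sketch of the workaround: periodically send an (unchanged) filter update so
// that a request flows from client to server before the ALB idle timeout hits.
def scheduleFilterKeepAlive(streamId: String)(implicit system: ActorSystem[_]): Unit =
  system.scheduler.scheduleWithFixedDelay(30.seconds, 30.seconds) { () =>
    ConsumerFilter(system).ref ! ConsumerFilter.UpdateFilter(streamId, Nil)
  }(system.executionContext)
```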
Maybe related https://stackoverflow.com/questions/66818645/http2-ping-frames-over-aws-alb-grpc-keepalive-ping
I strongly suspect the issue is specific to ALB: it does not pass HTTP/2 pings through to the client, does not seem to care about pings from the server to the ALB, and signals timeouts in a weird way, sending RST_STREAM with a protocol error to the client (which, as far as I understand, is supposed to indicate that client or server sent invalid HTTP/2).
A possible workaround could be to tune the ALB config to a much longer idle timeout.
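With the AWS Load Balancer Controller (the ALB ingress controller mentioned above), the idle timeout can be raised per Ingress via a load-balancer-attributes annotation. A sketch; the name and the 600s value are examples, not recommendations:

```yaml
# Sketch: raise the ALB idle timeout well above the expected quiet periods
# on the event stream (default is 60s, which matches the once-per-minute
# resets seen here).
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: shopping-grpc
  annotations:
    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=600
```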
It may still be worth working around this with gRPC message-level keepalives, but as far as I know we have not seen this from any load balancers/proxies other than ALB, so an upstream fix of some sort also sounds like it would make sense. I couldn't find any public issue tracker for ALB/ELB to look for existing reports.