
Error: 14 UNAVAILABLE: Stream refused by server

lwmxiaobei opened this issue 4 years ago · 18 comments

Problem description

Hi, when the request volume is large, we see several errors like this:

Error: 14 UNAVAILABLE: Stream refused by server

It is difficult to reproduce in my local environment.

Environment

  • OS name, version and architecture: Alpine Linux 3.12
  • Node version: v14.18.1
  • @grpc/grpc-js: ^1.3.1
  • @grpc/proto-loader: ^0.5.6

Additional context

Server start code:

// The bindAsync callback resolves/rejects a surrounding Promise
return new Promise((resolve, reject) => {
    this.server.bindAsync(address, this._credentials, (error, port) => {
        if (error) {
            this.logger.error(error.message, port);
            return reject(error);
        }
        this.server.start();
        this.logger.debug(`server started on address: ${address}`);
        resolve(true);
    });
});
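
For context, the grpc-js Server constructor also accepts channel options; here is a minimal sketch (not my actual code) of where a per-connection stream cap would be configured, assuming the 'grpc.max_concurrent_streams' option is honored by the server:

const grpc = require('@grpc/grpc-js');

// Illustrative only: raise the per-connection concurrent stream limit.
// Streams beyond this limit are refused at the HTTP/2 level.
const server = new grpc.Server({
    'grpc.max_concurrent_streams': 1000, // placeholder value
});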

lwmxiaobei, Nov 16 '21

The server may refuse to serve new requests if it is already handling too many other requests. This is normal behavior. Your client code will need to be able to recover from that error.
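
As a minimal sketch (not an official API), client-side recovery for a unary call could look like retrying a bounded number of times when the status is UNAVAILABLE; someUnaryMethod is a placeholder for the real generated method:

const grpc = require('@grpc/grpc-js');

// Hypothetical helper: retry a unary call when it fails with UNAVAILABLE
// (code 14), backing off briefly between attempts.
function callWithRetry(client, request, maxAttempts = 3, attempt = 1) {
    return new Promise((resolve, reject) => {
        client.someUnaryMethod(request, (err, response) => { // placeholder method
            if (!err) {
                return resolve(response);
            }
            if (err.code === grpc.status.UNAVAILABLE && attempt < maxAttempts) {
                const backoffMs = 100 * 2 ** (attempt - 1);
                return setTimeout(() => {
                    callWithRetry(client, request, maxAttempts, attempt + 1).then(resolve, reject);
                }, backoffMs);
            }
            reject(err);
        });
    });
}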

murgatroid99, Nov 16 '21

We're experiencing something similar in our code; the general way to reproduce appears to be this:

  • Find a slow request you can reliably trigger. Typically it should be slow enough to time out against its own deadline, but it may only need to be slow enough to push the remaining pending streams far enough into the future that they start being processed near their deadlines.
  • Make multiple requests at the same time, with the slow request first.
  • Wait for the first request to hit a DEADLINE_EXCEEDED error due to the timeout.
  • Now watch the remaining requests assigned to the same channel fail with Stream refused by server errors, typically around their own deadlines (even requests that would normally get a fast response); they are also not sent to the grpc service and end up failing entirely on the node client.
  • If there are enough pending requests, this can block the affected channel for quite some time and cause even more instances of the Stream refused by server errors.

Generally this recovers by itself given enough time, but if requests keep coming fast enough that the channel is constantly full, it can become a real issue. One partial mitigation that seems to help is making sure we spawn additional channels that can handle additional streams while the busy channel recovers.
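
A rough sketch of that workaround (illustrative only; it assumes the 'grpc.use_local_subchannel_pool' channel option, which should keep each client on its own connection instead of the shared subchannel pool):

const grpc = require('@grpc/grpc-js');

// Illustrative only: keep a small pool of clients, each on its own channel,
// and rotate between them so one stalled connection does not block everything.
function createClientPool(ClientCtor, target, credentials, size = 3) {
    const clients = [];
    for (let i = 0; i < size; i++) {
        clients.push(new ClientCtor(target, credentials, {
            'grpc.use_local_subchannel_pool': 1, // assumed option; gives each client its own subchannel pool
        }));
    }
    let next = 0;
    return () => clients[next++ % clients.length]; // round-robin picker
}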

We ran this patch in production for a while (in a Node 12 environment; I know it won't work on Node 14 like the OP's, per the comments) and it does help: it surfaces the errors more quickly and greatly reduces the occurrence of the Stream refused by server errors.

Any other suggestions, @murgatroid99?

thegleb, Dec 15 '21

I would like to fix this, but I don't understand what might cause it, so I have some questions:

  1. What gRPC server implementation is your client communicating with when it experiences these problems?
  2. You say "they are also not sent to the grpc service and end up failing entirely on the node client". It is possible for the client to send the request over the network and receive the error over the network without the application-level service code ever seeing it. So, what is your specific observation here, that the service code is not called or that the request is not sent over the network?
  3. If you have a self-contained reproduction of this, can you share it? Otherwise, can you share trace logs from the client (set environment variables GRPC_TRACE=all and GRPC_VERBOSITY=DEBUG) and the server (probably the same, depending on the server language) of an interaction like you describe? In addition, if you can reproduce this without TLS enabled, could you also share a TCP dump of that interaction?

murgatroid99, Dec 15 '21

  1. Our backend uses grpc-java with Netty, which, as I understand it, has no specific "max number of streams" setting currently implemented (that was my original theory: requests get stuck in some queue and gradually get aborted). Also, the requests are fully unary (no streaming).
  2. We simply have no record of the grpc server receiving and processing these requests, and thus the Stream refused by server errors appear to be thrown exclusively by the client. The earlier streams typically get aborted with "Stream refused..." before their timeouts are reached, but the majority of them end up erroring out around their timeout values. Once all the pending streams reach their timeouts and get marked as refused, we observe the number of streams per channel quickly wind back down to 0, and the client/server pair becomes responsive again.
  3. I am attempting to reproduce with modified versions of the grpc-js client/grpc-java server examples to bridge the gap between our production setup and the toy environment. So far I have not been able to reproduce the code 14 errors, only the expected DEADLINE_EXCEEDED errors as well as occasional instances of "call cancelled" (which we also see, but only at a fairly low frequency). I will update this thread if I can figure out a way to trigger these on demand outside of our production environments 🙂

thegleb, Dec 16 '21

OK, we can continue this when you have a way to reproduce the error in isolation.

We simply have no record of the grpc server receiving and processing these requests, and thus the "Stream refused by.." errors appear to be thrown exclusively by the client.

If I am understanding correctly, your observation is that your service handling code never receives or processes these requests. It may still be the case that the gRPC server library rejects the request without ever passing it along to the service handling code, and in fact that seems likely because "Stream refused by server" is caused by an HTTP/2 RST_STREAM with error code 7, which is generally sent by the server, and the spec requires that a server that sends that response does not provide that request to the server application for processing. If you have a specific reason to believe that is not the case, and that the gRPC server library never even sees the request, that would be important to know here.

murgatroid99, Dec 16 '21

I also want to mention that I will be off for the rest of the year. I will continue to look into this in January, but I do not expect to respond again until then.

murgatroid99, Dec 16 '21

I also took some time off, but I am now dedicating more time to this. So far I have had no luck reproducing it outside of our stack, even after changing the grpc-java example server to provide mock slow/fast responses on command.

I was able to grab the grpc-js debug chatter using our production stack. You are correct about error code 7 happening. Additionally, any requests triggered after maybe 10-20 minutes should be rejected due to token expiration, but almost an hour later they continue to fail with the same errors, pretty much confirming that "It may still be the case that the gRPC server library rejects the request without ever passing it along to the service handling code", because the expired token never ends up getting checked by the service.

Test case and logs below (I obfuscated some internal identifiers and headers). Any suggestions of what I should look into next?

Test case

  1. Set client-side timeout to 10 seconds.
  2. Trigger 10 fast queries, observe successful response for all 10 within ~5-6s (as expected).
  3. Trigger 10 fast queries 2 more times, observe successful response for all in ~150ms (cached responses). This verifies that there is nothing blocking responses from the server.
  4. Trigger 1 slow query. The actual query should time out at the server level after a few minutes.
  5. Wait 7 seconds. In the first few tries I could trigger the slow query and the fast queries one after another, and the slow query would time out as expected while the fast queries resolved as expected. Adding this delay seems to trigger the issue more consistently.
  6. Trigger 10 more fast queries which should return within ~150ms because the queries are cached.
  7. Observe 1x DEADLINE_EXCEEDED from the slow query in step 4 and 10x UNAVAILABLE/Stream refused from the fast queries, all returning at the same time. Latency is basically the timeout minus the elapsed time since the slow request was triggered.
  8. Wait some time (~11 minutes), then trigger 10 more fast queries which should return fast; observe 1x DEADLINE_EXCEEDED and 9x UNAVAILABLE/Stream refused errors. Trying again after the weekend, to allow any active connections to be fully closed, gives the same result.
  9. Essentially the app is unable to communicate again until a restart and reconnect. (A minimal client sketch of this sequence follows the list.)
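
Minimal client sketch of the sequence above (the proto path, service, method, and target are placeholders; the slow/fast behavior lives in the server):

const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

// Placeholder proto/service/method names, just to illustrate the call pattern.
const packageDef = protoLoader.loadSync('service.proto');
const proto = grpc.loadPackageDefinition(packageDef);
const client = new proto.mypackage.MyService('127.0.0.1:50051', grpc.credentials.createInsecure());

function call(request, label) {
    const deadline = Date.now() + 10000; // step 1: 10 second client-side timeout
    client.query(request, { deadline }, (err, res) => {
        console.log(new Date().toISOString(), label, err ? `${err.code} ${err.details}` : 'ok');
    });
}

call({ kind: 'slow' }, 'slow');              // step 4: one slow query
setTimeout(() => {                           // step 5: wait ~7 seconds
    for (let i = 0; i < 10; i++) {
        call({ kind: 'fast' }, `fast-${i}`); // step 6: ten fast queries
    }
}, 7000);
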
User-facing logs for steps 4-7; step 8 is identical except for 1 timeout + 9 streams refused:

2022-01-08T03:36:34.200Z - start 1 slow request
2022-01-08T03:36:41.551Z - start 10 fast requests
2022-01-08T03:36:44.281Z - GraphQL error: 4 DEADLINE_EXCEEDED: Deadline exceeded after 10.081 seconds
2022-01-08T03:36:44.285Z - GraphQL error: 14 UNAVAILABLE: Stream refused by server after 2.734 seconds
2022-01-08T03:36:44.288Z - GraphQL error: 14 UNAVAILABLE: Stream refused by server after 2.736 seconds
2022-01-08T03:36:44.291Z - GraphQL error: 14 UNAVAILABLE: Stream refused by server after 2.738 seconds
2022-01-08T03:36:44.294Z - GraphQL error: 14 UNAVAILABLE: Stream refused by server after 2.74 seconds
2022-01-08T03:36:44.298Z - GraphQL error: 14 UNAVAILABLE: Stream refused by server after 2.743 seconds
2022-01-08T03:36:44.300Z - GraphQL error: 14 UNAVAILABLE: Stream refused by server after 2.744 seconds
2022-01-08T03:36:44.303Z - GraphQL error: 14 UNAVAILABLE: Stream refused by server after 2.746 seconds
2022-01-08T03:36:44.305Z - GraphQL error: 14 UNAVAILABLE: Stream refused by server after 2.747 seconds
2022-01-08T03:36:44.308Z - GraphQL error: 14 UNAVAILABLE: Stream refused by server after 2.749 seconds
2022-01-08T03:36:44.311Z - GraphQL error: 14 UNAVAILABLE: Stream refused by server after 2.751 seconds
grpc-js chatter for steps 4-7:

2022-01-08T03:36:34.213Z | channel | dns:127.0.0.1:xxxx createCall [30] method="/grpcMethod", deadline=1641613004213
2022-01-08T03:36:34.213Z | call_stream | [30] Sending metadata
2022-01-08T03:36:34.213Z | channel | Pick result: COMPLETE subchannel: 127.0.0.1:xxxx status: undefined undefined
2022-01-08T03:36:34.214Z | call_stream | [30] write() called with message of length 3194
2022-01-08T03:36:34.214Z | call_stream | [30] end() called
2022-01-08T03:36:34.214Z | call_stream | Starting stream on subchannel 127.0.0.1:xxxx with headers
		xxxx
		grpc-timeout: 9999m
		grpc-accept-encoding: identity,deflate,gzip
		accept-encoding: identity
		:authority: 127.0.0.1:xxxx
		user-agent: grpc-node-js/1.3.7
		content-type: application/grpc
		:method: POST
		:path: /grpcMethod
		te: trailers

2022-01-08T03:36:34.239Z | call_stream | [30] attachHttp2Stream from subchannel 127.0.0.1:xxxx
2022-01-08T03:36:34.239Z | subchannel_refcount | 127.0.0.1:xxxx callRefcount 0 -> 1
2022-01-08T03:36:34.239Z | call_stream | [30] sending data chunk of length 3199
2022-01-08T03:36:34.239Z | call_stream | [30] calling end() on HTTP/2 stream
2022-01-08T03:36:41.607Z | channel | dns:127.0.0.1:xxxx createCall [31] method="/grpcMethod", deadline=1641613011607
2022-01-08T03:36:41.607Z | call_stream | [31] Sending metadata
2022-01-08T03:36:41.607Z | channel | Pick result: COMPLETE subchannel: 127.0.0.1:xxxx status: undefined undefined
2022-01-08T03:36:41.608Z | call_stream | [31] write() called with message of length 72
2022-01-08T03:36:41.608Z | call_stream | [31] end() called
2022-01-08T03:36:41.608Z | call_stream | Starting stream on subchannel 127.0.0.1:xxxx with headers
		xxxx
		grpc-timeout: 9999m
		grpc-accept-encoding: identity,deflate,gzip
		accept-encoding: identity
		:authority: 127.0.0.1:xxxx
		user-agent: grpc-node-js/1.3.7
		content-type: application/grpc
		:method: POST
		:path: /grpcMethod
		te: trailers

2022-01-08T03:36:41.608Z | call_stream | [31] attachHttp2Stream from subchannel 127.0.0.1:xxxx
2022-01-08T03:36:41.608Z | subchannel_refcount | 127.0.0.1:xxxx callRefcount 1 -> 2
2022-01-08T03:36:41.608Z | call_stream | [31] sending data chunk of length 77
2022-01-08T03:36:41.608Z | call_stream | [31] calling end() on HTTP/2 stream
2022-01-08T03:36:41.608Z | channel | dns:127.0.0.1:xxxx createCall [32] method="/grpcMethod", deadline=1641613011608
2022-01-08T03:36:41.608Z | call_stream | [32] Sending metadata
2022-01-08T03:36:41.608Z | channel | Pick result: COMPLETE subchannel: 127.0.0.1:xxxx status: undefined undefined
2022-01-08T03:36:41.609Z | call_stream | [32] write() called with message of length 72
2022-01-08T03:36:41.609Z | call_stream | [32] end() called
2022-01-08T03:36:41.609Z | call_stream | Starting stream on subchannel 127.0.0.1:xxxx with headers
		xxxx
		grpc-timeout: 9999m
		grpc-accept-encoding: identity,deflate,gzip
		accept-encoding: identity
		:authority: 127.0.0.1:xxxx
		user-agent: grpc-node-js/1.3.7
		content-type: application/grpc
		:method: POST
		:path: /grpcMethod
		te: trailers

2022-01-08T03:36:41.609Z | call_stream | [32] attachHttp2Stream from subchannel 127.0.0.1:xxxx
2022-01-08T03:36:41.609Z | subchannel_refcount | 127.0.0.1:xxxx callRefcount 2 -> 3
2022-01-08T03:36:41.609Z | call_stream | [32] sending data chunk of length 77
2022-01-08T03:36:41.609Z | call_stream | [32] calling end() on HTTP/2 stream
2022-01-08T03:36:41.662Z | channel | dns:127.0.0.1:xxxx createCall [33] method="/grpcMethod", deadline=1641613011662
2022-01-08T03:36:41.663Z | call_stream | [33] Sending metadata
2022-01-08T03:36:41.663Z | channel | Pick result: COMPLETE subchannel: 127.0.0.1:xxxx status: undefined undefined
2022-01-08T03:36:41.663Z | call_stream | [33] write() called with message of length 72
2022-01-08T03:36:41.663Z | call_stream | [33] end() called
2022-01-08T03:36:41.663Z | call_stream | Starting stream on subchannel 127.0.0.1:xxxx with headers
		xxxx
		grpc-timeout: 9999m
		grpc-accept-encoding: identity,deflate,gzip
		accept-encoding: identity
		:authority: 127.0.0.1:xxxx
		user-agent: grpc-node-js/1.3.7
		content-type: application/grpc
		:method: POST
		:path: /grpcMethod
		te: trailers

2022-01-08T03:36:41.663Z | call_stream | [33] attachHttp2Stream from subchannel 127.0.0.1:xxxx
2022-01-08T03:36:41.663Z | subchannel_refcount | 127.0.0.1:xxxx callRefcount 3 -> 4
2022-01-08T03:36:41.663Z | call_stream | [33] sending data chunk of length 77
2022-01-08T03:36:41.663Z | call_stream | [33] calling end() on HTTP/2 stream
2022-01-08T03:36:41.663Z | channel | dns:127.0.0.1:xxxx createCall [34] method="/grpcMethod", deadline=1641613011663
2022-01-08T03:36:41.663Z | call_stream | [34] Sending metadata
2022-01-08T03:36:41.663Z | channel | Pick result: COMPLETE subchannel: 127.0.0.1:xxxx status: undefined undefined
2022-01-08T03:36:41.663Z | call_stream | [34] write() called with message of length 72
2022-01-08T03:36:41.663Z | call_stream | [34] end() called
2022-01-08T03:36:41.664Z | call_stream | Starting stream on subchannel 127.0.0.1:xxxx with headers
		xxxx
		grpc-timeout: 10000m
		grpc-accept-encoding: identity,deflate,gzip
		accept-encoding: identity
		:authority: 127.0.0.1:xxxx
		user-agent: grpc-node-js/1.3.7
		content-type: application/grpc
		:method: POST
		:path: /grpcMethod
		te: trailers

2022-01-08T03:36:41.664Z | call_stream | [34] attachHttp2Stream from subchannel 127.0.0.1:xxxx
2022-01-08T03:36:41.664Z | subchannel_refcount | 127.0.0.1:xxxx callRefcount 4 -> 5
2022-01-08T03:36:41.664Z | call_stream | [34] sending data chunk of length 77
2022-01-08T03:36:41.664Z | call_stream | [34] calling end() on HTTP/2 stream
2022-01-08T03:36:41.664Z | channel | dns:127.0.0.1:xxxx createCall [35] method="/grpcMethod", deadline=1641613011664
2022-01-08T03:36:41.664Z | call_stream | [35] Sending metadata
2022-01-08T03:36:41.664Z | channel | Pick result: COMPLETE subchannel: 127.0.0.1:xxxx status: undefined undefined
2022-01-08T03:36:41.664Z | call_stream | [35] write() called with message of length 72
2022-01-08T03:36:41.664Z | call_stream | [35] end() called
2022-01-08T03:36:41.664Z | call_stream | Starting stream on subchannel 127.0.0.1:xxxx with headers
		xxxx
		grpc-timeout: 10000m
		grpc-accept-encoding: identity,deflate,gzip
		accept-encoding: identity
		:authority: 127.0.0.1:xxxx
		user-agent: grpc-node-js/1.3.7
		content-type: application/grpc
		:method: POST
		:path: /grpcMethod
		te: trailers

2022-01-08T03:36:41.664Z | call_stream | [35] attachHttp2Stream from subchannel 127.0.0.1:xxxx
2022-01-08T03:36:41.664Z | subchannel_refcount | 127.0.0.1:xxxx callRefcount 5 -> 6
2022-01-08T03:36:41.664Z | call_stream | [35] sending data chunk of length 77
2022-01-08T03:36:41.664Z | call_stream | [35] calling end() on HTTP/2 stream
2022-01-08T03:36:41.664Z | channel | dns:127.0.0.1:xxxx createCall [36] method="/grpcMethod", deadline=1641613011664
2022-01-08T03:36:41.665Z | call_stream | [36] Sending metadata
2022-01-08T03:36:41.665Z | channel | Pick result: COMPLETE subchannel: 127.0.0.1:xxxx status: undefined undefined
2022-01-08T03:36:41.665Z | call_stream | [36] write() called with message of length 72
2022-01-08T03:36:41.665Z | call_stream | [36] end() called
2022-01-08T03:36:41.665Z | call_stream | Starting stream on subchannel 127.0.0.1:xxxx with headers
		xxxx
		grpc-timeout: 9999m
		grpc-accept-encoding: identity,deflate,gzip
		accept-encoding: identity
		:authority: 127.0.0.1:xxxx
		user-agent: grpc-node-js/1.3.7
		content-type: application/grpc
		:method: POST
		:path: /grpcMethod
		te: trailers

2022-01-08T03:36:41.665Z | call_stream | [36] attachHttp2Stream from subchannel 127.0.0.1:xxxx
2022-01-08T03:36:41.665Z | subchannel_refcount | 127.0.0.1:xxxx callRefcount 6 -> 7
2022-01-08T03:36:41.665Z | call_stream | [36] sending data chunk of length 77
2022-01-08T03:36:41.665Z | call_stream | [36] calling end() on HTTP/2 stream
2022-01-08T03:36:41.665Z | channel | dns:127.0.0.1:xxxx createCall [37] method="/grpcMethod", deadline=1641613011665
2022-01-08T03:36:41.665Z | call_stream | [37] Sending metadata
2022-01-08T03:36:41.665Z | channel | Pick result: COMPLETE subchannel: 127.0.0.1:xxxx status: undefined undefined
2022-01-08T03:36:41.665Z | call_stream | [37] write() called with message of length 72
2022-01-08T03:36:41.665Z | call_stream | [37] end() called
2022-01-08T03:36:41.665Z | call_stream | Starting stream on subchannel 127.0.0.1:xxxx with headers
		xxxx
		grpc-timeout: 10000m
		grpc-accept-encoding: identity,deflate,gzip
		accept-encoding: identity
		:authority: 127.0.0.1:xxxx
		user-agent: grpc-node-js/1.3.7
		content-type: application/grpc
		:method: POST
		:path: /grpcMethod
		te: trailers

2022-01-08T03:36:41.665Z | call_stream | [37] attachHttp2Stream from subchannel 127.0.0.1:xxxx
2022-01-08T03:36:41.665Z | subchannel_refcount | 127.0.0.1:xxxx callRefcount 7 -> 8
2022-01-08T03:36:41.666Z | call_stream | [37] sending data chunk of length 77
2022-01-08T03:36:41.666Z | call_stream | [37] calling end() on HTTP/2 stream
2022-01-08T03:36:41.666Z | channel | dns:127.0.0.1:xxxx createCall [38] method="/grpcMethod", deadline=1641613011666
2022-01-08T03:36:41.666Z | call_stream | [38] Sending metadata
2022-01-08T03:36:41.666Z | channel | Pick result: COMPLETE subchannel: 127.0.0.1:xxxx status: undefined undefined
2022-01-08T03:36:41.666Z | call_stream | [38] write() called with message of length 72
2022-01-08T03:36:41.666Z | call_stream | [38] end() called
2022-01-08T03:36:41.666Z | call_stream | Starting stream on subchannel 127.0.0.1:xxxx with headers
		xxxx
		grpc-timeout: 10000m
		grpc-accept-encoding: identity,deflate,gzip
		accept-encoding: identity
		:authority: 127.0.0.1:xxxx
		user-agent: grpc-node-js/1.3.7
		content-type: application/grpc
		:method: POST
		:path: /grpcMethod
		te: trailers

2022-01-08T03:36:41.666Z | call_stream | [38] attachHttp2Stream from subchannel 127.0.0.1:xxxx
2022-01-08T03:36:41.666Z | subchannel_refcount | 127.0.0.1:xxxx callRefcount 8 -> 9
2022-01-08T03:36:41.666Z | call_stream | [38] sending data chunk of length 77
2022-01-08T03:36:41.666Z | call_stream | [38] calling end() on HTTP/2 stream
2022-01-08T03:36:41.666Z | channel | dns:127.0.0.1:xxxx createCall [39] method="/grpcMethod", deadline=1641613011666
2022-01-08T03:36:41.666Z | call_stream | [39] Sending metadata
2022-01-08T03:36:41.666Z | channel | Pick result: COMPLETE subchannel: 127.0.0.1:xxxx status: undefined undefined
2022-01-08T03:36:41.666Z | call_stream | [39] write() called with message of length 72
2022-01-08T03:36:41.666Z | call_stream | [39] end() called
2022-01-08T03:36:41.667Z | call_stream | Starting stream on subchannel 127.0.0.1:xxxx with headers
		xxxx
		grpc-timeout: 10000m
		grpc-accept-encoding: identity,deflate,gzip
		accept-encoding: identity
		:authority: 127.0.0.1:xxxx
		user-agent: grpc-node-js/1.3.7
		content-type: application/grpc
		:method: POST
		:path: /grpcMethod
		te: trailers

2022-01-08T03:36:41.667Z | call_stream | [39] attachHttp2Stream from subchannel 127.0.0.1:xxxx
2022-01-08T03:36:41.667Z | subchannel_refcount | 127.0.0.1:xxxx callRefcount 9 -> 10
2022-01-08T03:36:41.667Z | call_stream | [39] sending data chunk of length 77
2022-01-08T03:36:41.667Z | call_stream | [39] calling end() on HTTP/2 stream
2022-01-08T03:36:41.667Z | channel | dns:127.0.0.1:xxxx createCall [40] method="/grpcMethod", deadline=1641613011667
2022-01-08T03:36:41.667Z | call_stream | [40] Sending metadata
2022-01-08T03:36:41.667Z | channel | Pick result: COMPLETE subchannel: 127.0.0.1:xxxx status: undefined undefined
2022-01-08T03:36:41.667Z | call_stream | [40] write() called with message of length 72
2022-01-08T03:36:41.667Z | call_stream | [40] end() called
2022-01-08T03:36:41.667Z | call_stream | Starting stream on subchannel 127.0.0.1:xxxx with headers
		xxxx
		grpc-timeout: 10000m
		grpc-accept-encoding: identity,deflate,gzip
		accept-encoding: identity
		:authority: 127.0.0.1:xxxx
		user-agent: grpc-node-js/1.3.7
		content-type: application/grpc
		:method: POST
		:path: /grpcMethod
		te: trailers

2022-01-08T03:36:41.667Z | call_stream | [40] attachHttp2Stream from subchannel 127.0.0.1:xxxx
2022-01-08T03:36:41.667Z | subchannel_refcount | 127.0.0.1:xxxx callRefcount 10 -> 11
2022-01-08T03:36:41.667Z | call_stream | [40] sending data chunk of length 77
2022-01-08T03:36:41.667Z | call_stream | [40] calling end() on HTTP/2 stream
2022-01-08T03:36:44.214Z | call_stream | [30] cancelWithStatus code: 4 details: "Deadline exceeded"
2022-01-08T03:36:44.214Z | call_stream | [30] ended with status: code=4 details="Deadline exceeded"
2022-01-08T03:36:44.214Z | subchannel_refcount | 127.0.0.1:xxxx callRefcount 11 -> 10
2022-01-08T03:36:44.214Z | call_stream | [30] close http2 stream with code 8
2022-01-08T03:36:44.233Z | call_stream | [31] HTTP/2 stream closed with code 7
2022-01-08T03:36:44.233Z | call_stream | [31] ended with status: code=14 details="Stream refused by server"
2022-01-08T03:36:44.233Z | subchannel_refcount | 127.0.0.1:xxxx callRefcount 10 -> 9
2022-01-08T03:36:44.233Z | call_stream | [32] HTTP/2 stream closed with code 7
2022-01-08T03:36:44.233Z | call_stream | [32] ended with status: code=14 details="Stream refused by server"
2022-01-08T03:36:44.233Z | subchannel_refcount | 127.0.0.1:xxxx callRefcount 9 -> 8
2022-01-08T03:36:44.233Z | call_stream | [33] HTTP/2 stream closed with code 7
2022-01-08T03:36:44.233Z | call_stream | [33] ended with status: code=14 details="Stream refused by server"
2022-01-08T03:36:44.233Z | subchannel_refcount | 127.0.0.1:xxxx callRefcount 8 -> 7
2022-01-08T03:36:44.233Z | call_stream | [34] HTTP/2 stream closed with code 7
2022-01-08T03:36:44.233Z | call_stream | [34] ended with status: code=14 details="Stream refused by server"
2022-01-08T03:36:44.233Z | subchannel_refcount | 127.0.0.1:xxxx callRefcount 7 -> 6
2022-01-08T03:36:44.233Z | call_stream | [35] HTTP/2 stream closed with code 7
2022-01-08T03:36:44.233Z | call_stream | [35] ended with status: code=14 details="Stream refused by server"
2022-01-08T03:36:44.233Z | subchannel_refcount | 127.0.0.1:xxxx callRefcount 6 -> 5
2022-01-08T03:36:44.233Z | call_stream | [36] HTTP/2 stream closed with code 7
2022-01-08T03:36:44.233Z | call_stream | [36] ended with status: code=14 details="Stream refused by server"
2022-01-08T03:36:44.233Z | subchannel_refcount | 127.0.0.1:xxxx callRefcount 5 -> 4
2022-01-08T03:36:44.233Z | call_stream | [37] HTTP/2 stream closed with code 7
2022-01-08T03:36:44.233Z | call_stream | [37] ended with status: code=14 details="Stream refused by server"
2022-01-08T03:36:44.233Z | subchannel_refcount | 127.0.0.1:xxxx callRefcount 4 -> 3
2022-01-08T03:36:44.233Z | call_stream | [38] HTTP/2 stream closed with code 7
2022-01-08T03:36:44.233Z | call_stream | [38] ended with status: code=14 details="Stream refused by server"
2022-01-08T03:36:44.233Z | subchannel_refcount | 127.0.0.1:xxxx callRefcount 3 -> 2
2022-01-08T03:36:44.233Z | call_stream | [39] HTTP/2 stream closed with code 7
2022-01-08T03:36:44.233Z | call_stream | [39] ended with status: code=14 details="Stream refused by server"
2022-01-08T03:36:44.233Z | subchannel_refcount | 127.0.0.1:xxxx callRefcount 2 -> 1
2022-01-08T03:36:44.233Z | call_stream | [40] HTTP/2 stream closed with code 7
2022-01-08T03:36:44.233Z | call_stream | [40] ended with status: code=14 details="Stream refused by server"
2022-01-08T03:36:44.234Z | subchannel_refcount | 127.0.0.1:xxxx callRefcount 1 -> 0

thegleb, Jan 08 '22

Would it be possible to reproduce this error with TLS disabled? If you could get a TCP dump of this interaction, we could conclusively determine where the error comes from.

murgatroid99, Jan 10 '22

I will look into getting a TCP dump; I need to do some more research to figure that out. However, this might provide another clue: it turns out this condition is triggered by a single GOAWAY sent by our server in response to a bad request (one that I thought was merely slow, because it would time out 100% of the time). The last log entry from the connection proxy, upon receiving an error response, shows this message:

2022/01/10 16:46:41 http2: received GOAWAY [FrameHeader GOAWAY len=8], starting graceful shutdown

This is received after ~5 seconds, but the client does not react until the request expires at 10 seconds, and after this error any new requests do not get sent to the server at all, per the chatter in the connection proxy logs. I suspect something can be done on the library side (or maybe in our implementation) to handle this state better, as well as on the backend to avoid this condition in the first place (only one type of failure seems to trigger it for us). I'm investigating further to see how I can trigger it on demand in my mock Java server.

EDIT: this might be an entirely separate issue from the OP's; I'm happy to re-file and continue this as a separate issue once a little more detail about it is understood.

thegleb, Jan 11 '22

Is your server application code purposely shutting down the server when handling the bad request? If not, I don't know why it would send a GOAWAY like that. I will note that if the server sends a GOAWAY, but does not respond to the specific stream, it is the expected behavior that the client request will continue waiting for a response and eventually time out.

The proxy logs may answer the same questions I wanted to answer with a TCP dump. When you say "any new requests do not get sent to the server at all", do you mean that the client does not send anything to the proxy, or that the proxy does not send anything to the server?

murgatroid99, Jan 11 '22

Is your server application code purposely shutting down the server when handling the bad request? If not, I don't know why it would send a GOAWAY like that.

No, not intentional, and possibly caused by some faulty exception handling logic we might be able to fix - we're planning to look into how this exception is handled once my teammates free up next week to help me test.

I will note that if the server sends a GOAWAY, but does not respond to the specific stream, it is the expected behavior that the client request will continue waiting for a response and eventually time out.

Is there anything we can do on the client side to avoid needing a full restart of the client to re-establish the connection in this situation? Any way to periodically close and reopen the channel?

I did more targeted testing using the hacky code in https://github.com/grpc/grpc-node/pull/1698, and the reason it works is that it terminates the connection after every error, in this case reacting to the first DEADLINE_EXCEEDED from the request that created the GOAWAY situation.

Logs for the above example with the patch included:

2022-01-12T04:03:15.395Z | channel | dns:127.0.0.1:xxxx createCall [33] method="/grpcMethod", deadline=1641960205395
2022-01-12T04:03:15.395Z | call_stream | [33] Sending metadata
2022-01-12T04:03:15.395Z | channel | Pick result: COMPLETE subchannel: 127.0.0.1:xxxx status: undefined undefined
2022-01-12T04:03:15.396Z | call_stream | [33] write() called with message of length 3194
2022-01-12T04:03:15.396Z | call_stream | [33] end() called
2022-01-12T04:03:15.396Z | call_stream | Starting stream on subchannel 127.0.0.1:xxxx with headers
		rpc-caller: grpc-test
		rpc-encoding: proto
		grpc-timeout: 9999m
		grpc-accept-encoding: identity,deflate,gzip
		accept-encoding: identity
		:authority: 127.0.0.1:xxxx
		user-agent: grpc-node-js/1.3.7
		content-type: application/grpc
		:method: POST
		:path: /grpcMethod
		te: trailers

2022-01-12T04:03:15.396Z | call_stream | [33] attachHttp2Stream from subchannel 127.0.0.1:xxxx
2022-01-12T04:03:15.396Z | subchannel_refcount | 127.0.0.1:xxxx callRefcount 0 -> 1
2022-01-12T04:03:15.396Z | call_stream | [33] sending data chunk of length 3199
2022-01-12T04:03:15.397Z | call_stream | [33] calling end() on HTTP/2 stream
2022-01-12T04:03:25.395Z | call_stream | [33] cancelWithStatus code: 4 details: "Deadline exceeded"
2022-01-12T04:03:25.395Z | call_stream | [33] ended with status: code=4 details="Deadline exceeded"
2022-01-12T04:03:25.395Z | subchannel_refcount | 127.0.0.1:xxxx callRefcount 1 -> 0
2022-01-12T04:03:25.395Z | call_stream | [33] destroy http2 stream after error 
2022-01-12T04:03:25.396Z | call_stream | [33] Node error event: message=Stream error code=undefined errno=Unknown system error undefined syscall=undefined
2022-01-12T04:03:25.396Z | call_stream | [33] HTTP/2 stream closed with code 2

(followed by a variation of the below; I forgot to grab this bit before clearing the console the first time)

2022-01-12T04:24:35.752Z | subchannel | 127.0.0.1:xxxx connection closed
2022-01-12T04:24:35.752Z | subchannel | 127.0.0.1:xxxx READY -> IDLE

The behavior with the patch isn't exclusive to this type of condition: it will torch the connection on any error, so I can see how it's not a long-term solution.

When you say "any new requests do not get sent to the server at all", do you mean that the client does not send anything to the proxy, or that the proxy does not send anything to the server?

The first one. There are no records of outbound requests in the proxy for requests initiated after the GOAWAY is returned. Any requests that were in flight at the same time as the bad request (e.g. initiated after the bad request but before the GOAWAY happens) will be sent to the service and time out as well, and the rest will follow the client-side deadline exceeded / stream refused pattern outlined above.

thegleb, Jan 12 '22

I will note that if the server sends a GOAWAY, but does not respond to the specific stream, it is the expected behavior that the client request will continue waiting for a response and eventually time out.

Is there anything we can do on the client side to avoid needing a full restart of the client to re-establish the connection in this situation? Any way to periodically close and reopen the channel?

What I meant is that we would expect that single request to time out if the server behaves that way. It is also not surprising for the other calls to end with "stream refused" errors specifically because they had already started on that connection when the server shut that connection down. Any requests made after those calls fail should create a new connection and use it.
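
On the question of avoiding a full client restart: here is a minimal sketch of how an application could nudge a channel that appears stuck, using the public channel APIs (getConnectivityState with tryToConnect, and watchConnectivityState); whether that is appropriate depends on why the channel is stuck:

const grpc = require('@grpc/grpc-js');

// Sketch: ask the client's channel to reconnect if it has gone IDLE, and log
// the next state transition (or the lack of one within 5 seconds).
function nudgeChannel(client) {
    const channel = client.getChannel();
    const state = channel.getConnectivityState(true); // true = try to connect if IDLE
    channel.watchConnectivityState(state, Date.now() + 5000, (err) => {
        if (err) {
            console.log('channel state unchanged after 5s:', grpc.connectivityState[state]);
        } else {
            console.log('channel state changed from:', grpc.connectivityState[state]);
        }
    });
}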

I have suspected for a while that Node itself has a bug that causes it to sometimes not report GOAWAYs, but previous reports of similar problems never had a consistent reproduction like you seem to have here. If you could create and share a completely isolated reproduction of this error, that would be really helpful.

murgatroid99, Jan 12 '22

I'm seeing the same problem: our Go gRPC server has a max connection age setting that will have it send a GOAWAY to make sure connections are relatively evenly distributed across server instances. On the client side, Error: 14 UNAVAILABLE: Stream refused by server errors are being logged by grpc-js when this happens.

ldx, Jan 12 '22

If you could create and share a completely isolated reproduction of this error, that would be really helpful.

Still working on this; I have a suspicion the GOAWAY in our case might be swallowed somewhere upstream of the client, but I'm looking into it more deeply this week to find conclusive evidence one way or the other.

I did find an alternate solution that appears to work for this type of case: setting the grpc.keepalive_time_ms and grpc.keepalive_timeout_ms channel options. This cuts down the detection time compared to the other sledgehammer approach, because the client can successfully send new requests once a keepalive ping times out, unlike the patch, where the client reacts only after the first request times out.
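
Roughly what that looks like when constructing the client (MyServiceClient and the target are placeholders; the 1000ms values match the test logs below and are far too aggressive for production, where the server's keepalive policy also has to be respected):

const grpc = require('@grpc/grpc-js');

// Keepalive options used for the experiment below; real deployments should use
// much larger values.
const client = new MyServiceClient('127.0.0.1:50051', grpc.credentials.createInsecure(), {
    'grpc.keepalive_time_ms': 1000,           // send a ping every second
    'grpc.keepalive_timeout_ms': 1000,        // consider the connection dead if no ack within 1s
    'grpc.keepalive_permit_without_calls': 1, // assumed option: ping even with no active calls
});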

Example logs when I set those to 1000ms/1000ms

* the request is made

2022-01-15T00:48:29.608Z | call_stream | [0] attachHttp2Stream from subchannel 127.0.0.1:xxxx
2022-01-15T00:48:29.608Z | subchannel_refcount | 127.0.0.1:xxxx callRefcount 0 -> 1
2022-01-15T00:48:29.609Z | call_stream | [0] sending data chunk of length 3199 (deferred)
2022-01-15T00:48:29.609Z | call_stream | [0] calling end() on HTTP/2 stream
2022-01-15T00:48:30.609Z | keepalive | Sending ping to 127.0.0.1:xxxx
2022-01-15T00:48:31.610Z | keepalive | Sending ping to 127.0.0.1:xxxx
2022-01-15T00:48:32.610Z | keepalive | Sending ping to 127.0.0.1:xxxx
2022-01-15T00:48:33.611Z | keepalive | Sending ping to 127.0.0.1:xxxx
2022-01-15T00:48:34.612Z | keepalive | Sending ping to 127.0.0.1:xxxx

* error occurs around this timestamp, and the client does not hear back from this ping

2022-01-15T00:48:35.613Z | keepalive | Sending ping to 127.0.0.1:xxxx

* channel gets closed, and new requests triggered after this moment can be processed

2022-01-15T00:48:36.614Z | subchannel | 127.0.0.1:xxxx READY -> IDLE
2022-01-15T00:48:36.614Z | subchannel_refcount | 127.0.0.1:xxxx refcount 2 -> 1
2022-01-15T00:48:36.615Z | pick_first | READY -> IDLE
2022-01-15T00:48:36.615Z | resolving_load_balancer | dns:127.0.0.1:xxxx READY -> IDLE
2022-01-15T00:48:36.615Z | connectivity_state | dns:127.0.0.1:xxxx READY -> IDLE

* bad request finally times out and falls off

2022-01-15T00:48:39.589Z | call_stream | [0] cancelWithStatus code: 4 details: "Deadline exceeded"
2022-01-15T00:48:39.589Z | call_stream | [0] ended with status: code=4 details="Deadline exceeded"
2022-01-15T00:48:39.589Z | subchannel_refcount | 127.0.0.1:xxxx callRefcount 1 -> 0
2022-01-15T00:48:39.590Z | call_stream | [0] close http2 stream with code 8

Client-side experience:

2022-01-15T01:04:16.568Z - start 1 bad request(s)

// 4 requests that would succeed normally, but they are all queued after the bad request
2022-01-15T01:04:16.943Z - start 1 good request(s)
2022-01-15T01:04:17.120Z - start 1 good request(s)
2022-01-15T01:04:17.286Z - start 1 good request(s)
2022-01-15T01:04:17.452Z - start 1 good request(s)

// bad request times out
2022-01-15T01:04:26.616Z - GraphQL error: 4 DEADLINE_EXCEEDED: Deadline exceeded after 10.048 seconds

// good requests get refused because the connection they were on got broken
2022-01-15T01:04:26.623Z - GraphQL error: 14 UNAVAILABLE: Stream refused by server after 9.68 seconds
2022-01-15T01:04:26.626Z - GraphQL error: 14 UNAVAILABLE: Stream refused by server after 9.506 seconds
2022-01-15T01:04:26.630Z - GraphQL error: 14 UNAVAILABLE: Stream refused by server after 9.344 seconds
2022-01-15T01:04:26.634Z - GraphQL error: 14 UNAVAILABLE: Stream refused by server after 9.182 seconds

// however we can make new requests successfully
2022-01-15T01:04:31.380Z - start 1 good request(s)
2022-01-15T01:04:31.533Z - succeeded after 0.152 seconds 

thegleb, Jan 19 '22

Success: I can reproduce our specific condition outside of our backend. This is probably my final update for this thread, since it makes sense to open a separate issue to dig into that specific behavior, but tl;dr: in our case it seems like there's an extremely long error stack trace that gets sent back with the "bad" response, and the last messages logged at the server are:

i.g.n.s.i.grpc.netty.NettyServerHandler  : [id: 0x94f7d7c7, L:/127.0.0.1:xxxx - R:/127.0.0.1:yyyyy] OUTBOUND HEADERS: streamId=1 headers=GrpcHttp2OutboundHeaders[:status: 200, content-type: application/grpc, rpc-service: service-name, grpc-status: 3, grpc-message: <snip, extremely long error message>
i.g.n.s.i.grpc.netty.NettyServerHandler  : [id: 0x94f7d7c7, L:/127.0.0.1:xxxx - R:/127.0.0.1:yyyyy] INBOUND GO_AWAY: lastStreamId=0 errorCode=9 length=0 bytes=

The limit seems to be somewhere north of ~80k characters, which is a bit strange to me since, AFAIK, the header size limit is 8 KB in the Node 12 runtime this is running on.

I suspect the http2 client rejects the response and the grpc client never sees it. As soon as I trim the error response, the request/response pair works as expected. I'll include a sample client/server pair in a separate issue.

thegleb, Jan 26 '22

In case it's helpful: I just got this one time while running a Google Cloud Function that calls a Firestore in Datastore mode database. (The same endpoint has processed millions of requests without running into it before, though.)

Trace:

Error: 14 UNAVAILABLE: Stream refused by server
    at Object.callErrorFromStatus (/workspace/node_modules/@grpc/grpc-js/build/src/call.js:31:26)
    at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client.js:189:52)
    at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:365:141)
    at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:328:181)
    at /workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:187:78
    at processTicksAndRejections (node:internal/process/task_queues:78:11)

bcjordan, Jul 06 '22

Quick update: still getting these sporadically, 8 instances in the last 30 days (out of ~3 million successful requests).

bcjordan, Sep 07 '22

I'm seeing this sporadically on a Firebase Function interacting with Firestore too.

rajivsubra1981, Sep 19 '22

The original issue seems to have been resolved.

Regarding "Stream refused by server" errors in general, since @grpc/grpc-js 1.8.0, this error should usually be transparently retried, so it should very rarely be surfaced.

murgatroid99, Dec 14 '23