
Receiving RST_STREAM with error code 2 while using grpc secure channel

Open riyashah17 opened this issue 3 years ago • 21 comments

Problem description

We are using gRPC to communicate between services. For production deployment, we switched to server authentication. We have a bidirectional streaming gRPC service: it expects audio chunks sent in a streaming fashion and returns transcripts as streaming responses. After getting the final response, we get this error: 13 INTERNAL: Received RST_STREAM with error code 2. We were not getting this error with an insecure channel (no authentication). Furthermore, we can't find any error on the server side.
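
For context, the client side of the call looks roughly like this (a simplified sketch; streamingRecognize, handleTranscript, audioChunks and the message shapes are placeholders, not our actual generated stubs):

// Bidirectional streaming call: write audio chunks, read transcripts as they arrive.
const call = client.streamingRecognize(); // hypothetical bidi-streaming method on the client
call.on('data', (transcript) => handleTranscript(transcript)); // streamed responses
call.on('error', (err) => console.error(err)); // this is where "13 INTERNAL: Received RST_STREAM with error code 2" surfaces
call.on('end', () => console.log('server finished sending'));
for (const chunk of audioChunks) {
  call.write({ audio: chunk }); // hypothetical request message shape
}
call.end();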

Reproduction steps

Will try to include a repo soon.

Environment

  • OS name, version and architecture: Linux Ubuntu 18.04.2 LTS x86-64
  • Node version: v10.19.0
  • Package name and version: "@grpc/grpc-js": "1.1.0", "grpc": "1.24.4"

Additional context

We are using an AWS gRPC load balancer in front of the server. The server is implemented in Python. We wrote a small piece of code to check whether this error occurs with a Python client, and observed that it does not.

Node.js code for initializing the gRPC client:

// Assumes grpc is the loaded gRPC library and zapr_speech_proto is the loaded service definition.
const ssl_creds = grpc.credentials.createSsl();
let client = new zapr_speech_proto.Speech(this.server_address, ssl_creds);

riyashah17 avatar Apr 11 '21 12:04 riyashah17

That error code indicates an "internal error", which can have a variety of causes. I suggest updating to a newer version of @grpc/grpc-js, which has more detailed information for that error, and then testing again.

murgatroid99 avatar Apr 12 '21 17:04 murgatroid99

@murgatroid99 Hey, thanks for the prompt reply. I upgraded the @grpc/grpc-js package to version 1.2.12 and tested again. I am still getting the same error: Error: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error)

riyashah17 avatar Apr 12 '21 18:04 riyashah17

Unfortunately, that message means that that is all of the information the client has about the error. It looks like the server, or maybe some intermediate proxy, is sending an HTTP/2-level error code 2, which indicates an "internal error".

murgatroid99 avatar Apr 14 '21 21:04 murgatroid99

@riyashah17 Our balancer cuts the stream when no data has been sent for more than 1 minute. Maybe you have the same problem.

Kupstas avatar Apr 28 '21 17:04 Kupstas

If that is the problem, it may help to change the keepalive time so that the client sends pings more frequently than once per minute, to keep the connection alive. That is controlled by the channel option "grpc.keepalive_time_ms".
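
For example (a minimal sketch based on the snippet in the original report; 30 seconds is just an illustrative value), channel options are passed as the third argument of the generated client constructor:

const ssl_creds = grpc.credentials.createSsl();
const options = {
  // Send an HTTP/2 keepalive ping every 30 seconds so an idle proxy does not drop the connection.
  'grpc.keepalive_time_ms': 30 * 1000,
};
let client = new zapr_speech_proto.Speech(this.server_address, ssl_creds, options);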

murgatroid99 avatar Apr 29 '21 17:04 murgatroid99

@murgatroid99 thanks for the response. I've tried a few keepalive time intervals, but it has no effect.

Kupstas avatar Apr 29 '21 17:04 Kupstas

We're getting the same issue with AWS ELB HTTP/2. The keepalive options in the attached image didn't help.

image

Update: I think we'll have to set GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA to 0 in order to keep sending pings to keep the connection alive. I'll post here if it fixes our problem.

hamxabaig avatar Nov 30 '21 06:11 hamxabaig

The options in that image will not enable keepalive. To enable keepalive, you need to set the option grpc.keepalive_time_ms, which determines when pings will be sent. The grpc.keepalive_timeout_ms option specifies how long the client will wait for a ping response before deciding that the connection has been lost. The other two options are not understood by @grpc/grpc-js; it will always send pings without data.

murgatroid99 avatar Nov 30 '21 14:11 murgatroid99

@murgatroid99 Thank you so much for replying, I'll try grpc.keepalive_time_ms and let you know if it works.

hamxabaig avatar Nov 30 '21 18:11 hamxabaig

This might solve your problem. Please check your runtimeconfig and set "use": { "emulator": "false" }. I am sorry if you have already tried this.

hkharmfulbear avatar Dec 26 '21 06:12 hkharmfulbear

Guys, did you manage to solve this problem?

alexsanderp avatar Feb 03 '22 22:02 alexsanderp

Our problem was the AWS load balancer: it was dropping the connection after 60 seconds. Changing its configuration fixed it.

hamxabaig avatar Feb 05 '22 18:02 hamxabaig

Our problem was the AWS load balancer: it was dropping the connection after 60 seconds. Changing its configuration fixed it.

Would you mind sharing how you changed the configuration? We are encountering the same problem.

DominikSchwager avatar Jun 01 '22 05:06 DominikSchwager

@DominikSchwager You can go to the Load Balancers section in the AWS console, select the LB, and after a bit of scrolling you'll see the "Attributes" section in the bottom half of the screen. There you can change it. But even that didn't work out for me. Did anyone solve this by some other method?

krypticJinkle avatar Jul 21 '22 07:07 krypticJinkle

If you try the old grpc package with the AWS ALB and grpc.keepalive_time_ms set lower than the ALB idle timeout, the issue is not present.

So that tells me that the issue lies somewhere in the @grpc/grpc-js package.

jovanmilenkoski avatar Nov 04 '22 11:11 jovanmilenkoski

Add keepalive parameters with appropriate values.

krypticJinkle avatar Nov 10 '22 13:11 krypticJinkle

We're running into a similar situation where we run an AWS Lambda that does some gRPC calls to one of our services. We're running an AWS ALB (Application Load Balancer) which has its timeout configured at 60 seconds.

Most of our calls work without issues, but every few hours, we're seeing the Received RST_STREAM with code 2 (Internal server error) error pop up. So it's a bit hard to reproduce.

Things we've observed so far:

  • grpc-js 1.8.0 was working just fine. The problems started popping up when we updated to 1.8.21. Updating to the latest 1.9.5 didn't fix things.
  • Configured grpc.keepalive_time_ms to be 50 seconds (10 seconds below the AWS ALB timeout of 60 seconds)

Here's a debug log leading up to the issue:

D 2023-10-11T02:53:42.692Z | channel | (2) dns:REDACTED:443 createResolvingCall [152] method="/REDACTED", deadline=2023-10-11T02:53:47.692Z
D 2023-10-11T02:53:42.730Z | resolving_call | [152] Created
D 2023-10-11T02:53:42.730Z | resolving_call | [152] Deadline will be reached in 4962ms
D 2023-10-11T02:53:42.730Z | resolving_call | [152] Deadline: 2023-10-11T02:53:47.692Z
D 2023-10-11T02:53:42.730Z | resolving_call | [152] start called
D 2023-10-11T02:53:42.731Z | resolving_call | [152] startRead called
D 2023-10-11T02:53:42.750Z | resolving_call | [152] halfClose called
D 2023-10-11T02:53:42.750Z | resolving_call | [152] write() called with message of length 38
D 2023-10-11T02:53:42.770Z | channel | (2) dns:REDACTED:443 createLoadBalancingCall [154] method="/REDACTED"
D 2023-10-11T02:53:42.770Z | channel | (2) dns:REDACTED:443 createRetryingCall [153] method="/REDACTED"
D 2023-10-11T02:53:42.770Z | resolving_call | [152] Created child [153]
D 2023-10-11T02:53:42.770Z | retrying_call | [153] start called
D 2023-10-11T02:53:42.771Z | load_balancing_call | [154] Pick called
D 2023-10-11T02:53:42.771Z | load_balancing_call | [154] Pick result: COMPLETE subchannel: (35) 10.130.2.141:443 status: undefined undefined
D 2023-10-11T02:53:42.771Z | load_balancing_call | [154] start called
D 2023-10-11T02:53:42.771Z | load_balancing_call | [154] startRead called
D 2023-10-11T02:53:42.771Z | retrying_call | [153] Created child call [154] for attempt 1
D 2023-10-11T02:53:42.771Z | retrying_call | [153] startRead called
D 2023-10-11T02:53:42.791Z | transport_flowctrl | (44) 10.130.2.141:443 local window size: 65066 remote window size: 2147483127
D 2023-10-11T02:53:42.791Z | transport_internals | (44) 10.130.2.141:443 session.closed=false session.destroyed=false session.socket.destroyed=false
D 2023-10-11T02:53:42.810Z | keepalive | (44) 10.130.2.141:443 Sending ping with timeout 20000ms
D 2023-10-11T02:53:42.810Z | load_balancing_call | [154] Created child call [155]
D 2023-10-11T02:53:42.810Z | retrying_call | [153] write() called with message of length 43
D 2023-10-11T02:53:42.870Z | load_balancing_call | [154] write() called with message of length 43
D 2023-10-11T02:53:42.870Z | retrying_call | [153] halfClose called
D 2023-10-11T02:53:42.870Z | subchannel_call | [155] sending data chunk of length 43
D 2023-10-11T02:53:42.870Z | subchannel_call | [155] write() called with message of length 43
D 2023-10-11T02:53:42.992Z | subchannel_call | [155] ended with status: code=13 details="Received RST_STREAM with code 2 (Internal server error)"
D 2023-10-11T02:53:42.992Z | subchannel_call | [155] HTTP/2 stream closed with code 2
D 2023-10-11T02:53:43.030Z | load_balancing_call | [154] ended with status: code=13 details="Received RST_STREAM with code 2 (Internal server error)"
D 2023-10-11T02:53:43.030Z | load_balancing_call | [154] Received status
D 2023-10-11T02:53:43.030Z | retrying_call | [153] Received status from child [154]
D 2023-10-11T02:53:43.030Z | retrying_call | [153] state=TRANSPARENT_ONLY handling status with progress PROCESSED from child [154] in state ACTIVE
D 2023-10-11T02:53:43.050Z | resolving_call | [152] ended with status: code=13 details="Received RST_STREAM with code 2 (Internal server error)"
D 2023-10-11T02:53:43.050Z | resolving_call | [152] Received status
D 2023-10-11T02:53:43.050Z | retrying_call | [153] ended with status: code=13 details="Received RST_STREAM with code 2 (Internal server error)"
D 2023-10-11T02:53:43.132Z | keepalive | (44) 10.130.2.141:443 Ping failed with error HTTP2 ping cancelled
D 2023-10-11T02:53:43.170Z | keepalive | (44) 10.130.2.141:443 Received ping response

As you can see, the keepalive ping is sent correctly (Sending ping with timeout 20000ms), and this happens consistently as we expect, but it doesn't seem to help with the RST_STREAM error.

I'll go back to 1.8.0 for now and will slowly work towards the problematic 1.8.21, so we can identify which version introduced this problem. This might take a couple days though because the error only pops up every few hours.

dennisameling avatar Oct 11 '23 06:10 dennisameling

To clarify, does the ALB drop the connection after 60 seconds of inactivity, or just 60 seconds after it is created? If it is inactivity, is the client inactive for the 60 seconds preceding that log? If so, the keepalive setting is likely not actually helping, because keepalive pings are only sent if there is a pending request. That behavior can be changed with the channel option grpc.keepalive_permit_without_calls. Note that this will just cause it to keep pinging while the client is not actively using the connection, which may eventually result in the ALB dropping the connection anyway. However, if that connection drop is detected by the keepalives, it still shouldn't result in the kind of request failure you are seeing.
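
A hedged sketch of combining these options (the option names are the documented channel options; the client name, target address, and interval values here are placeholders):

const grpc = require('@grpc/grpc-js');

const options = {
  // Ping every 50 seconds, just under the 60-second ALB idle timeout.
  'grpc.keepalive_time_ms': 50 * 1000,
  // Treat the connection as lost if a ping gets no response within 20 seconds.
  'grpc.keepalive_timeout_ms': 20 * 1000,
  // Keep sending pings even when no calls are in flight.
  'grpc.keepalive_permit_without_calls': 1,
};
const client = new MyService('REDACTED_INTERNAL_HOSTNAME:443', grpc.credentials.createSsl(), options);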

murgatroid99 avatar Oct 11 '23 17:10 murgatroid99

Sorry for the late follow-up here.

To clarify, does the ALB drop the connection after 60 seconds of inactivity, or just 60 seconds after it is created?

60 seconds of inactivity, that is. I wonder if keepalive_permit_without_calls would be of any help here, also considering my notes below. What do you think?

I wanted to capture the logs of the full lifetime of a problematic Lambda. That was more difficult than expected, because we couldn't filter the logs by something like a "Lambda unique ID" while multiple Lambdas were running in parallel. I added such an ID to the logs, which allowed me to capture the full lifetime:

lambda-grpc-logs.csv

Some things that stand out to me:

  • The Lambda lived for 70 minutes. It handled 177 requests during that time, only one of which failed with the RST_STREAM error. This is in line with what we're seeing across the board: only a very small subset of requests fails.
  • There are 18 "Ping failed with error HTTP2 ping cancelled" messages, and in 17 out of 18 cases this doesn't result in further problems. However, just after the RST_STREAM error occurred, this "Ping failed" error showed up as well.
  • The client receives a connection closed by GOAWAY with code 0 in the second before the RST_STREAM error pops up, but it looks like it's not handled properly and results in the RST_STREAM error (is this a correct interpretation of the logs?)

I wonder if we could prevent the GOAWAY somehow or if our code (or the library?) would need to be tuned to better handle this scenario. What do you think?

Subset of logs in the seconds leading up to the RST_STREAM error

D 2023-10-22T20:34:52.466Z | keepalive | (23) 10.132.1.153:443 Starting keepalive timer for 50000ms
D 2023-10-22T20:34:52.485Z | subchannel | (6) 10.132.1.153:443 CONNECTING -> READY
D 2023-10-22T20:34:52.528Z | transport | (23) 10.132.1.153:443 local settings acknowledged by remote: {"headerTableSize":4096,"enablePush":true,"initialWindowSize":65535,"maxFrameSize":16384,"maxConcurrentStreams":4294967295,"maxHeaderListSize":4294967295,"maxHeaderSize":4294967295,"enableConnectProtocol":false}
D 2023-10-22T20:37:14.627Z | subchannel_refcount | (6) 10.132.1.153:443 refcount 1 -> 0
D 2023-10-22T20:37:14.646Z | subchannel | (6) 10.132.1.153:443 READY -> IDLE
D 2023-10-22T20:37:14.666Z | keepalive | (22) 10.132.2.131:443 Sending ping with timeout 20000ms
D 2023-10-22T20:37:14.746Z | keepalive | (23) 10.132.1.153:443 Sending ping with timeout 20000ms
Handling authorization request (event version 2.0) for routeArn arn:aws:execute-api:us-west-2:REDACTED:REDACTED/production-default/POST/REDACTED/
D 2023-10-22T20:37:14.766Z | channel | (1) dns:REDACTED_INTERNAL_HOSTNAME:443 createResolvingCall [536] method="/REDACTED/Authenticate", deadline=2023-10-22T20:37:19.766Z
D 2023-10-22T20:37:14.786Z | resolving_call | [536] Created
D 2023-10-22T20:37:14.806Z | resolving_call | [536] halfClose called
D 2023-10-22T20:37:14.806Z | resolving_call | [536] write() called with message of length 212
D 2023-10-22T20:37:14.806Z | resolving_call | [536] startRead called
D 2023-10-22T20:37:14.806Z | resolving_call | [536] start called
D 2023-10-22T20:37:14.806Z | resolving_call | [536] Deadline will be reached in 4960ms
D 2023-10-22T20:37:14.806Z | resolving_call | [536] Deadline: 2023-10-22T20:37:19.766Z
D 2023-10-22T20:37:14.807Z | channel | (1) dns:REDACTED_INTERNAL_HOSTNAME:443 createRetryingCall [537] method="/REDACTED/Authenticate"
D 2023-10-22T20:37:14.826Z | load_balancing_call | [538] startRead called
D 2023-10-22T20:37:14.826Z | retrying_call | [537] startRead called
D 2023-10-22T20:37:14.826Z | load_balancing_call | [538] Pick result: COMPLETE subchannel: (21) 10.132.2.131:443 status: undefined undefined
D 2023-10-22T20:37:14.826Z | load_balancing_call | [538] Pick called
D 2023-10-22T20:37:14.826Z | load_balancing_call | [538] start called
D 2023-10-22T20:37:14.826Z | retrying_call | [537] Created child call [538] for attempt 1
D 2023-10-22T20:37:14.826Z | channel | (1) dns:REDACTED_INTERNAL_HOSTNAME:443 createLoadBalancingCall [538] method="/REDACTED/Authenticate"
D 2023-10-22T20:37:14.826Z | retrying_call | [537] start called
D 2023-10-22T20:37:14.826Z | resolving_call | [536] Created child [537]
D 2023-10-22T20:37:14.827Z | transport_flowctrl | (22) 10.132.2.131:443 local window size: 65399 remote window size: 2147483604
D 2023-10-22T20:37:14.846Z | load_balancing_call | [538] Created child call [539]
D 2023-10-22T20:37:14.827Z | transport_internals | (22) 10.132.2.131:443 session.closed=false session.destroyed=false session.socket.destroyed=false
D 2023-10-22T20:37:14.866Z | retrying_call | [537] halfClose called
D 2023-10-22T20:37:14.866Z | subchannel_call | [539] sending data chunk of length 217
D 2023-10-22T20:37:14.866Z | subchannel_call | [539] write() called with message of length 217
D 2023-10-22T20:37:14.866Z | load_balancing_call | [538] write() called with message of length 217
D 2023-10-22T20:37:14.866Z | retrying_call | [537] write() called with message of length 217
D 2023-10-22T20:37:15.246Z | pick_first | READY -> IDLE
D 2023-10-22T20:37:15.246Z | subchannel_refcount | (21) 10.132.2.131:443 refcount 3 -> 2
D 2023-10-22T20:37:15.246Z | subchannel | (21) 10.132.2.131:443 READY -> IDLE
D 2023-10-22T20:37:15.028Z | transport | (22) 10.132.2.131:443 connection closed by GOAWAY with code 0
D 2023-10-22T20:37:15.266Z | connectivity_state | (1) dns:REDACTED_INTERNAL_HOSTNAME:443 READY -> IDLE
D 2023-10-22T20:37:15.266Z | resolving_load_balancer | dns:REDACTED_INTERNAL_HOSTNAME:443 READY -> IDLE
D 2023-10-22T20:37:15.267Z | resolving_load_balancer | dns:REDACTED_INTERNAL_HOSTNAME:443 IDLE -> CONNECTING
D 2023-10-22T20:37:15.267Z | dns_resolver | Looking up DNS hostname REDACTED_INTERNAL_HOSTNAME
D 2023-10-22T20:37:15.286Z | connectivity_state | (2) dns:REDACTED_INTERNAL_HOSTNAME:443 IDLE -> CONNECTING
D 2023-10-22T20:37:15.286Z | resolving_load_balancer | dns:REDACTED_INTERNAL_HOSTNAME:443 IDLE -> CONNECTING
D 2023-10-22T20:37:15.286Z | dns_resolver | Looking up DNS hostname REDACTED_INTERNAL_HOSTNAME
D 2023-10-22T20:37:15.286Z | connectivity_state | (2) dns:REDACTED_INTERNAL_HOSTNAME:443 READY -> IDLE
D 2023-10-22T20:37:15.286Z | resolving_load_balancer | dns:REDACTED_INTERNAL_HOSTNAME:443 READY -> IDLE
D 2023-10-22T20:37:15.286Z | pick_first | READY -> IDLE
D 2023-10-22T20:37:15.286Z | subchannel_refcount | (21) 10.132.2.131:443 refcount 2 -> 1
D 2023-10-22T20:37:15.286Z | connectivity_state | (1) dns:REDACTED_INTERNAL_HOSTNAME:443 IDLE -> CONNECTING
D 2023-10-22T20:37:15.289Z | retrying_call | [537] ended with status: code=13 details="Received RST_STREAM with code 2 (Internal server error)"
D 2023-10-22T20:37:15.289Z | retrying_call | [537] state=TRANSPARENT_ONLY handling status with progress PROCESSED from child [538] in state ACTIVE
D 2023-10-22T20:37:15.289Z | retrying_call | [537] Received status from child [538]
D 2023-10-22T20:37:15.289Z | load_balancing_call | [538] ended with status: code=13 details="Received RST_STREAM with code 2 (Internal server error)"
D 2023-10-22T20:37:15.289Z | load_balancing_call | [538] Received status
D 2023-10-22T20:37:15.289Z | subchannel_call | [539] ended with status: code=13 details="Received RST_STREAM with code 2 (Internal server error)"
D 2023-10-22T20:37:15.289Z | subchannel_call | [539] HTTP/2 stream closed with code 2
D 2023-10-22T20:37:15.290Z | resolving_call | [536] ended with status: code=13 details="Received RST_STREAM with code 2 (Internal server error)"
D 2023-10-22T20:37:15.289Z | resolving_call | [536] Received status
{"hostname":"arn:aws:lambda:us-west-2:REDACTED:function:production-lambda-authorizer","lambda":{"arn":"arn:aws:lambda:us-west-2:REDACTED:function:production-lambda-authorizer","request_id":"268d203b-b278-4376-8f06-1b3f467fe629"},"code":13,"level":"error","service":"lambda-authorizer","details":"Received RST_STREAM with code 2 (Internal server error)","lambda_unique_id":"543e4bb7-5222-4292-9a2e-7997451b9244","status":"info","timestamp":1698007035347,"lambda_boot_time":"2023-10-22T19:58:58.662Z"}
D 2023-10-22T20:37:15.369Z | keepalive | (22) 10.132.2.131:443 Ping failed with error HTTP2 ping cancelled
D 2023-10-22T20:37:15.426Z | keepalive | (22) 10.132.2.131:443 Starting keepalive timer for 50000ms
D 2023-10-22T20:37:15.426Z | keepalive | (22) 10.132.2.131:443 Received ping response
D 2023-10-22T20:37:15.426Z | transport | (22) 10.132.2.131:443 session closed
D 2023-10-22T20:37:15.448Z | dns_resolver | Resolved addresses for target dns:REDACTED_INTERNAL_HOSTNAME:443: [10.132.2.131:443,10.132.1.153:443]
D 2023-10-22T20:37:15.486Z | subchannel | (24) 10.132.1.153:443 Subchannel constructed with options { "grpc.keepalive_time_ms": 50000, "grpc.keepalive_permit_without_calls": 1}
D 2023-10-22T20:37:15.506Z | connectivity_state | (1) dns:REDACTED_INTERNAL_HOSTNAME:443 CONNECTING -> CONNECTING
D 2023-10-22T20:37:15.506Z | resolving_load_balancer | dns:REDACTED_INTERNAL_HOSTNAME:443 CONNECTING -> CONNECTING
D 2023-10-22T20:37:15.506Z | pick_first | IDLE -> CONNECTING
D 2023-10-22T20:37:15.506Z | pick_first | Start connecting to subchannel with address 10.132.2.131:443
D 2023-10-22T20:37:15.506Z | subchannel_refcount | (24) 10.132.1.153:443 refcount 1 -> 2
D 2023-10-22T20:37:15.506Z | subchannel_refcount | (21) 10.132.2.131:443 refcount 1 -> 2
D 2023-10-22T20:37:15.506Z | subchannel_refcount | (24) 10.132.1.153:443 refcount 0 -> 1
D 2023-10-22T20:37:15.506Z | transport | dns:REDACTED_INTERNAL_HOSTNAME:443 creating HTTP/2 session to 10.132.2.131:443
D 2023-10-22T20:37:15.506Z | subchannel | (21) 10.132.2.131:443 IDLE -> CONNECTING
D 2023-10-22T20:37:15.508Z | connectivity_state | (2) dns:REDACTED_INTERNAL_HOSTNAME:443 CONNECTING -> CONNECTING
D 2023-10-22T20:37:15.508Z | resolving_load_balancer | dns:REDACTED_INTERNAL_HOSTNAME:443 CONNECTING -> CONNECTING
D 2023-10-22T20:37:15.508Z | pick_first | IDLE -> CONNECTING
D 2023-10-22T20:37:15.508Z | subchannel_refcount | (24) 10.132.1.153:443 refcount 2 -> 3
D 2023-10-22T20:37:15.508Z | subchannel_refcount | (21) 10.132.2.131:443 refcount 2 -> 3
D 2023-10-22T20:37:15.508Z | dns_resolver | Resolved addresses for target dns:REDACTED_INTERNAL_HOSTNAME:443: [10.132.2.131:443,10.132.1.153:443]
D 2023-10-22T20:37:15.509Z | subchannel_refcount | (24) 10.132.1.153:443 refcount 4 -> 3
D 2023-10-22T20:37:15.509Z | subchannel_refcount | (21) 10.132.2.131:443 refcount 4 -> 3
D 2023-10-22T20:37:15.509Z | subchannel_refcount | (24) 10.132.1.153:443 refcount 3 -> 4
D 2023-10-22T20:37:15.509Z | subchannel_refcount | (21) 10.132.2.131:443 refcount 3 -> 4
D 2023-10-22T20:37:15.509Z | connectivity_state | (1) dns:REDACTED_INTERNAL_HOSTNAME:443 CONNECTING -> CONNECTING
D 2023-10-22T20:37:15.509Z | resolving_load_balancer | dns:REDACTED_INTERNAL_HOSTNAME:443 CONNECTING -> CONNECTING
D 2023-10-22T20:37:15.509Z | pick_first | CONNECTING -> CONNECTING
D 2023-10-22T20:37:15.509Z | subchannel_refcount | (24) 10.132.1.153:443 refcount 4 -> 3
D 2023-10-22T20:37:15.509Z | subchannel_refcount | (21) 10.132.2.131:443 refcount 4 -> 3
D 2023-10-22T20:37:15.509Z | subchannel_refcount | (24) 10.132.1.153:443 refcount 3 -> 4
D 2023-10-22T20:37:15.509Z | subchannel_refcount | (21) 10.132.2.131:443 refcount 3 -> 4
D 2023-10-22T20:37:15.510Z | resolving_load_balancer | dns:REDACTED_INTERNAL_HOSTNAME:443 CONNECTING -> CONNECTING
D 2023-10-22T20:37:15.509Z | pick_first | CONNECTING -> CONNECTING
D 2023-10-22T20:37:15.566Z | transport | (23) 10.132.1.153:443 session closed
D 2023-10-22T20:37:15.510Z | connectivity_state | (2) dns:REDACTED_INTERNAL_HOSTNAME:443 CONNECTING -> CONNECTING
D 2023-10-22T20:37:15.766Z | keepalive | (25) 10.132.2.131:443 Starting keepalive timer for 50000ms
D 2023-10-22T20:37:15.786Z | subchannel_refcount | (21) 10.132.2.131:443 refcount 4 -> 3
D 2023-10-22T20:37:15.786Z | subchannel_refcount | (21) 10.132.2.131:443 refcount 3 -> 4
D 2023-10-22T20:37:15.786Z | pick_first | Pick subchannel with address 10.132.2.131:443
D 2023-10-22T20:37:15.786Z | subchannel | (21) 10.132.2.131:443 CONNECTING -> READY
D 2023-10-22T20:37:15.787Z | connectivity_state | (2) dns:REDACTED_INTERNAL_HOSTNAME:443 CONNECTING -> READY
D 2023-10-22T20:37:15.787Z | resolving_load_balancer | dns:REDACTED_INTERNAL_HOSTNAME:443 CONNECTING -> READY
D 2023-10-22T20:37:15.787Z | pick_first | CONNECTING -> READY
D 2023-10-22T20:37:15.787Z | subchannel_refcount | (24) 10.132.1.153:443 refcount 2 -> 1
D 2023-10-22T20:37:15.787Z | subchannel_refcount | (21) 10.132.2.131:443 refcount 4 -> 3
D 2023-10-22T20:37:15.787Z | subchannel_refcount | (21) 10.132.2.131:443 refcount 3 -> 4
D 2023-10-22T20:37:15.787Z | pick_first | Pick subchannel with address 10.132.2.131:443
D 2023-10-22T20:37:15.787Z | connectivity_state | (1) dns:REDACTED_INTERNAL_HOSTNAME:443 CONNECTING -> READY
D 2023-10-22T20:37:15.787Z | resolving_load_balancer | dns:REDACTED_INTERNAL_HOSTNAME:443 CONNECTING -> READY
D 2023-10-22T20:37:15.787Z | pick_first | CONNECTING -> READY
D 2023-10-22T20:37:15.786Z | subchannel_refcount | (24) 10.132.1.153:443 refcount 3 -> 2
2023-10-22T20:37:15.789Z 268d203b-b278-4376-8f06-1b3f467fe629 ERROR [dd.trace_id=3303256522662915096 dd.span_id=6980096509857599035] Invoke Error {"errorType":"Error","errorMessage":"13 INTERNAL: Received RST_STREAM with code 2 (Internal server error)","code":13,"details":"Received RST_STREAM with code 2 (Internal server error)","metadata":{},"level":"error","lambda_unique_id":"543e4bb7-5222-4292-9a2e-7997451b9244","lambda_boot_time":"2023-10-22T19:58:58.662Z","stack":["Error: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error)"," at callErrorFromStatus (/var/task/node_modules/@grpc/grpc-js/build/src/call.js:31:19)"," at Object.onReceiveStatus (/var/task/node_modules/@grpc/grpc-js/build/src/client.js:192:76)"," at Object.onReceiveStatus (/var/task/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:360:141)"," at Object.onReceiveStatus (/var/task/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:323:181)"," at /var/task/node_modules/@grpc/grpc-js/build/src/resolving-call.js:99:78"," at process.processTicksAndRejections (node:internal/process/task_queues:77:11)","for call at"," at ServiceClientImpl.makeUnaryRequest (/var/task/node_modules/@grpc/grpc-js/build/src/client.js:160:32)"," at ServiceClientImpl. (/var/task/node_modules/@grpc/grpc-js/build/src/make-client.js:105:19)"," at /opt/nodejs/node_modules/dd-trace/packages/datadog-instrumentations/src/grpc/client.js:159:27"," at run (/opt/nodejs/node_modules/dd-trace/packages/diagnostics_channel/src/index.js:77:22)"," at AsyncResourceStorage.run (/opt/nodejs/node_modules/dd-trace/packages/datadog-core/src/storage/async_resource.js:69:14)"," at /opt/nodejs/node_modules/dd-trace/packages/diagnostics_channel/src/index.js:106:18"," at Channel.runStores.ActiveChannelPrototype.runStores (/opt/nodejs/node_modules/dd-trace/packages/diagnostics_channel/src/index.js:86:12)"," at callMethod (/opt/nodejs/node_modules/dd-trace/packages/datadog-instrumentations/src/grpc/client.js:149:23)"," at ServiceClientImpl.wrapped [as authenticate] (/opt/nodejs/node_modules/dd-trace/packages/datadog-instrumentations/src/grpc/client.js:96:12)"," at /var/task/api_key_authorizer_lambda.js:209:43"]}
D 2023-10-22T20:37:15.907Z | transport | (25) 10.132.2.131:443 local settings acknowledged by remote: {"headerTableSize":4096,"enablePush":true,"initialWindowSize":65535,"maxFrameSize":16384,"maxConcurrentStreams":4294967295,"maxHeaderListSize":4294967295,"maxHeaderSize":4294967295,"enableConnectProtocol":false}

dennisameling avatar Oct 23 '23 07:10 dennisameling

The GOAWAY is not the problem here. A GOAWAY with code 0 indicates "no error", i.e. that the client should stop creating new streams on the connection, but existing streams can still finish. The RST_STREAM with code 2 is "Internal server error", so in my opinion that indicates that there is a bug in the ALB. If the server is intentionally not processing that stream because it raced with the GOAWAY, it should respond with an RST_STREAM with code 7, which means "refused stream". In that case the gRPC library would transparently retry the request on a new connection, and you would never see the error. The ping failure just means that the ALB closed the connection, which is unsurprising if it told the client to go away and it closed the only open stream.

To summarize, I believe that the problem here is the RST_STREAM with error code 2, and that the ALB is behaving incorrectly by sending it.
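
For reference, the HTTP/2 error codes being discussed are exposed as constants by Node's built-in http2 module:

const { constants } = require('http2');
console.log(constants.NGHTTP2_INTERNAL_ERROR); // 2, "internal error" (what the client is receiving)
console.log(constants.NGHTTP2_REFUSED_STREAM); // 7, "refused stream" (what gRPC would transparently retry)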

murgatroid99 avatar Oct 23 '23 17:10 murgatroid99

Thanks once again for your quick reply and the context it provides. I'll reach out to our Account Manager at AWS to see if we can engage the ALB team somehow. I'd really like to figure out if this is indeed an ALB bug and if AWS can fix it. To be continued! 👀

As a workaround, I'm now just creating and closing gRPC clients on every Lambda invocation. So in the example I provided above, creating and closing 177 clients in a 70-minute timespan instead of maintaining 1 long-lived client for the entire 70 minutes. Not ideal because of the overhead, but we haven't seen the RST_STREAM issues since.
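
Roughly, the workaround looks like this (a sketch; makeGrpcClient and callService are placeholders for our actual client factory and RPC call):

exports.handler = async (event) => {
  // Create a fresh client per invocation instead of reusing a long-lived one.
  const client = makeGrpcClient(); // hypothetical factory that builds the service client
  try {
    return await callService(client, event); // hypothetical wrapper around the actual RPC
  } finally {
    // close() tears down the channel so no idle connection is left behind.
    client.close();
  }
};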

dennisameling avatar Oct 24 '23 12:10 dennisameling

This issue has been open since 2021 and is still not solved?

We are also facing this issue while querying Google BigQuery projects from the Power BI Windows desktop app using the Google BigQuery connector.

I am getting this error for big projects with a large number of rows: Failed to save modifications to the server. Error returned: 'OLE DB or ODBC error: [DataSource.Error] ERROR [HY000] [Microsoft][BigQuery] (310) Storage API Error: Error: { code: UNKNOWN, message: Stream removed }. '. I am not using the ODBC driver but the Google BigQuery connector to connect to GCP BigQuery, though I do have the Simba ODBC driver installed in my VM.

PS: I think the Google BigQuery connector internally uses the Simba ODBC driver.

mehulparmariitr avatar Apr 03 '24 10:04 mehulparmariitr

@mehulparmariitr This issue cannot be "solved" entirely because not every instance of this error is the fault of this library.

And it looks like you are not even getting this error at all. This issue is about errors with the code "INTERNAL" and the text "Received RST_STREAM with error code 2". Your error has the code "UNKNOWN" and the text "Stream removed". That error is in fact mostly solved in more recent versions of this library, so I recommend upgrading the dependency on @grpc/grpc-js if possible.

murgatroid99 avatar Apr 03 '24 16:04 murgatroid99

@murgatroid99 This issue in Power BI is also related to a gRPC error. I have seen a few people reporting the same error in this public ticket and GitHub issue: https://issuetracker.google.com/issues/240305081

Unfortunately, the fix needs to be implemented by the Microsoft team in the Google BigQuery connector in Power BI, to perform retries.

mehulparmariitr avatar Apr 04 '24 05:04 mehulparmariitr