RST_STREAM error keeps showing up

Open jakeleventhal opened this issue 5 years ago • 168 comments

Environment details

  • OS: macOS Catalina 10.15.5 Beta (19F53f)
  • Node.js version: 13.7.0
  • npm version: 6.13.6
  • @google-cloud/firestore version: 3.7.4

Steps to reproduce

This error keeps appearing over and over in my logs (not regularly reproducible):

Error: 13 INTERNAL: Received RST_STREAM with code 2
    at Object.callErrorFromStatus (/api/node_modules/@grpc/grpc-js/src/call.ts:81)
    at Object.onReceiveStatus (/api/node_modules/@grpc/grpc-js/src/client.ts:324)
    at Object.onReceiveStatus (/api/node_modules/@grpc/grpc-js/src/client-interceptors.ts:439)
    at Object.onReceiveStatus (/api/node_modules/@grpc/grpc-js/src/client-interceptors.ts:402)
    at Http2CallStream.outputStatus (/api/node_modules/@grpc/grpc-js/src/call-stream.ts:228)
    at Http2CallStream.maybeOutputStatus (/api/node_modules/@grpc/grpc-js/src/call-stream.ts:278)
    at Http2CallStream.endCall (/api/node_modules/@grpc/grpc-js/src/call-stream.ts:262)
    at ClientHttp2Stream.<anonymous> (/api/node_modules/@grpc/grpc-js/src/call-stream.ts:532)
    at ClientHttp2Stream.emit (events.js:315)
    at ClientHttp2Stream.EventEmitter.emit (domain.js:485)
    at emitErrorCloseNT (internal/streams/destroy.js:76)
    at processTicksAndRejections (internal/process/task_queues.js:84)

jakeleventhal avatar Apr 16 '20 00:04 jakeleventhal

@jakeleventhal Thanks for filing this. Do you know at all what type of request/API call this is related to?

schmidt-sebastian avatar Apr 17 '20 23:04 schmidt-sebastian

The only place I use gRPC is via Firestore, so it must be coming from there. Unfortunately I don't have more info.

jakeleventhal avatar Apr 19 '20 04:04 jakeleventhal

I also got this one with firebase-admin 8.12.1:

{"error":"Error: 13 INTERNAL: Received RST_STREAM with code 2
    at Object.callErrorFromStatus (/usr/src/app/node_modules/@grpc/grpc-js/build/src/call.js:30:26)
    at Object.onReceiveStatus (/usr/src/app/node_modules/@grpc/grpc-js/build/src/client.js:175:52)
    at Object.onReceiveStatus ...

whyvrafvr avatar May 15 '20 13:05 whyvrafvr

This is a strange, erratic issue: it is thrown on delete and create requests alike, whatever the request type...

whyvrafvr avatar May 15 '20 13:05 whyvrafvr

All document modifications (e.g. delete(), create()) use the same underlying request type. This request type is not safe for automatic retry, so unfortunately we cannot simply retry these errors.
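
If a particular write is idempotent at the application level, the caller can still retry it manually. A minimal sketch (the retried error codes, attempt count, and backoff below are illustrative only, not an SDK recommendation):

// Hypothetical application-level retry for writes the caller knows are idempotent.
const INTERNAL = 13;    // "Received RST_STREAM with code 2" surfaces with this code
const UNAVAILABLE = 14;

async function writeWithRetry(write, maxAttempts = 3) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await write();
    } catch (err) {
      const retriable = err.code === INTERNAL || err.code === UNAVAILABLE;
      if (!retriable || attempt >= maxAttempts) throw err;
      // Simple linear backoff; tune for your workload.
      await new Promise((resolve) => setTimeout(resolve, 250 * attempt));
    }
  }
}

// Usage, e.g.: await writeWithRetry(() => docRef.set({ updatedAt: Date.now() }));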

schmidt-sebastian avatar May 18 '20 19:05 schmidt-sebastian

Hi @schmidt-sebastian. I can confirm the request type is not safe to retry: retrying the same call (and so, I guess, the same request) many times has no significant effect. But there is no error once the Node.js instance is restarted: I suspect a gRPC subroutine is hanging.

whyvrafvr avatar May 20 '20 18:05 whyvrafvr

After a while, the errors mentioned by @jakeleventhal are thrown, it is no longer possible to persist data to Firestore, and an instance restart is required. That's a real problem, folks 😄 FYI, I have no issues using the Firestore delete() and create() methods with Go...

whyvrafvr avatar May 20 '20 18:05 whyvrafvr

@skonx To rule out that this is a problem with our backend, can you send me your project ID and the time at which you saw these repeated failures? My email is mrschmidt<at>google.com.

If it is not a backend issue, we likely need gRPC logs to further diagnose these failures. These can be obtained by setting two environment variables: GRPC_TRACE=all and GRPC_VERBOSITY=DEBUG.

Note that this creates a lot of logs, so hopefully this is not something we have to resort to.
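
For reference, a quick sketch of enabling this (assuming the variables are set before @grpc/grpc-js is first loaded; exporting them in the shell before starting the process works just as well):

// Sketch only: enable verbose gRPC logging at the very top of the entrypoint,
// before @grpc/grpc-js (pulled in by @google-cloud/firestore) is required.
// Equivalent shell form: GRPC_TRACE=all GRPC_VERBOSITY=DEBUG node index.js
process.env.GRPC_TRACE = 'all';
process.env.GRPC_VERBOSITY = 'DEBUG';

const { Firestore } = require('@google-cloud/firestore');
const firestore = new Firestore();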

schmidt-sebastian avatar May 20 '20 21:05 schmidt-sebastian

Yep. I've replied to your google.com email address. I will enable the environment variables ASAP and keep you posted.

whyvrafvr avatar May 20 '20 23:05 whyvrafvr

Hello! We also experience the same error, usually coupled with Error: 8 RESOURCE_EXHAUSTED: Bandwidth exhausted.

We noticed that it usually fails all at once and then recovers very quickly.

ConorEB avatar May 26 '20 00:05 ConorEB

We are also affected by this error. A month ago it was sporadic, maybe once a week; now we are seeing it many times per day.

sk- avatar May 26 '20 14:05 sk-

I've also been experiencing the issue for over a month now, with a couple hundred to a few thousand errors per day:

  • Error: 13 INTERNAL: Received RST_STREAM with code 2
  • Error: 14 UNAVAILABLE: Stream refused by server

@schmidt-sebastian I opened a support case if you need a project ID or other information: 23689958.

merlinnot avatar May 29 '20 08:05 merlinnot

Same problem here: Received RST_STREAM with code 2

zenati avatar May 29 '20 17:05 zenati

We were able to resolve the Error: 8 RESOURCE_EXHAUSTED: Bandwidth exhausted by upgrading to the Node.js 12 runtime on App Engine and updating Electron to the newly released v9. We have not seen it again in production since; the root cause appears to have been addressed by this Node.js patch: https://nodejs.org/en/blog/release/v12.14.1/

[9fd6b5e98b] - http2: fix session memory accounting after pausing (Michael Lehenbauer) #30684

After the update, the various other DB-related errors have also gone away, including Received RST_STREAM with code 2.

We currently use Electron in a Google App Engine project on the standard runtime. Google has not updated its images to include some libraries required by the new version of Electron. We were able to work around this because Google supports those libraries for Puppeteer, so we installed Puppeteer and sent a deploy off to our production server (with the no-promote flag). Our best guess is that Google then rebuilt our server images with the libraries needed to run Puppeteer, which in turn allowed us to run the new version of Electron.

I hope this information helps! We spent a lot of time diving into this, so if you have any questions, feel free to respond below.

ConorEB avatar May 29 '20 18:05 ConorEB

Our backend team has looked at the projects that were sent to us. There are no indications that your errors correlate with errors on the backend, which makes it likely that this is a client-side issue. I will continue to investigate possible root causes.

schmidt-sebastian avatar May 30 '20 04:05 schmidt-sebastian

As of v3.8.5, we are now retrying RunQuery requests that fail with RST_STREAM. If this error shows up again, we can evaluate expanding our retry to other SDK methods as well.

schmidt-sebastian avatar Jun 10 '20 17:06 schmidt-sebastian

@schmidt-sebastian I've been running on v3.8.5 for the entire day and I still see RST_STREAM.

I checked my code to see for which usage patterns it occurs:

  • await reference.set()
  • await batch.commit()
  • await reference.update()

Can we please reopen this issue for visibility?

merlinnot avatar Jun 10 '20 17:06 merlinnot

@merlinnot We continue to receive pushback on retrying RST_STREAM/INTERNAL for Commit RPCs. I hope your comment is enough to convince the respective folks that we should retry.

schmidt-sebastian avatar Jun 10 '20 17:06 schmidt-sebastian

Thank you. If needed, I can provide a histogram or any other details, just let me know.

merlinnot avatar Jun 10 '20 17:06 merlinnot

This is happening to a project I'm working on as well, using the Node.js 10 runtime in Google Cloud Functions (via Firebase). I have four scripts that fail at least once a day; they make await reference.set() calls to Firestore as they execute. At the moment I don't have any retry logic. The scripts run every two minutes and make at most 20 set() calls.
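
For context, the general shape is roughly the sketch below (collection, field, and helper names are made up; fetchItems() stands in for whatever produces the data):

const functions = require('firebase-functions');
const admin = require('firebase-admin');

admin.initializeApp();
const db = admin.firestore();

// Runs every two minutes and performs up to ~20 set() calls, roughly like the
// scripts described above. The try/catch only logs which write failed; there
// is still no retry logic.
exports.syncJob = functions.pubsub.schedule('every 2 minutes').onRun(async () => {
  const items = await fetchItems(); // hypothetical data source
  for (const item of items) {
    try {
      await db.collection('items').doc(item.id).set(item);
    } catch (err) {
      console.error(`set() failed for items/${item.id}:`, err.code, err.message);
      throw err;
    }
  }
});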

ururk avatar Jun 11 '20 14:06 ururk

I was able to reproduce this with gRPC trace logging enabled. I hope this will allow us to make progress here:

2020-06-12T21:42:31.665Z | call_stream | [1518132] attachHttp2Stream from subchannel 172.217.214.95:443
2020-06-12T21:42:31.665Z | subchannel | 172.217.214.95:443 callRefcount 99 -> 100
2020-06-12T21:42:31.665Z | call_stream | [1518132] sending data chunk of length 130 (deferred)
2020-06-12T21:42:31.665Z | call_stream | [1518132] calling end() on HTTP/2 stream
Firestore (3.8.1) 2020-06-12T21:42:31.714Z bH7ni [WriteBatch.commit]: Sending 1 writes
Firestore (3.8.1) 2020-06-12T21:42:31.714Z bH7ni [ClientPool.acquire]: Creating a new client
Firestore (3.8.1) 2020-06-12T21:42:32.284Z ##### [Firestore]: Initialized Firestore GAPIC Client
Firestore (3.8.1) 2020-06-12T21:42:32.284Z bH7ni [Firestore.request]: Sending request: {"database":"projects/xyz/databases/(default)","writes":[{"update":{"name":"projects/xyz/databases/(default)/documents/a/DpuiiezGapQVLrrdqFfo","fields":{}},"currentDocument":{"exists":false}}]}
Firestore (3.8.1) 2020-06-12T21:42:32.548Z G31QV [WriteBatch.commit]: Sending 1 writes
Firestore (3.8.1) 2020-06-12T21:42:32.548Z G31QV [ClientPool.acquire]: Re-using existing client with 99 remaining operations
Firestore (3.8.1) 2020-06-12T21:42:32.548Z G31QV [Firestore.request]: Sending request: {"database":"projects/xyz/databases/(default)","writes":[{"update":{"name":"projects/xyz/databases/(default)/documents/a/Ua5mu1wRhNRExZLrX6K5","fields":{}},"currentDocument":{"exists":false}}]}
Firestore (3.8.1) 2020-06-12T21:42:32.599Z YsE7h [WriteBatch.commit]: Sending 1 writes
Firestore (3.8.1) 2020-06-12T21:42:32.599Z YsE7h [ClientPool.acquire]: Re-using existing client with 98 remaining operations
Firestore (3.8.1) 2020-06-12T21:42:32.599Z YsE7h [Firestore.request]: Sending request: {"database":"projects/xyz/databases/(default)","writes":[{"update":{"name":"projects/xyz/databases/(default)/documents/a/EqrauWpiU7wXdJWnFJyb","fields":{}},"currentDocument":{"exists":false}}]}
...
2020-06-12T21:42:33.833Z | call_stream | [1518033] HTTP/2 stream closed with code 2
2020-06-12T21:42:33.833Z | call_stream | [1518033] ended with status: code=13 details="Received RST_STREAM with code 2"
2020-06-12T21:42:33.833Z | subchannel | 172.217.214.95:443 callRefcount 100 -> 99
2020-06-12T21:42:33.833Z | call_stream | [1518034] HTTP/2 stream closed with code 2
2020-06-12T21:42:33.833Z | call_stream | [1518034] ended with status: code=13 details="Received RST_STREAM with code 2"
2020-06-12T21:42:33.833Z | subchannel | 172.217.214.95:443 callRefcount 99 -> 98
2020-06-12T21:42:33.833Z | call_stream | [1518035] HTTP/2 stream closed with code 2
2020-06-12T21:42:33.833Z | call_stream | [1518035] ended with status: code=13 details="Received RST_STREAM with code 2"
2020-06-12T21:42:33.833Z | subchannel | 172.217.214.95:443 callRefcount 98 -> 97
2020-06-12T21:42:33.833Z | call_stream | [1518036] HTTP/2 stream closed with code 2
2020-06-12T21:42:33.833Z | call_stream | [1518036] ended with status: code=13 details="Received RST_STREAM with code 2"
2020-06-12T21:42:33.833Z | subchannel | 172.217.214.95:443 callRefcount 97 -> 96
...
2020-06-12T21:42:33.838Z | subchannel | 172.217.214.95:443 callRefcount 1 -> 0
2020-06-12T21:42:33.839Z | subchannel | 172.217.214.95:443 connection closed with error read ECONNRESET
2020-06-12T21:42:33.839Z | subchannel | 172.217.214.95:443 READY -> IDLE
2020-06-12T21:42:33.839Z | subchannel | 172.217.214.95:443 refcount 2 -> 1
2020-06-12T21:42:33.839Z | pick_first | READY -> IDLE
2020-06-12T21:42:33.839Z | resolving_load_balancer | dns:firestore.googleapis.com:443 READY -> IDLE
2020-06-12T21:42:33.839Z | connectivity_state | dns:firestore.googleapis.com:443 READY -> IDLE
Firestore (3.8.1) 2020-06-12T21:42:33.842Z mQ1lw [Firestore.request]: Received error: { Error: 13 INTERNAL: Received RST_STREAM with code 2
    at Object.callErrorFromStatus (/home/mrschmidt/test/node_modules/@grpc/grpc-js/build/src/call.js:30:26)
    at Object.onReceiveStatus (/home/mrschmidt/test/node_modules/@grpc/grpc-js/build/src/client.js:175:52)
    at Object.onReceiveStatus (/home/mrschmidt/test/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:341:141)
    at Object.onReceiveStatus (/home/mrschmidt/test/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:304:181)
    at Http2CallStream.outputStatus (/home/mrschmidt/test/node_modules/@grpc/grpc-js/build/src/call-stream.js:116:74)
    at Http2CallStream.maybeOutputStatus (/home/mrschmidt/test/node_modules/@grpc/grpc-js/build/src/call-stream.js:155:22)
    at Http2CallStream.endCall (/home/mrschmidt/test/node_modules/@grpc/grpc-js/build/src/call-stream.js:141:18)
    at ClientHttp2Stream.stream.on (/home/mrschmidt/test/node_modules/@grpc/grpc-js/build/src/call-stream.js:403:22)
    at emitOne (events.js:116:13)
    at ClientHttp2Stream.emit (events.js:211:7)
  code: 13,
  details: 'Received RST_STREAM with code 2',
  metadata: Metadata { internalRepr: Map {}, options: {} } }

schmidt-sebastian avatar Jun 15 '20 17:06 schmidt-sebastian

I got the error once today, maybe it helps:

Error: 13 INTERNAL: Received RST_STREAM with code 2
    at Object.callErrorFromStatus (/srv/node_modules/@grpc/grpc-js/build/src/call.js:30:26)
    at Object.onReceiveStatus (/srv/node_modules/@grpc/grpc-js/build/src/client.js:175:52)
    at Object.onReceiveStatus (/srv/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:341:141)
    at Object.onReceiveStatus (/srv/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:304:181)
    at Http2CallStream.outputStatus (/srv/node_modules/@grpc/grpc-js/build/src/call-stream.js:115:74)
    at Http2CallStream.maybeOutputStatus (/srv/node_modules/@grpc/grpc-js/build/src/call-stream.js:154:22)
    at Http2CallStream.endCall (/srv/node_modules/@grpc/grpc-js/build/src/call-stream.js:140:18)
    at ClientHttp2Stream.stream.on (/srv/node_modules/@grpc/grpc-js/build/src/call-stream.js:396:22)
    at emitOne (events.js:116:13)
    at ClientHttp2Stream.emit (events.js:211:7)
    at emit (internal/http2/core.js:177:8)
    at _combinedTickCallback (internal/process/next_tick.js:142:11)
    at process._tickDomainCallback (internal/process/next_tick.js:219:9)

// Imports as presumably used in the original project; `increment` is defined
// elsewhere in the original code, likely something like FieldValue.increment(1).
const functions = require('firebase-functions');
const admin = require('firebase-admin');
const increment = admin.firestore.FieldValue.increment(1);

exports.onTicketMessageCreate = functions.firestore
  .document('tickets/{ticketId}/messages/{messageId}')
  .onCreate(async (snap, context) => {
    const message = snap.data();
    // Update the parent ticket's message count and last-message date.
    await snap.ref.parent.parent?.update({
      messages_count: increment,
      'dates.last_message': message?.dates.created,
    });
  });

leandroz avatar Jun 17 '20 12:06 leandroz

This might be a problem in grpc/grpc-js: I don't use nodejs-firestore and I also see this error once in a while. It happens with stale connections.

tommyjs007 avatar Jun 21 '20 05:06 tommyjs007

I'm also seeing very intermittent Received RST_STREAM with code 2, which in my case is followed by an endless pile of Failed to start HTTP/2 stream and Error: Exceeded maximum number of retries allowed. until the server process is terminated. I'm running Node.js 14 in a GKE pod. I'm not sure why the client can't automatically recover; it feels gross to set up a liveness probe to kill the process just because of this issue.

danopia avatar Jun 25 '20 06:06 danopia

I filed b/160000476 (Google internal). I do hope that we can get some traction.

schmidt-sebastian avatar Jun 26 '20 17:06 schmidt-sebastian

@schmidt-sebastian Do you have some updates on the issue? On my side, just regular outages, nothing new 😂

merlinnot avatar Jul 01 '20 19:07 merlinnot

@schmidt-sebastian Do you have some updates on the issue? On my side, just regular outages, nothing new 😂

Same...

whyvrafvr avatar Jul 01 '20 19:07 whyvrafvr

Had the same issue today: Received RST_STREAM with code 2 followed by Failed to start HTTP/2 stream. Running on GKE; we had to recycle the pod to resolve the outage.

@google-cloud/firestore version 4.0.0

fsikansi avatar Jul 01 '20 22:07 fsikansi

Ran into the same Error: 13 INTERNAL: Received RST_STREAM with code 2 running Node.js 10 in Google Cloud Functions while trying to delete a document from Firestore. If it helps, the function is a cron job called via HTTP, and the only thing in the document was an array of fifty elements. I have redeployed to the Node.js 8 environment and have not seen the issue since. Has anyone else tried different Node versions and had any success?

planetsanzio avatar Jul 02 '20 03:07 planetsanzio

@JustinBeckwith @murgatroid99 I saw your involvement in https://github.com/nodejs/help/issues/2105, which seems related at first glance. Thought I'd tag you here in case it might be useful, especially if this is an issue with Node.js or @grpc/grpc-js.

I did more research on this topic with trace logs enabled for gRPC channels (don't blame me for killing Google Cloud Logging 😉).

What I see in my logs is:

  1. First, I see an abnormally high number of Pick result: QUEUE subchannel: undefined status: undefined undefined. Normally, when the application is healthy, I see that next to dns:firestore.googleapis.com:443 createCall [251818] method="/google.firestore.v1.Firestore/Listen". Please see the two screenshots below: the first one is "regular" (the application works fine), the second one is from right before further errors happen.

    [Screenshots: 2020-07-02 09:49:42 and 2020-07-02 09:50:46]
  2. Then I see an error thrown a couple of times: Error: 13 INTERNAL: Received RST_STREAM with code 2. FYI, "disconnecting" on the screenshot is my application's log statement, not related to the issue.

    [Screenshot: 2020-07-02 09:52:50]
  3. Then I can still see a few logs (not a lot, maybe 20?) without errors:

    [Screenshot: 2020-07-02 09:50:46]
  4. Then I see failures due to the session being destroyed: Failed to start call on picked subchannel 216.58.205.42:443 with error The session has been destroyed. Ending call. That error is logged ~200 times.

    [Screenshot: 2020-07-02 09:57:09]
  5. Next, it goes back to Pick result: QUEUE subchannel: undefined status: undefined undefined next to createCall.

    [Screenshot: 2020-07-02 09:58:43]
  6. Then again Pick result: QUEUE subchannel: undefined status: undefined undefined, but this time with no createCall. That happens a couple hundred times.

    [Screenshot: 2020-07-02 10:01:02]
  7. I can then see some Pick result: COMPLETE subchannel: 216.58.198.170:443 status: undefined undefined, maybe a few dozen.

    [Screenshot: 2020-07-02 10:03:26]
  8. It then starts happening next to createCall:

    [Screenshot: 2020-07-02 10:05:31]
  9. Next, it starts alternating between QUEUE and COMPLETE:

    [Screenshot: 2020-07-02 10:07:11]
  10. And then the worst starts, Failed to start call on picked subchannel 216.58.204.42:443 with error New streams cannot be created after receiving a GOAWAY. Retrying pick. It goes on forever until the node is restarted.

    [Screenshot: 2020-07-02 10:20:28]

For any Googler who'd like to examine the logs themselves:

  • Support case ID: 23689958.
  • Logging query: (resource.type="gce_instance" AND resource.labels.instance_id="1029783437173675621") OR (resource.type="global" AND jsonPayload.instance.id="1029783437173675621").
  • Timestamp where it all starts: 2020-07-02T06:28:41, you can also see timestamps in my screenshots above.

merlinnot avatar Jul 02 '20 08:07 merlinnot