fcl-js

Occasional GRPC crashes with testnet transactions

Open gta14 opened this issue 3 years ago • 14 comments

Problem

"@onflow/fcl": "1.0.2", "@onflow/transport-grpc": "1.0.1"

Hi, this is Todd at GigLabs. When executing transactions on testnet via FCL, we are getting intermittent errors. Not all the time, but often enough that it's causing problems for our Ops and QA teams trying to test things. No reports of this on mainnet so far. All of these calls are wrapped in a try/catch block in our code, but it seems like our Node process experiences a hard crash in sdk-send-grpc.js. Can you help me figure out 1) the cause of this error and 2) how the exception can be caught? Right now we have no visibility into what happened and we pretty much have to restart our server. Presumably catching it would allow our app to access statusMessage. I have tried in the past to switch from GRPC to HTTP, but that didn't allow me to query events reliably (other random errors), so we are still using GRPC. Thank you!

/usr/src/portal-api/node_modules/@onflow/transport-grpc/dist/sdk-send-grpc.js:24
reject(new Error(statusMessage));
^
Error
    at Object.onEnd (/usr/src/portal-api/node_modules/@onflow/transport-grpc/dist/sdk-send-grpc.js:24:22)
    at /usr/src/portal-api/node_modules/@improbable-eng/grpc-web/dist/grpc-web-client.js:1:24233
    at /usr/src/portal-api/node_modules/@improbable-eng/grpc-web/dist/grpc-web-client.js:1:11490
    at Array.forEach (<anonymous>)
    at e.rawOnError (/usr/src/portal-api/node_modules/@improbable-eng/grpc-web/dist/grpc-web-client.js:1:11452)
    at e.onTransportHeaders (/usr/src/portal-api/node_modules/@improbable-eng/grpc-web/dist/grpc-web-client.js:1:9025)
    at NodeHttp.responseCallback (/usr/src/portal-api/node_modules/@improbable-eng/grpc-web-node-http-transport/lib/index.js:31:22)
    at Object.onceWrapper (node:events:514:26)
    at ClientRequest.emit (node:events:394:28)
    at ClientRequest.emit (node:domain:470:12)
    at HTTPParser.parserOnIncomingClient (node:_http_client:621:27)
    at HTTPParser.parserOnHeadersComplete (node:_http_common:127:17)
    at TLSSocket.socketOnData (node:_http_client:487:22)
    at TLSSocket.emit (node:events:394:28)
    at TLSSocket.emit (node:domain:470:12)
    at addChunk (node:internal/streams/readable:312:12)

Steps to Reproduce

Happens randomly when doing things like minting NFTs on testnet. Seems to be more frequent when running multiple transactions in parallel with different keys.

Context

It becomes a dev ops problem for us, and it blocks our ops and QA teams from doing their jobs on testnet in preparation for production NFT drops.

gta14 avatar Sep 15 '22 17:09 gta14

For a quick fix for the "hard crash", you can try:

process.on('uncaughtException', (err, origin) => {
    console.log(err);
});
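Since the transport surfaces failures as Promise rejections rather than thrown exceptions, a process-level unhandledRejection handler can also be registered (a sketch; whether it fires depends on how the FCL calls are consumed, and on newer Node versions an unhandled rejection crashes the process by default):

```javascript
// Catch Promise rejections that no .catch() or awaited try/catch handles.
// Logging here keeps the process alive while you track down the missing
// error handling at the call site.
process.on('unhandledRejection', (reason, promise) => {
    console.error('unhandledRejection:', reason);
});
```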

How high are your request counts? Any chance you're hitting the rate limit?

bluesign avatar Sep 15 '22 17:09 bluesign

I was trying to run 40 minting transactions in parallel. According to the docs, the limit for sending transactions is 2000/sec. LMK if I misinterpreted that. We have been doing batches of 40 for a long time, and just started seeing issues last week. Is there a technical reason for the way errors are handled (or not)? I'm curious why fcl/sdk doesn't catch exceptions and/or return a more descriptive error.
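The parallel batches described above can also be throttled client-side to stay well under any rate limit. A minimal concurrency-limiter sketch (the jobs here are placeholders for actual FCL transaction sends; the limit value is an assumption):

```javascript
// Run async jobs with at most `limit` in flight at once. Workers pull
// the next job index synchronously, so no two workers take the same job.
async function runWithLimit(jobs, limit) {
  const results = [];
  let next = 0;
  async function worker() {
    while (next < jobs.length) {
      const i = next++;           // claim a job index
      results[i] = await jobs[i](); // e.g. () => fcl-based mint call
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, jobs.length) },
    worker
  );
  await Promise.all(workers);
  return results;
}
```

Instead of firing all 40 mints at once, `runWithLimit(mintJobs, 5)` would keep only 5 transactions in flight at a time.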

gta14 avatar Sep 15 '22 18:09 gta14

Yeah, 40 shouldn't be a problem, actually. Maybe you can try updating FCL. I remember some discussion about exception problems before, but it was mostly about the HTTP API part.

bluesign avatar Sep 15 '22 18:09 bluesign

Ok, I can try that. But it was working with this same version until recently. I will experiment with uncaughtException but does anyone know why fcl doesn't return the error message from the GRPC call, if any? If we could get a message like "rate limit exceeded" or "sequence number mismatch" it would be a big help in debugging things like this.

gta14 avatar Sep 15 '22 19:09 gta14

@gta14 I checked now; it seems error handling was fixed in @onflow/[email protected] (not sure if it fixes this case). If you update FCL, it will probably be OK.

bluesign avatar Sep 15 '22 19:09 bluesign

Thanks @bluesign The FCL upgrade is an option, but it would require a full regression of our entire system, which is not great for us timing-wise right now. But I will try it. Is there anyone who can help identify the root cause of this error? Could it be a problem with testnet access node(s)?

gta14 avatar Sep 15 '22 20:09 gta14

Hi @gta14 , could you try limiting your requests to 5-10/sec or lower? This will help us in debugging this issue to see if this is an issue due to rate limiting.

JeffreyDoyle avatar Sep 15 '22 21:09 JeffreyDoyle

@JeffreyDoyle I tried 1 at a time yesterday and the errors cleared. I think I also tried 10 and got the error. But let me repeat and see if I can find a specific number. I'd really like to understand why fcl can't handle this error, if anyone can comment on that :-). My fcl calls are in a try/catch block but nothing is ever caught, it just crashes. Maybe I'm doing something wrong, just looking to understand what is going on.

gta14 avatar Sep 15 '22 21:09 gta14

@JeffreyDoyle We tried minting 5 at a time and still saw some errors, again at Object.onEnd (/usr/src/portal-api/node_modules/@onflow/transport-grpc/dist/sdk-send-grpc.js:24).

gta14 avatar Sep 15 '22 22:09 gta14

@JeffreyDoyle I was able to capture this error. It suggests that aggressive rate limiting is being applied. Is this something that can be corrected via configuration on your end? Not sure how Cloudflare is used here, but I do see ATL in the response, and most of us are in Atlanta ("cf-ray": ["74b9dd2c0f03ad31-ATL"])

{
	"status": 8,
	"statusMessage": "",
	"headers": {
		"headersMap": {
			"date": ["Fri, 16 Sep 2022 13:14:13 GMT"],
			"content-length": ["0"],
			"connection": ["close"],
			"x-envoy-ratelimited": ["true"],
			"x-ratelimit-limit": ["1, 1;w=1"],
			"x-ratelimit-remaining": ["0"],
			"x-ratelimit-reset": ["1"],
			"x-request-id": ["5cd77cce577c1dfe357550c7c827be0c"],
			"via": ["1.1 google"],
			"cf-cache-status": ["DYNAMIC"],
			"strict-transport-security": ["max-age=15552000; includeSubDomains"],
			"x-content-type-options": ["nosniff"],
			"server": ["cloudflare"],
			"cf-ray": ["74b9dd2c0f03ad31-ATL"]
		}
	},
	"message": null,
	"trailers": {
		"headersMap": {
			"date": ["Fri, 16 Sep 2022 13:14:13 GMT"],
			"content-length": ["0"],
			"connection": ["close"],
			"x-envoy-ratelimited": ["true"],
			"x-ratelimit-limit": ["1, 1;w=1"],
			"x-ratelimit-remaining": ["0"],
			"x-ratelimit-reset": ["1"],
			"x-request-id": ["5cd77cce577c1dfe357550c7c827be0c"],
			"via": ["1.1 google"],
			"cf-cache-status": ["DYNAMIC"],
			"strict-transport-security": ["max-age=15552000; includeSubDomains"],
			"x-content-type-options": ["nosniff"],
			"server": ["cloudflare"],
			"cf-ray": ["74b9dd2c0f03ad31-ATL"]
		}
	}
}
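The captured response is informative: gRPC status code 8 is RESOURCE_EXHAUSTED, and the `x-envoy-ratelimited` / `x-ratelimit-reset` headers confirm a rate limit with a 1-second window. A sketch of how a client could detect this shape and compute a retry delay (the function name is illustrative; field access follows the response captured above):

```javascript
// Decide whether a failed gRPC-web response looks rate limited and, if
// so, how many milliseconds to wait before retrying. Returns null when
// the failure is not rate limiting.
function rateLimitBackoffMs(response) {
  const headers = (response.headers && response.headers.headersMap) || {};
  const limited =
    response.status === 8 || // gRPC RESOURCE_EXHAUSTED
    (headers['x-envoy-ratelimited'] || [])[0] === 'true';
  if (!limited) return null;
  // x-ratelimit-reset is seconds until the window resets.
  const resetSecs = parseInt((headers['x-ratelimit-reset'] || [])[0], 10);
  return Number.isFinite(resetSecs) ? resetSecs * 1000 : 1000; // default 1s
}
```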

gta14 avatar Sep 16 '22 13:09 gta14

Hi @gta14 , thank you for raising this! We're actively looking into this and will report back here when we have an update.

JeffreyDoyle avatar Sep 16 '22 18:09 JeffreyDoyle

@JeffreyDoyle The Flow team fixed something and it is working now. Didn't get any details. @bluesign Just to confirm, you are saying that in the latest version of FCL this error can be caught by my app instead of crashing it? These type of crashes have always been a big problem for us and I would like to understand why it happens and the proper way to fix. You can see where I added the comment and console.log below.

var unary = function unary(host, method, request, context) {
  try {
    utilInvariant.invariant(context.config, "SDK GRPC Unary Error: context.config must be defined.");
    return Promise.resolve(context.config().get("grpc.metadata", {})).then(function (metadataFromConfig) {
      return new Promise(function (resolve, reject) {
        grpcWeb.grpc.unary(method, {
          request: request,
          host: host,
          metadata: new grpcWeb.grpc.Metadata(metadataFromConfig),
          onEnd: function onEnd(_ref) {
            var status = _ref.status,
                statusMessage = _ref.statusMessage,
                message = _ref.message;

            if (status === grpcWeb.grpc.Code.OK) {
              resolve(message);
            } else {
              // CAN MY APP CATCH THIS?
              console.log(JSON.stringify(_ref));
              reject(new Error(statusMessage));
            }
          }
        });
      });
    });
  } catch (e) {
    return Promise.reject(e);
  }
};
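To answer the "CAN MY APP CATCH THIS?" question in terms of the code above: the `reject(new Error(statusMessage))` settles the Promise that the SDK call ultimately returns, so an application try/catch does catch it, but only if every call is awaited (or given a `.catch`). A fire-and-forget call instead becomes an unhandled rejection, which can crash the process. A sketch of the awaited pattern (`safeSend` and the wrapped function are illustrative, not FCL API):

```javascript
// Wrap any Promise-returning send so the rejection is always consumed.
// If the caller forgets the await, the Error(statusMessage) from the
// transport has nowhere to go and surfaces as an unhandled rejection.
async function safeSend(sendFn) {
  try {
    return { ok: true, value: await sendFn() };
  } catch (err) {
    // err is the Error(statusMessage) rejected by the transport.
    return { ok: false, error: err };
  }
}
```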

gta14 avatar Sep 16 '22 18:09 gta14

Hi @gta14, there was a devnet access node (AN) with a misconfigured rate limit, which is why you were occasionally seeing those low limits. It should be fixed now.

I noticed in the description that you were previously having issues using the REST endpoint. If/when you have cycles to try that again, it'd be awesome if you could open another issue to debug. We definitely want to make sure both are working as expected.

peterargue avatar Sep 16 '22 19:09 peterargue

@peterargue We are sticking with grpc for now, but if we change I will let you know if there are any issues. When I tried HTTP before we weren't doing anything special, just querying events from NFT contracts.

gta14 avatar Sep 19 '22 13:09 gta14