
Not sure why I'm seeing gRPC blocking the event loop so often

Open earonesty opened this issue 1 year ago • 10 comments

Problem description

I tried turning on event loop block detection, and all I see are gRPC calls (we use Google Pub/Sub and other gRPC-based tools). This doesn't make sense to me, since I thought the system should yield during the call.

Reproduction steps

Use gRPC on a busy server (we use Google Secret Manager, Google Pub/Sub, etc.).

Environment

  • Linux Ubuntu 18.04 amd64
  • Node 18
  • nvm/yarn

Additional context

Examples:

7: "    at /app/node_modules/@grpc/grpc-js/build/src/resolving-call.js:192:22"
8: "    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)"
message: "Event loop blocked for 600.4265230000019ms, operation started here:"
        "4": "    at Subchannel.startConnecting (/app/node_modules/@grpc/grpc-js/build/src/subchannel.js:278:17)",
        "5": "    at ChannelSubchannelWrapper.startConnecting (/app/node_modules/@grpc/grpc-js/build/src/subchannel-interface.js:34:20)",
        "6": "    at PickFirstLoadBalancer.exitIdle (/app/node_modules/@grpc/grpc-js/build/src/load-balancer-pick-first.js:341:24)",
        "7": "    at ChildLoadBalancerHandler.exitIdle (/app/node_modules/@grpc/grpc-js/build/src/load-balancer-child-handler.js:112:31)",
        "8": "    at ResolvingLoadBalancer.exitIdle (/app/node_modules/@grpc/grpc-js/build/src/resolving-load-balancer.js:235:32)",
        "9": "    at InternalChannel.getConfig (/app/node_modules/@grpc/grpc-js/build/src/internal-channel.js:313:36)",
        "10": "    at ResolvingCall.getConfig (/app/node_modules/@grpc/grpc-js/build/src/resolving-call.js:121:43)",
        "11": "    at /app/node_modules/@grpc/grpc-js/build/src/internal-channel.js:200:26",
        "12": "    at process.processTicksAndRejections (node:internal/process/task_queues:77:11)",
        "message": "Event loop blocked for 99.32906599998473ms, operation started here:"

Basically, most of my event loop is blocked on gRPC stuff. Is there some configuration or setting to make it yield properly?

earonesty avatar May 21 '24 01:05 earonesty

This is a common one... blocking for 600ms:

        "1": "    at TLSWrap.emitInitNative (node:internal/async_hooks:200:43)",
        "2": "    at TLSSocket._wrapHandle (node:_tls_wrap:696:24)",
        "3": "    at new TLSSocket (node:_tls_wrap:577:18)",
        "4": "    at Object.connect (node:_tls_wrap:1752:19)",
        "5": "    at Object.connect (node:internal/http2/core:3298:22)",
        "6": "    at /app/node_modules/@grpc/grpc-js/build/src/transport.js:504:35",
        "7": "    at new Promise (<anonymous>)",
        "8": "    at Http2SubchannelConnector.createSession (/app/node_modules/@grpc/grpc-js/build/src/transport.js:417:16)",
        "9": "    at /app/node_modules/@grpc/grpc-js/build/src/transport.js:557:112",
        "10": "    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)",
        "message": "Event loop blocked for 699.5128900003433ms, operation started here:"

earonesty avatar May 21 '24 01:05 earonesty

That last example shows the creation of a TLS socket. It's probably doing some encryption work, which takes time. This should only happen infrequently: one of the basic design principles of gRPC is that you establish a connection once and then use it to make several requests, so the time taken to establish the connection is amortized over all of the requests.
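
For illustration, a minimal sketch of that reuse pattern with @grpc/grpc-js and @grpc/proto-loader (the echo.proto file, service, and method names are made up; only the create-once / reuse pattern matters):

const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

// Hypothetical echo.proto with an Echo service; only the reuse pattern matters.
const packageDefinition = protoLoader.loadSync('echo.proto');
const proto = grpc.loadPackageDefinition(packageDefinition);

// Create ONE client per process and keep it around. The TCP + TLS + HTTP/2
// setup (the work visible in the traces above) happens once and is amortized
// over every request made on this channel.
const client = new proto.echo.Echo('example.googleapis.com:443',
                                   grpc.credentials.createSsl());

function handleRequest(request, done) {
  // Reuses the existing channel; no new handshake.
  client.unaryEcho(request, done);
}

// Anti-pattern: a new client per request creates a new channel, so every
// request pays the connection-establishment cost shown in the traces again.
function handleRequestBadly(request, done) {
  const throwaway = new proto.echo.Echo('example.googleapis.com:443',
                                        grpc.credentials.createSsl());
  throwaway.unaryEcho(request, done);
}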

The other examples seem to have incomplete stack traces, so I don't think I can evaluate what happened with them.

murgatroid99 avatar May 21 '24 17:05 murgatroid99

I can show the full traces. The TLS establishment one happens a lot: I have a 10-minute log from a single instance running on Google Cloud, and it happens 5000 times.

Here's a full trace: 400 non-I/O milliseconds attributed to "sendMessageOnChild".

      "data": [
        {
          "message": "Event loop blocked for 399.0506949999332ms, operation started here:"
        },
        [
          "    at AsyncHook.init (/app/node_modules/blocked-at/index.js:31:11)",
          "    at emitInitNative (node:internal/async_hooks:200:43)",
          "    at emitInitScript (node:internal/async_hooks:503:3)",
          "    at promiseInitHook (node:internal/async_hooks:322:3)",
          "    at promiseInitHookWithDestroyTracking (node:internal/async_hooks:326:3)",
          "    at Promise.then (<anonymous>)",
          "    at ResolvingCall.sendMessageOnChild (/app/node_modules/@grpc/grpc-js/build/src/resolving-call.js:104:99)",
          "    at /app/node_modules/@grpc/grpc-js/build/src/resolving-call.js:192:22"
        ]

Heck, here's the full log:

https://text.is/W88JY

You can see almost all the block logs are "grpc doing stuff", and not always TLS. I'm not sure why TLS needs so much CPU with gRPC but not with our https:// REST requests, which never block.

earonesty avatar May 21 '24 17:05 earonesty

I don't entirely understand how to interpret these logs. It's not clear what exactly qualifies as "blocking" the event loop. If it's any time spent in JS code, why are these stack traces called out specifically? And what exactly in the stack trace is the blocking time attributed to?

Also, if you have 5000 instances of blocking 400-600ms in 10 minutes, that's over half an hour of blocking total. I assume that means that there are multiple separate client instances. How many times does each client experience this in a 10 minute period?

murgatroid99 avatar May 21 '24 21:05 murgatroid99

  1. This is using the blocked-at library. It shows JS stacks that are "blocking" the loop (not yielding); see the setup sketch after this list.
  2. Generally between 30 and 50 clients.
  3. This is async code; it seems to me the gRPC library should not be doing heavy computation or anything "in-band" (all I/O tasks should yield, not block). I suspect the issue is that some of the HTTPS cryptographic functions are not being offloaded to worker threads, but I'm not sure.
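
For reference, a minimal sketch of how blocked-at is wired up here (the threshold value is an assumption); a callback like this produces the "Event loop blocked for ...ms, operation started here:" messages in the logs above:

const blocked = require('blocked-at');

// Report any event-loop stall longer than the threshold (value assumed here),
// along with the stack of the async operation that was running at the time.
blocked((time, stack) => {
  console.log(`Event loop blocked for ${time}ms, operation started here:`, stack);
}, { threshold: 100 });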

earonesty avatar Apr 08 '25 18:04 earonesty

The blocked-at README says that the stack traces it outputs point to the top stack frame in an asynchronous operation that blocks the event loop. So that doesn't directly tell us where the blocking actually happens; the actual work can be deep in the call stack below the indicated function. For example, the previously mentioned function ResolvingCall.sendMessageOnChild does very little computation on its own, but it passes the message down a call stack to eventually call http2Stream.write, and at that point, I don't control what happens.
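
A contrived sketch of that attribution behavior (the busy-wait just stands in for real synchronous work): blocked-at points at the line where the async operation was created, even though the blocking happens later, inside the callback:

const blocked = require('blocked-at');

blocked((time, stack) => {
  console.log(`Blocked for ${time}ms, operation started here:`, stack);
}, { threshold: 100 });

// blocked-at's trace points at where the async operation was *created*...
setTimeout(() => {
  // ...but the actual blocking happens here, in synchronous work inside the
  // callback, potentially many frames away from the reported location.
  const end = Date.now() + 500;
  while (Date.now() < end) { /* busy-wait standing in for real CPU work */ }
}, 0);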

I agree that the grpc library should not be doing any heavy computation in-band, and in fact I don't know of any case where it does that. The grpc library does not directly call any cryptographic functions, and it doesn't control how the http2 module interacts with the tls module.

I do find it weird that almost all of the reported blocking durations are almost exactly a clean multiple of 100ms. I can accept that it may be an artifact of how the blocked-at library works, but it makes me wonder what exactly it is measuring.
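
One way to cross-check those numbers independently of blocked-at would be Node's built-in event-loop-delay histogram from perf_hooks; a minimal sketch (the resolution and reporting interval are arbitrary):

const { monitorEventLoopDelay } = require('perf_hooks');

// Samples event-loop delay at the given resolution (milliseconds).
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

// Periodically report how delayed the loop actually was, independent of how
// blocked-at attributes the delay to particular async operations.
// Histogram values are in nanoseconds, so divide by 1e6 to get milliseconds.
setInterval(() => {
  console.log('event loop delay p50/p99/max (ms):',
    histogram.percentile(50) / 1e6,
    histogram.percentile(99) / 1e6,
    histogram.max / 1e6);
  histogram.reset();
}, 10000);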

murgatroid99 avatar Apr 08 '25 20:04 murgatroid99

90% of my loop-blocked reports are now about 800-1000ms, and they look like this:

6: "    at Http2SubchannelConnector.connect (/app/node_modules/@grpc/grpc-js/build/src/transport.js:643:92)"
7: "    at Subchannel.startConnectingInternal (/app/node_modules/@grpc/grpc-js/build/src/subchannel.js:154:14)"
8: "    at Subchannel.transitionToState (/app/node_modules/@grpc/grpc-js/build/src/subchannel.js:204:22)"
9: "    at /app/node_modules/@grpc/grpc-js/build/src/subchannel.js:295:23"
10: "    at process.processTicksAndRejections (node:internal/process/task_queues:77:11)

The specific line of code is this (from Http2SubchannelConnector.connect in transport.js, per the trace above):

return (0, http_proxy_1.getProxiedConnection)(address, options, connectionOptions)
  .then(result => this.createSession(address, credentials, options, result));

I suspect there's some synchronous DNS resolution or cryptography going on that I wish were happening asynchronously or in a worker thread.

earonesty avatar Apr 10 '25 23:04 earonesty

Neither getProxiedConnection nor createSession does any substantial synchronous work; they just call asynchronous Node APIs for establishing connections. There shouldn't be any substantial DNS resolution work happening, because those functions only connect to IP addresses, and DNS resolution is asynchronous in Node anyway.

murgatroid99 avatar Apr 11 '25 16:04 murgatroid99

Hi @murgatroid99, @earonesty, we are seeing the gRPC server blocking the event loop while decompressing the payload (see attached image).

tanveergill avatar May 10 '25 20:05 tanveergill

As I mentioned in https://github.com/grpc/grpc-node/issues/2753#issuecomment-2787538478, blocked-at shows the top-level frame in an asynchronous operation that blocks the event loop. The method decompressAndMaybePush decompresses asynchronously, but moves on synchronously if compression is not configured. Then it deserializes the message and passes it along to the message handler. If this is a unary method, that in turn calls the configured request handler for that method. So, the most likely thing to be blocking the event loop here is your request handler function.
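
A contrived sketch of that failure mode, with made-up service and helper names: the handler runs synchronously on the event loop, so the stall gets attributed upstream to the gRPC code that invoked it; offloading the heavy work (for example to a worker thread) keeps the loop free:

const grpc = require('@grpc/grpc-js');
const { Worker } = require('worker_threads');

// Hypothetical unary handler registered with server.addService(...).
// Everything in here runs on the event loop: if processRows is CPU-heavy and
// synchronous, the loop is blocked for its whole duration, and a tool like
// blocked-at attributes the stall upstream (e.g. to decompressAndMaybePush).
function listItems(call, callback) {
  const rows = processRows(call.request); // synchronous, CPU-heavy (assumed)
  callback(null, { rows });
}

// One mitigation sketch: move the CPU-heavy part onto a worker thread so the
// event loop stays free to service other gRPC calls.
function listItemsOffloaded(call, callback) {
  const worker = new Worker('./process-rows-worker.js', // hypothetical worker script
                            { workerData: call.request });
  worker.once('message', rows => callback(null, { rows }));
  worker.once('error', err => callback({ code: grpc.status.INTERNAL, details: err.message }));
}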

murgatroid99 avatar May 12 '25 20:05 murgatroid99