not sure why i'm seeing grpc blocking the event loop so often
Problem description
i turned on event loop block detection, and almost everything it reports points at grpc calls (we use google pubsub and other grpc-based tools). that doesn't make sense to me, since i thought the system should yield during those calls
Reproduction steps
Use grpc on a busy server (we use google secrets manager, google pubsub, etc.)
Environment
- Linux Ubuntu 18.04 amd64
- Node 18
- nvm/yarn
Additional context
Examples:
7: " at /app/node_modules/@grpc/grpc-js/build/src/resolving-call.js:192:22"
8: " at process.processTicksAndRejections (node:internal/process/task_queues:95:5)"
message: "Event loop blocked for 600.4265230000019ms, operation started here:"
"4": " at Subchannel.startConnecting (/app/node_modules/@grpc/grpc-js/build/src/subchannel.js:278:17)",
"5": " at ChannelSubchannelWrapper.startConnecting (/app/node_modules/@grpc/grpc-js/build/src/subchannel-interface.js:34:20)",
"6": " at PickFirstLoadBalancer.exitIdle (/app/node_modules/@grpc/grpc-js/build/src/load-balancer-pick-first.js:341:24)",
"7": " at ChildLoadBalancerHandler.exitIdle (/app/node_modules/@grpc/grpc-js/build/src/load-balancer-child-handler.js:112:31)",
"8": " at ResolvingLoadBalancer.exitIdle (/app/node_modules/@grpc/grpc-js/build/src/resolving-load-balancer.js:235:32)",
"9": " at InternalChannel.getConfig (/app/node_modules/@grpc/grpc-js/build/src/internal-channel.js:313:36)",
"10": " at ResolvingCall.getConfig (/app/node_modules/@grpc/grpc-js/build/src/resolving-call.js:121:43)",
"11": " at /app/node_modules/@grpc/grpc-js/build/src/internal-channel.js:200:26",
"12": " at process.processTicksAndRejections (node:internal/process/task_queues:77:11)",
"message": "Event loop blocked for 99.32906599998473ms, operation started here:"
basically most of my event loop is blocked on grpc stuff. is there some configuration or setting to make it properly yield?
this is a common one... blocking for 600ms
"1": " at TLSWrap.emitInitNative (node:internal/async_hooks:200:43)",
"2": " at TLSSocket._wrapHandle (node:_tls_wrap:696:24)",
"3": " at new TLSSocket (node:_tls_wrap:577:18)",
"4": " at Object.connect (node:_tls_wrap:1752:19)",
"5": " at Object.connect (node:internal/http2/core:3298:22)",
"6": " at /app/node_modules/@grpc/grpc-js/build/src/transport.js:504:35",
"7": " at new Promise (<anonymous>)",
"8": " at Http2SubchannelConnector.createSession (/app/node_modules/@grpc/grpc-js/build/src/transport.js:417:16)",
"9": " at /app/node_modules/@grpc/grpc-js/build/src/transport.js:557:112",
"10": " at process.processTicksAndRejections (node:internal/process/task_queues:95:5)",
"message": "Event loop blocked for 699.5128900003433ms, operation started here:"
That last example shows the creation of a TLS socket. It's probably doing some encryption work, which takes time. This should only happen infrequently: one of the basic design principles of gRPC is that you establish a connection once and then use it to make several requests, so the time taken to establish the connection is amortized over all of the requests.
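To make the amortization concrete, here is a minimal sketch of the intended usage pattern (using @google-cloud/pubsub since you mentioned it; the topic name is made up and the exact publish method depends on your library version): create one client up front and reuse it for every request, so only the first call pays the connection and TLS handshake cost.

// Create the client once, at module scope; its underlying gRPC channel
// (and TLS session) is established once and then reused.
const { PubSub } = require('@google-cloud/pubsub');
const pubsub = new PubSub();

// Reuse the same client for every publish; only the first call should pay
// the connection/handshake cost.
async function publishEvent(payload) {
  return pubsub.topic('my-topic').publishMessage({ data: Buffer.from(JSON.stringify(payload)) });
}

// Anti-pattern: constructing `new PubSub()` inside a request handler creates
// a fresh channel each time, so every request can trigger a new TLS handshake.

If connection establishment shows up frequently in the block logs, it is worth checking whether clients or channels are being recreated on a per-request basis.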
The other examples seem to have incomplete stack traces, so I don't think I can evaluate what happened with them.
i can show the full traces. the TLS establishment one happens a lot. i have a 10 minute log with a single instance running on google cloud. it happens 5000 times.
here's a full trace: 400 non-i/o milliseconds attributed to "sendMessageOnChild".
"data": [
{
"message": "Event loop blocked for 399.0506949999332ms, operation started here:"
},
[
" at AsyncHook.init (/app/node_modules/blocked-at/index.js:31:11)",
" at emitInitNative (node:internal/async_hooks:200:43)",
" at emitInitScript (node:internal/async_hooks:503:3)",
" at promiseInitHook (node:internal/async_hooks:322:3)",
" at promiseInitHookWithDestroyTracking (node:internal/async_hooks:326:3)",
" at Promise.then (<anonymous>)",
" at ResolvingCall.sendMessageOnChild (/app/node_modules/@grpc/grpc-js/build/src/resolving-call.js:104:99)",
" at /app/node_modules/@grpc/grpc-js/build/src/resolving-call.js:192:22"
]
heck here's the full log:
https://text.is/W88JY
you can see almost all the block-logs are "grpc doing stuff". not always TLS. not sure why TLS needs so much CPU with grpc, but not with our https:// rest requests - which never block.
I don't entirely understand how to interpret these logs. It's not clear what exactly qualifies as "blocking" the event loop. If it's any time spent in JS code, why are these stack traces called out specifically? And what exactly in the stack trace is the blocking time attributed to?
Also, if you have 5000 instances of blocking 400-600ms in 10 minutes, that's over half an hour of blocking total. I assume that means that there are multiple separate client instances. How many times does each client experience this in a 10 minute period?
- this is using the blocked-at library. it shows the js stacks that are "blocking" the loop (not yielding); see the sketch after this list for roughly how we wire it up
- generally between 30 and 50 clients
- this is async code, so it seems to me the grpc library should not be doing heavy computation or anything "in-band" (all i/o tasks should yield, not block). i suspect some of the https cryptographic functions are not being offloaded to worker threads, but i'm not sure.
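fwiw, here's roughly how we wire up blocked-at (a minimal sketch; the threshold and logging are placeholders, not our exact config):

const blocked = require('blocked-at');

// report any async operation whose callback keeps the event loop busy longer
// than the threshold (ms). the stack handed to the callback is the one
// captured when the async operation was created, not where the time is spent.
blocked((time, stack) => {
  console.log(`Event loop blocked for ${time}ms, operation started here:`, stack);
}, { threshold: 100 });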
The blocked-at README says that the stack traces it outputs point to the top stack frame in an asynchronous operation that blocks the event loop. So that doesn't directly tell us where the blocking actually happens, and it can be deep in the call stack from that indicated function. For example, the previously mentioned function ResolvingCall.sendMessageOnChild does very little computation on its own, but it passes the message down a call stack to eventually call http2Stream.write, and at that point, I don't control what happens.
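As a contrived illustration of that attribution (not grpc code, just a sketch): blocked-at reports the stack captured where the asynchronous continuation was created, even if all of the CPU time is spent somewhere else entirely.

const blocked = require('blocked-at');
blocked((time, stack) => console.log(`Blocked for ${time}ms`, stack), { threshold: 100 });

function startWork() {
  // blocked-at captures this stack when the promise continuation is created...
  Promise.resolve().then(() => {
    // ...but the actual blocking happens here, or arbitrarily deep in whatever
    // this callback calls: half a second of synchronous work.
    const end = Date.now() + 500;
    while (Date.now() < end) { /* busy loop */ }
  });
}

startWork();

The reported stack would look much like your sendMessageOnChild trace: the promiseInitHook frames plus the function that created the promise, with no indication of where the time was actually spent.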
I agree that the grpc library should not be doing any heavy computation in-band, and in fact I don't know of any case where it does that. The grpc library does not directly call any cryptographic functions, and it doesn't control how the http2 module interacts with the tls module.
I do find it weird that almost all of the reported blocking durations are almost exactly a clean multiple of 100ms. I can accept that it may be an artifact of how the blocked-at library works, but it makes me wonder what exactly it is measuring.
90% of my loop-blocked reports are now about 800-1000ms, and look like this
6: " at Http2SubchannelConnector.connect (/app/node_modules/@grpc/grpc-js/build/src/transport.js:643:92)"
7: " at Subchannel.startConnectingInternal (/app/node_modules/@grpc/grpc-js/build/src/subchannel.js:154:14)"
8: " at Subchannel.transitionToState (/app/node_modules/@grpc/grpc-js/build/src/subchannel.js:204:22)"
9: " at /app/node_modules/@grpc/grpc-js/build/src/subchannel.js:295:23"
10: " at process.processTicksAndRejections (node:internal/process/task_queues:77:11)
the specific line of code is this
return (0, http_proxy_1.getProxiedConnection)(address, options, connectionOptions)
.then(result => this.createSession(address, credentials, options, result));
i suspect there's some sync dns resolution or cryptography going on that i wish were happening async or in a worker thread.
Neither getProxiedConnection nor createSession does any substantial synchronous work, they just call asynchronous Node APIs for establishing connections. There shouldn't be any substantial DNS resolution work happening, because those functions only connect to IP addresses, and DNS resolution is asynchronous in Node anyway.
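For reference, a minimal sketch of what "asynchronous DNS" means here (the hostname is illustrative): both of Node's resolver APIs hand the work off without occupying the JS thread, dns.lookup() via the libuv thread pool and dns.resolve*() via c-ares.

const dns = require('node:dns/promises');

// resolves in the background; the event loop stays free while waiting
async function resolveTarget() {
  const addresses = await dns.resolve4('pubsub.googleapis.com');
  return addresses;
}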
Hi @murgatroid99, @earonesty,
We are seeing the grpc server blocking the event loop while decompressing the payload.
As I mentioned in https://github.com/grpc/grpc-node/issues/2753#issuecomment-2787538478, blocked-at shows the top-level frame in an asynchronous operation that blocks the event loop. The method decompressAndMaybePush decompresses asynchronously, but moves on synchronously if compression is not configured. Then it deserializes the message and passes it along to the message handler. If this is a unary method, this then calls the configured request handler for that method. So, the most likely thing to be blocking the event loop here is your request handler function.
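If that is what's happening here, a sketch of the difference (handler names and the worker file are hypothetical):

// If the registered handler does heavy synchronous work, the event loop is
// blocked for its whole duration, and blocked-at attributes that time to the
// grpc-js frames that dispatched the message to the handler.
function unaryHandler(call, callback) {
  const result = heavySyncWork(call.request); // CPU-bound, blocks the loop
  callback(null, result);
}

// One way to keep the loop responsive: move the CPU-bound part to a worker
// thread and invoke the callback when it finishes.
const { Worker } = require('node:worker_threads');

function offloadedUnaryHandler(call, callback) {
  const worker = new Worker('./heavy-work.js', { workerData: call.request });
  worker.once('message', (result) => callback(null, result));
  worker.once('error', (err) => callback(err));
}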