grpc-node
Critical performance bottleneck with high-throughput GRPC stream handling
Problem description
Our application uses this module in a Node.js backend to process events delivered over a gRPC stream. The volume of these events is substantial: frequently several thousand per second, potentially escalating to 10-15k events per second during peak periods.
Each event carries a timestamp indicating when it was produced by the gRPC server. During our evaluations, we've identified a significant performance limitation with this module: it struggles to process more than approximately 250 events per second. Moreover, a noticeable delay builds up quickly. Logs comparing the current time against the event timestamps show events being processed when they are already significantly outdated, sometimes by several minutes.
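For reference, the lag check reads roughly like this. Note that `ts` is a hypothetical field name and a plain `EventEmitter` stands in for the real stream, since the actual protocol definitions are private:

```typescript
import { EventEmitter } from "events";

// Synthetic stand-in for the real gRPC stream; `ts` is a hypothetical
// millisecond-timestamp field (the real field name is in the private protos).
const stream = new EventEmitter();

let stale = 0;
stream.on("data", (event: { ts: number }) => {
  const lagMs = Date.now() - event.ts;
  if (lagMs > 60_000) stale++; // flag events more than a minute behind
});

stream.emit("data", { ts: Date.now() });              // fresh event
stream.emit("data", { ts: Date.now() - 5 * 60_000 }); // five minutes old
console.log("stale events:", stale); // → stale events: 1
```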
This performance shortfall renders the task of managing such a high-volume stream through a single Node.js process impractical. Fortunately, our infrastructure includes powerful machines equipped with over 150 vcores and substantial RAM, enabling us to consider distributing the workload across multiple "consumer" sub-processes (be it through `child_process`, `worker_threads`, `cluster.fork()`, etc.) in a round-robin configuration.
Node.js's introduction of worker threads and the `cluster` module was a strategic enhancement to address such challenges, facilitating parallel request handling and optimizing multi-core processing capabilities. Given Node.js's proven capability to handle upwards of 20k transactions per second in benchmarks with frameworks like Express and Koa, it stands to reason that this scenario should be well within Node's operational domain.
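As a rough sketch of the fan-out idea we have in mind (not a confirmed working solution for this module), a pool of `worker_threads` could receive events round-robin from the single thread that owns the stream. A synthetic emitter stands in for the real gRPC stream here:

```typescript
import { Worker } from "worker_threads";
import { EventEmitter } from "events";

// Worker body (plain JS so it can be loaded via `eval: true`); it counts
// events and reports its count back when asked.
const WORKER_SOURCE = `
const { parentPort } = require("worker_threads");
let processed = 0;
parentPort.on("message", (msg) => {
  if (msg === "report") parentPort.postMessage(processed);
  else processed++; // a real handler would decode and process the event here
});
`;

const POOL_SIZE = 4;
const workers = Array.from(
  { length: POOL_SIZE },
  () => new Worker(WORKER_SOURCE, { eval: true })
);

let next = 0;
function dispatch(event: unknown): void {
  workers[next].postMessage(event); // structured-clone copy to one worker only
  next = (next + 1) % POOL_SIZE;    // round-robin: no event is duplicated
}

// Synthetic stand-in for the real gRPC stream:
const stream = new EventEmitter();
stream.on("data", dispatch);
for (let i = 0; i < 1000; i++) stream.emit("data", { seq: i });

// Ask each worker how many events it saw, then shut the pool down.
Promise.all(
  workers.map(
    (w) =>
      new Promise<number>((resolve) => {
        w.once("message", resolve);
        w.postMessage("report");
      })
  )
).then((counts) => {
  const total = counts.reduce((a, b) => a + b, 0);
  console.log("per-worker counts:", counts.join(","), "total:", total);
  workers.forEach((w) => w.terminate());
});
```

Note the stream itself is still consumed by one thread; only the per-event processing is spread across the pool.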
However, it appears this module lacks support for such a distributed processing approach.
Inquiry
What is the optimal strategy for leveraging this module to handle thousands of events per second efficiently? Is there a method to employ Node.js's native cluster module to distribute the processing of these event transactions across multiple clustered instances in a round-robin manner, without duplicating events between processes?
Reproduction steps
- Access a high-throughput GRPC stream.
- Attempt to process this high-volume stream.
- Observe significant delays in event processing; events are not processed in a timely manner.
Code used to test throughput:
```typescript
const client = new XClient("endpoint", ChannelCredentials.createInsecure())
const stream = client.SubscribeXUpdates(new SubscribeXUpdatesRequest())

let start: number | undefined
let n = 0

stream.on("data", () => {
  if (!start) start = Date.now()
  n++
})

stream.on("end", () => {
  console.error("Stream ended")
  process.exit(1)
})

stream.on("error", (error) => {
  console.error("Stream error", error)
  process.exit(1)
})

setTimeout(() => {
  if (!start) {
    console.error("No data received from the stream.")
    process.exit(1)
  }
  const msElapsed = Date.now() - start
  const sElapsed = msElapsed / 1000
  const rate = n / sElapsed
  console.log("intake rate:", rate, "transactions/sec")
  process.exit(0)
}, 30 * 1000)
```
Results after 30 seconds:
```
intake rate: 223.4826808496314 transactions/sec
```
While we acknowledge the challenge in replicating this specific scenario due to our event provider's closed-source nature, we can offer private access to our GRPC server endpoint and our protocol definitions for deeper investigation. Unfortunately, our ability to share further details is limited under these circumstances.
Environment
- Operating System: Windows 11 (16c/32t, 64GB RAM)
- Node.js Version: 20.10.1
- Also appears on Ubuntu 22 server (160 cores, 350GB RAM), with same node version.
- Docker is not being used.
Additional context
If this module is inherently incapable of handling such high throughput, we suggest adding a disclaimer to the documentation so that users with similar requirements are forewarned and can avoid running into the same issues.