js-libp2p-gossipsub
js-libp2p-gossipsub copied to clipboard
heartbeat performance issue
Motivation
In a contabo node of 50 peers, heartbeat function may take up to 800ms - 1s which block the event loop
part of https://github.com/ChainSafe/lodestar/issues/4002
Description
- Write performance test for
heartbeat()function - Improve
heartbeat()performance - If it takes too much time, yield to event loop every x milli seconds
@tuyennhv The resolution of the buckets in gossipsub_heartbeat_duration_seconds_bucket is really poor: [0.01, 0.1, 1]. So the quantile function can show widely inaccurate results. We should improve the bucket values and check this metric again. Since heartbeats are about 1s and we care about not looping around better bucket values would be
[0.01, 0.05, 0.1, 0.2, 0.4, 0.7, 1]
As gossipsub_heartbeat_duration_seconds_bucket doesn't have any labels and is called infrequently, more buckets are cheap

This chart is also a bit worrying: heartbeat should not be skipped, and should run 1/0.7 = 1.42 times / sec on average. Smooth chart below

The only hard data we have now is how many heartbeats take > 100ms to complete and how many take > 1000ms to complete. However you can see that many take > 100ms but almost 0 take > 1000 ms. But we don't really now the distribution between 100 and 1000 at all :shrug:

with new buckets [0.01, 0.05, 0.1, 0.2, 0.4, 0.7, 1], the distribution in a contabo-S (with rate interval as 5m) looks like
was able to catch a 775ms heartbeat, it shows that most of the time of heartbeat() function is due to emitGossip() call

0522_contabo_20_subscribe_all_subnets.cpuprofile.zip
right now we call emitGossip() 2 times in heartbeat() inside the fanout loop and mesh loop, and we may gossip messages to a peer twice in the same heartbeat
Update: metrics show that fanout map's size is consistently, however we still need to improve the logic
Why is emitGossip so costly? This needs benchmark testing. emitGossip is taking 2% of total CPU time

most of the cost is for looping through topic peers
and for a lodestar node connecting to all attestation subnets, there are about 70 topics in total => we call emitGossip() ~70 times per heartbeat
for the record, runHeartBeat() takes 3.6% of cpu time on a lodestar's subscribeAllSubnets node, should take another profile when all of the improvements are merged
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
need to see if this still happens in the latest release (4.0.0 as of Aug 2022)
Heartbeat duration is improved significantly (v0.14.0 vs v4.1.1)
This metric is incredible in v4.1.1