js-libp2p-gossipsub icon indicating copy to clipboard operation
js-libp2p-gossipsub copied to clipboard

heartbeat performance issue

Open twoeths opened this issue 3 years ago • 11 comments
trafficstars

Motivation

In a contabo node of 50 peers, heartbeat function may take up to 800ms - 1s which block the event loop

Screen Shot 2022-05-27 at 16 59 20

part of https://github.com/ChainSafe/lodestar/issues/4002

Description

  • Write performance test for heartbeat() function
  • Improve heartbeat() performance
  • If it takes too much time, yield to event loop every x milli seconds

twoeths avatar May 27 '22 10:05 twoeths

@tuyennhv The resolution of the buckets in gossipsub_heartbeat_duration_seconds_bucket is really poor: [0.01, 0.1, 1]. So the quantile function can show widely inaccurate results. We should improve the bucket values and check this metric again. Since heartbeats are about 1s and we care about not looping around better bucket values would be

[0.01, 0.05, 0.1, 0.2, 0.4, 0.7, 1]

As gossipsub_heartbeat_duration_seconds_bucket doesn't have any labels and is called infrequently, more buckets are cheap

dapplion avatar May 28 '22 07:05 dapplion

Screenshot from 2022-05-28 14-12-48

This chart is also a bit worrying: heartbeat should not be skipped, and should run 1/0.7 = 1.42 times / sec on average. Smooth chart below

Screenshot from 2022-05-28 14-14-43

dapplion avatar May 28 '22 07:05 dapplion

The only hard data we have now is how many heartbeats take > 100ms to complete and how many take > 1000ms to complete. However you can see that many take > 100ms but almost 0 take > 1000 ms. But we don't really now the distribution between 100 and 1000 at all :shrug:

Screenshot from 2022-05-28 14-26-19 Screenshot from 2022-05-28 14-26-36

dapplion avatar May 28 '22 07:05 dapplion

with new buckets [0.01, 0.05, 0.1, 0.2, 0.4, 0.7, 1], the distribution in a contabo-S (with rate interval as 5m) looks like

Screen Shot 2022-06-01 at 15 36 48

twoeths avatar Jun 01 '22 08:06 twoeths

was able to catch a 775ms heartbeat, it shows that most of the time of heartbeat() function is due to emitGossip() call Screen Shot 2022-06-02 at 09 49 48

0522_contabo_20_subscribe_all_subnets.cpuprofile.zip

right now we call emitGossip() 2 times in heartbeat() inside the fanout loop and mesh loop, and we may gossip messages to a peer twice in the same heartbeat

Update: metrics show that fanout map's size is consistently, however we still need to improve the logic

twoeths avatar Jun 02 '22 03:06 twoeths

Why is emitGossip so costly? This needs benchmark testing. emitGossip is taking 2% of total CPU time

Screenshot from 2022-06-02 12-23-10

dapplion avatar Jun 02 '22 05:06 dapplion

most of the cost is for looping through topic peers

Screen Shot 2022-06-02 at 14 09 58

and for a lodestar node connecting to all attestation subnets, there are about 70 topics in total => we call emitGossip() ~70 times per heartbeat

twoeths avatar Jun 02 '22 07:06 twoeths

for the record, runHeartBeat() takes 3.6% of cpu time on a lodestar's subscribeAllSubnets node, should take another profile when all of the improvements are merged

Screen Shot 2022-06-15 at 10 21 05

0522_contabo_20_subscribe_all_subnets.cpuprofile.zip

twoeths avatar Jun 15 '22 03:06 twoeths

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 13 '22 14:08 stale[bot]

need to see if this still happens in the latest release (4.0.0 as of Aug 2022)

twoeths avatar Aug 14 '22 12:08 twoeths

Heartbeat duration is improved significantly (v0.14.0 vs v4.1.1)

Screen Shot 2022-08-25 at 10 17 56

twoeths avatar Aug 25 '22 03:08 twoeths

This metric is incredible in v4.1.1

Screen Shot 2022-09-28 at 09 50 12

twoeths avatar Sep 28 '22 02:09 twoeths