
Gossipsub parameter tuning

AgeManning opened this issue · 17 comments

Overview

The gossipsub mesh parameters were originally chosen for message propagation speed and redundancy (a large number of faulty nodes shouldn't impact the message propagation too much). However, now that our networks have advanced, become stable and have a significant number of nodes, it might be time to revisit the gossipsub parameters.

We (Lighthouse) have been looking into the bandwidth consumption of our nodes and are in the process of doing some more analysis (which we'll share when we have it) into ways to reduce it without degrading our current networks.

The primary source of excessive bandwidth consumption is our gossipsub network which has a large degree of amplification on all our topics.

Potentially Low-Hanging Fruit

I've been monitoring the number of duplicates we have been receiving on various topics as well as the number of IWANT messages we have been sending. Both of these are good measures of redundant bandwidth. The goal is to minimise these for each topic without degrading the network.

A node typically sends an IWANT message when it receives an IHAVE from one of its peers for a message before it receives that message from one of its mesh peers. This suggests the mesh is propagating slower than, or on the order of, a peer's heartbeat interval. Messages can be delayed on the mesh for a number of reasons: message validation, network size, mesh connectivity, etc. Each client has its own validation speed, but also its own peer count. A client that maintains more peers is more likely to receive IHAVEs from one of those peers before its mesh delivers the message (as the mesh size is bounded, regardless of peer count).
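
As a rough illustration of that decision (a minimal sketch, not the rust-libp2p gossipsub API; `MessageId` and the function name are made up):

```rust
use std::collections::HashSet;

type MessageId = String; // illustrative stand-in for the real message id type

/// On receiving an IHAVE advertisement from a gossip peer, we only IWANT the
/// ids we haven't already received via the mesh. If the mesh is keeping up,
/// this returns nothing and no IWANT traffic is generated.
fn ids_to_request(seen: &HashSet<MessageId>, advertised: &[MessageId]) -> Vec<MessageId> {
    advertised
        .iter()
        .filter(|id| !seen.contains(*id))
        .cloned()
        .collect()
}
```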

One suggestion for reducing IWANT/IHAVE traffic is to increase our gossipsub heartbeat_interval and/or decrease our D_lazy parameter, reducing the amount of gossiping each node does. These changes can, however, have an impact on message propagation latency if nodes are currently relying on the IWANT/IHAVE gossip system to get their messages.

Message Amplification

We have set D to be 8 in the specs which naturally gives us relatively high message amplification. On the block topic (the simplest to reason about), we see on average about 8 duplicates (as would naively be expected) and therefore a bandwidth amplification of about 8x. The other topics are less straightforward.

We can significantly reduce our nodes' bandwidth by reducing the mesh sizes. The impact of this at a network level is non-trivial. Because we flood-publish, a client's default peer count plays a large role in how many hops, and how fast, a message propagates, since we publish to every peer we know who is connected on that topic. Therefore, it may be possible to have higher peer counts and lower mesh sizes whilst maintaining a similar number of hops to reach the entire network.
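
To make the flood-publish vs. mesh-forward distinction concrete, a hedged sketch (illustrative names only; `topic_peers` and `mesh_peers` are assumed sets, not a real API):

```rust
use std::collections::HashSet;

type PeerId = u64; // illustrative stand-in for a real peer identity

/// Peers a message goes to when *we* are the publisher (flood publish):
/// everyone we know on the topic, so peer count drives first-hop fanout.
fn publish_targets(topic_peers: &HashSet<PeerId>) -> Vec<PeerId> {
    topic_peers.iter().copied().collect()
}

/// Peers a message goes to when we are merely *forwarding*: only the mesh
/// (minus whoever sent it to us), so mesh size drives ongoing amplification.
fn forward_targets(mesh_peers: &HashSet<PeerId>, from: PeerId) -> Vec<PeerId> {
    mesh_peers.iter().copied().filter(|p| *p != from).collect()
}
```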

In any case our mesh sizes appear to be larger than needed and we may want to lower them (incrementally on testnets).

Proposal

I think it would be useful to put some lower bounds on the gossipsub mesh topic parameters. For Lighthouse, we are considering making some of the parameters CLI-tunable, i.e. users with no bandwidth limitation can have a high degree of connectivity (which provides them with lower message latency), while users with lower bandwidth accept higher message latency. We were going to leave the default as is, so that most users will see no change to the current gossipsub settings.

However to do this safely, there needs to be some lower bound on the mesh sizes such that if all nodes choose the lowest setting, the network still has sufficient redundancy/amplification and message propagation speed.

Ultimately, I think we should try testing some parameters on testnets (slowly) before moving towards mainnet. The parameters can be lowered slowly and iterated on over time.

I know that a full simulation/analysis of the live network would be ideal before changing/justifying these parameters; however, I don't know of anyone that can/has modelled our current eth2 network and can definitively say how much effect adjusting these parameters across all nodes on the network will have. Our testnets also don't have the same topology as mainnet, so tests there may not reflect mainnet behaviour.

With that caveat aside, I'm suggesting we add some lower bounds to the gossipsub values, something like the following (a rough config sketch follows the list):

  • D : 5 (maybe even 4 as a lower bound)
  • D_low: 3
  • D_high: 6
  • D_lazy: 3 (or maybe 2)
  • heartbeat_interval : Maybe up to half a slot (clients can choose when they want to send gossip)
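
For concreteness, those floors as a plain config sketch (a hypothetical struct, not the actual gossipsub config type; the half-slot heartbeat assumes 12-second slots):

```rust
use std::time::Duration;

/// Proposed *lower bounds* only - clients keeping today's defaults
/// (e.g. D = 8, heartbeat_interval = 0.7s) would remain spec-compliant.
struct MeshFloor {
    d: usize,
    d_low: usize,
    d_high: usize,
    d_lazy: usize,
    max_heartbeat: Duration,
}

const PROPOSED_FLOOR: MeshFloor = MeshFloor {
    d: 5,                                  // maybe even 4
    d_low: 3,
    d_high: 6,
    d_lazy: 3,                             // or maybe 2
    max_heartbeat: Duration::from_secs(6), // up to half a 12s slot
};
```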

Note that these would be lower bounds: clients can maintain their current values and no work needs to be done. However, it opens the discussion and allows some implementations to reduce these values.

We may also want to modify our implementations such that we can specify these on a per-topic basis, if more analysis is done and better values can be found per-topic.

AgeManning avatar Oct 25 '21 00:10 AgeManning

Two thoughts on D settings:

  • too low a range between D and Dhigh might lead to difficulties joining the mesh because the ratio of "full" nodes in the mesh will go up - we will probably want to ensure a healthy range here (maybe something like Dhigh = max(user_setting, D + Dlow) - see the sketch after this list)
  • too low a Dlow and D affect not only latency but also increase the risk of forming disconnected islands
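
A tiny sketch of the suggested guard (hypothetical function name, reading `user_setting` as the operator-chosen D_high):

```rust
/// Keep enough headroom between D and D_high so new peers aren't constantly
/// refused by "full" meshes, per the first bullet above.
fn effective_d_high(user_setting: usize, d: usize, d_low: usize) -> usize {
    user_setting.max(d + d_low)
}
```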

arnetheduck avatar Oct 26 '21 07:10 arnetheduck

I'm very interested in tuning these params! Also in just getting better visibility so we can consider deeper beacon chain protocol or gossipsub changes to suit our needs.

Bandwidth curiosity

Do you have high-level estimates of the bandwidth each topic consumes? Primarily beacon_block vs a single attestation subnet.

I'm very curious about the ratio, to better decide where alternative optimization effort is best spent. E.g. I have a proposal for altering and enforcing attestation subnet subscriptions (to no longer have them scale with the number of attached validators), but better data will help us know where to push.

Per-topic parametrization

We may also want to modify our implementations such that we can specify these on a per-topic basis, if more analysis is done and better values can be found per-topic.

Parametrizing on a per-topic-type basis also makes sense to me. Our delivery requirements are not the same on attestation subnets as they are with beacon blocks. That said, I'd want to see which of our subnets are really the data-hogs before suggesting an upstream gossipsub change to parametrize per topic.

Amplification factor looks bad, right?

we see on average about 8 duplicates (as would naively be expected) and therefore a bandwidth amplification of about 8x

Does this mean that on average, each of my 8 stable peers on a topic are sending me the block? Rather than caches and forwarding providing deduplication? Unless I'm missing something this implies something pretty deeply wrong, right? You'd naively want < D amplification because sometimes a peer will send to you and sometimes you'll send to them but most of the time, a peer will not do both for a given message.

D_lazy: 3 (or maybe 2)

Dropping D_lazy pretty low intuitively makes sense, as I suspect that IWANT/IHAVE chatter is largely wasteful on our network (but I don't have the data to back it up).

That said, it is very important in the event of attacks, so getting rid of it altogether would not be wise. See the heartbeat discussion below, but keeping it on the order of 3 or 4 and greatly increasing the heartbeat interval seems like a good compromise.

Heartbeat -- for fast delivery or attack resilience?

heartbeat_interval : Maybe up to half a slot (clients can choose when they want to send gossip)

I would consider altering this in a more mechanized way, rather than suggesting an upper bound and having the network be mixed. If IWANT/IHAVE messages sent every 0.7s are largely wasted, it'd be good to ascertain that this is actually the case.

Note, we selected 0.7s after this PL v1.1 report. Recommendation copied here:

the Heartbeat setting of Gossipsub is currently at 1 sec. This is a reasonable setting which allows for responsiveness but avoids frantic changes in the protocol behaviour. Smaller values for the heartbeat will result in higher responsiveness (e.g. gossip is emitted at shorter time intervals), but at the expense of higher bandwidth requirements. The ETH2.0 community might want to consider reducing the heartbeat to 600-700 ms, especially with the 3-sec propagation deadline

The question seems to be -- is heartbeat gossip there to ensure fast delivery or to protect against attacks and edge cases?

The 0.7s recommendation was made to ensure fast delivery -- i.e. MUST propagate in < 3s. Whereas while this propagation time is very important in the normal case, the network is resilient in the event of temporary high latencies (chain growth will continue and the core consensus mechanism will under most scenarios continue to finalize). Thus if we instead use the heartbeat gossip to only help defend against attacks rather than ensure timely delivery, altering heartbeat to be on the order of a sub-slot (2-4s) could be totally reasonable.

Intuitively, I'd go for 3s to 4s rather than 6s, giving the network the chance to mend on the sub-slot sending boundaries.

Lower bounds vs changing the spec

Are you more interested in adding lower bounds to be more conservative about changing network topology? Or do you think there is value in some nodes having larger values than others?

My worry with just adding the lower bounds is that we leave a lot of network-wide savings on the table as most will just use the base defaults.

djrtwo avatar Oct 26 '21 21:10 djrtwo

Do you have high level estimates on the numbers on the bandwidth each topic consumes? Primarily beacon_block vs a single attestation subnet.

I don't have exact numbers, but will get these soon (awaiting a few more metric PRs to get better insights). Based on message count, beacon_block does (as you'd expect) around 8 messages per slot, while the aggregate channel is around 2k (with some tuning, around 1k). So there is a wide discrepancy in bandwidth. I'll get exact figures so as not to be hand-wavy with the others.

Parametrizing on a per-topic-type basis also makes sense to me. Our delivery requirements are not the same on attestation subnets as they are with beacon blocks. That said, I'd want to see which of our subnets are really the data-hogs before suggesting an upstream gossipsub change to parametrize per topic.

Agreed. Not sure if we'd have to change the gossipsub spec, as it's a more general change, but either way we can add it to our implementation relatively easily (in a backwards-compatible way).

Does this mean that on average, each of my 8 stable peers on a topic are sending me the block? Rather than caches and forwarding providing deduplication? Unless I'm missing something this implies something pretty deeply wrong, right? You'd naively want < D amplification because sometimes a peer will send to you and sometimes you'll send to them but most of the time, a peer will not do both for a given message.

As far as I can tell this is a natural consequence. I've tried to minimize this, but will have to adjust the gossipsub implementation. It mainly comes from triangle routing (I presume). Consider a mesh of 4 where all 4 nodes are connected to each other. Node 1 gets a message from outside and forwards it to 2, 3, 4. Node 2 then forwards to 3, 4; node 3 forwards to 2, 4; and node 4 forwards to 2, 3. In this example, node 3 has received the message 3 times. The duplicate ratio is also high because our mesh is typically larger than 8: our D_high is 12 and our mesh size on average is above 8 (from what I've seen).

I've modified the implementation to ensure none of our peers ever (knowingly) send us a duplicate, and this holds true, so I've deduced the triangle routing scenario is the main culprit for the large number of duplicates. As Jacek has mentioned, one modification we can make is to keep track of duplicates sent to us by our peers, so that we don't propagate to peers that have already sent us a duplicate. This could eliminate one of the extra forwards in the example I gave, i.e. node 4 may not forward to 2, 3 if it had registered the duplicate messages from 2, 3 earlier.
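
A tiny self-contained simulation of that fully-connected mesh of 4 (purely illustrative), counting the copies each node receives when everyone forwards to their whole mesh except the peer they heard it from:

```rust
use std::collections::HashMap;

fn main() {
    let nodes = [1u8, 2, 3, 4];
    let mut received: HashMap<u8, u32> = HashMap::new();

    // Node 1 gets the message from outside the mesh and forwards to 2, 3, 4.
    let first_hop: Vec<u8> = nodes.iter().copied().filter(|&n| n != 1).collect();
    for &n in &first_hop {
        *received.entry(n).or_default() += 1;
    }

    // Each first-hop receiver forwards to its whole mesh except itself and
    // the peer it heard the message from (node 1).
    for &forwarder in &first_hop {
        for &target in &nodes {
            if target != forwarder && target != 1 {
                *received.entry(target).or_default() += 1;
            }
        }
    }

    // Prints: each of nodes 2, 3, 4 received the message 3 times,
    // i.e. 2 duplicates each in a mesh of only 4.
    for &n in &first_hop {
        println!("node {} received {} copies", n, received[&n]);
    }
}
```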

It's not as bad on the aggregate topic channel where there are a lot more messages, but let me come back with some more concrete data.

The question seems to be -- is heartbeat gossip to ensure fast delivery or to protect against attacks and edgecases? The 0.7s recommendation was made to ensure fast delivery -- i.e. MUST propagate in < 3s. Whereas while this propagation time is very important in the normal case, the network is resilient in the event of temporary high latencies (chain growth will continue and the core consensus mechanism will under most scenarios continue to finalize). Thus if we instead use the heartbeat gossip to only help defend against attacks rather than ensure timely delivery, altering heartbeat to be on the order of a sub-slot (2-4s) could be totally reasonable.

The heartbeat does a number of things, mainly mesh maintenance and scoring. You're probably right that we don't want to make this too large, and should rather just reduce the amount of gossip via D_lazy. I don't know enough about how a larger value would affect the network dynamics, but intuitively, adding/dropping peers from meshes and emitting gossip every 0.7 seconds seems a bit fast for some of our topics, especially ones with large numbers of messages, where we might want to accept some delay and rely more on our mesh. A small heartbeat also adds bandwidth in control messages, as we're constantly GRAFTing and PRUNEing peers every 0.7 seconds for all our meshes (topics).
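
For context, a much-simplified sketch of what a heartbeat tick does per topic (illustrative types and names; real gossipsub also handles scoring, backoff, opportunistic grafting, etc.):

```rust
use std::collections::HashSet;

type PeerId = u64;       // illustrative
type MessageId = String; // illustrative

struct TopicState {
    mesh: HashSet<PeerId>,        // peers we have GRAFTed on this topic
    topic_peers: HashSet<PeerId>, // all known peers subscribed to the topic
    recent_ids: Vec<MessageId>,   // ids seen over the last few heartbeats
}

/// Simplified heartbeat: keep the mesh between D_low and D_high, and emit
/// IHAVE gossip for recent message ids to D_lazy peers outside the mesh.
/// Running this every 0.7s means mesh churn (GRAFT/PRUNE) and IHAVE gossip
/// can happen every 0.7s on every topic we're subscribed to.
fn heartbeat(state: &mut TopicState, d: usize, d_low: usize, d_high: usize, d_lazy: usize) {
    if state.mesh.len() < d_low {
        // GRAFT enough non-mesh topic peers to bring the mesh back up to D.
        let grafts: Vec<PeerId> = state
            .topic_peers
            .difference(&state.mesh)
            .copied()
            .take(d - state.mesh.len())
            .collect();
        state.mesh.extend(grafts); // a real impl sends GRAFT control messages here
    } else if state.mesh.len() > d_high {
        // PRUNE the mesh back down to D.
        let prunes: Vec<PeerId> =
            state.mesh.iter().copied().take(state.mesh.len() - d).collect();
        for p in prunes {
            state.mesh.remove(&p); // a real impl sends PRUNE control messages here
        }
    }

    // Gossip emission: advertise recent ids (IHAVE) to D_lazy non-mesh peers.
    let gossip_to: Vec<PeerId> = state
        .topic_peers
        .difference(&state.mesh)
        .copied()
        .take(d_lazy)
        .collect();
    let _ = (gossip_to, &state.recent_ids); // a real impl sends IHAVE{recent_ids} here
}
```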

Are you more interested in adding lower bounds to be more conservative about changing network topology? Or do you think there is value in some nodes having larger values than others? My worry with just adding the lower bounds is that we leave a lot of network-wide savings on the table as most will just use the base defaults.

My thinking was to make it configurable for two reasons.

  1. So I can test and monitor metrics on a single node, to see how it behaves. (Irrelevant to the current discussion)
  2. Users (if they want) can increase the values (which can only help the network) and will still be spec-compliant.

So I guess, maybe the lower bounds become the default, but people who want to (and don't have data restrictions) can chew more bandwidth, get better message latency and help the network. Not opposed to having fixed values either.

AgeManning avatar Oct 27 '21 01:10 AgeManning

As Jacek has mentioned, one modification we can make is to keep track of duplicates sent to us from our peers so that we don't propagate to peers that have sent us a duplicate

Meaning, I take information about past messages to avoid duplicate routing of future messages? I'd worry that, due to the mesh dynamically changing slot-to-slot, this could create gaps in natural delivery. I'd need to think about it some more.

It does seem worth reducing the amplification factor if possible.

An alternative might be to gossip headers to peers we expect might do duplicates and the peer only requests the body if it doesn't hear about it from another source. But such mechanisms are a deeper change to what we have today.

You're probably right that we don't want to make this too large and rather just reduce the amount of gossip via d_lazy

I'm pro doing both! I do think there is a compelling argument to alter this to even 2s or 3s, from the perspective of the IWANT/IHAVE gossip -- if we use this gossip as more of a fallback rather than to ensure super fast delivery, 2s or 3s begins to look reasonable. That said, as you noted, it is also tied to adding/dropping peers, but doing that super fast also seems... a waste.

djrtwo avatar Oct 27 '21 20:10 djrtwo

It does seem worth reducing the amplification factor if possible.

It's important to remember that what the amplification factor primarily does is give the message a good spread at the very beginning, when almost no peers have the block/attestation/etc - it comes with a cost as well of course, but it's not "just" resilience against partitions that gets affected.

might be to gossip headers

Well, this is by and large what IHAVE is :)

subnets

Subnets are pretty much full these days: there are as many committees as there are subnets, every slot, which means that subscribing to a subnet "too early" is really bad from a bandwidth perspective - what we're looking to optimize right now for few-validator nodes is to shorten the time before an aggregation duty at which we subscribe to the topic, so as to avoid attestation traffic that is irrelevant (for that aggregate). The flip side of being more aggressive here is features like backoff, where unsubscribing from a topic prevents you from resubscribing for several slots - this is an issue because the same subnet might be needed fairly soon after an aggregation duty (because randomness).

Numbers would be extremely valuable at this stage, on the relative bandwidth costs of various topics and the metadata around them - it's something that's moving up on our priority list given that Altair measurably increases the cost (CPU/network) of running a node.

arnetheduck avatar Oct 27 '21 20:10 arnetheduck

Meaning, I take information about past messages to not duplicate route for future messages? I'd worry that due to the mesh dynamically changing slot-to-slot that this could create gaps in natural delivery. I'd need to think about it some more.

I think this change is safe and won't affect any natural propagation. Essentially, we currently store a list of seen message ids; if someone sends us a message with the same id, we ignore it. The proposed change would be to store not only the message id, but also the peers that have sent us this exact message. Then, when we propagate/publish, we check that a peer hasn't already sent us this exact message. If a peer has sent us this exact message, it's a waste for us to send it to them.
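
A minimal sketch of that bookkeeping (illustrative types and names, not the rust-libp2p gossipsub internals):

```rust
use std::collections::{HashMap, HashSet};

type PeerId = u64;       // illustrative
type MessageId = String; // illustrative

#[derive(Default)]
struct SeenCache {
    /// For each message id we've seen, the peers that have sent it to us.
    senders: HashMap<MessageId, HashSet<PeerId>>,
}

impl SeenCache {
    /// Record a delivery. Returns true the first time we see the id (i.e. the
    /// message should go to validation), false if it's a duplicate.
    fn record(&mut self, id: MessageId, from: PeerId) -> bool {
        let entry = self.senders.entry(id).or_default();
        let first_time = entry.is_empty();
        entry.insert(from);
        first_time
    }

    /// When forwarding a validated message, skip any mesh peer that has
    /// already sent us this exact message - sending it back is pure waste.
    fn forward_targets(&self, id: &MessageId, mesh: &HashSet<PeerId>) -> Vec<PeerId> {
        let already_sent = self.senders.get(id);
        mesh.iter()
            .copied()
            .filter(|p| already_sent.map_or(true, |s| !s.contains(p)))
            .collect()
    }
}
```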

I know we're going down a proper simulation route with formal numbers, but I was trialling mesh sizes around 4 or 5 and saw around a 1/6 reduction in bandwidth, to give very rough figures. This was on a node subscribed to all subnets all the time (on Prater), however. So there are decent gains to be made.

AgeManning avatar Oct 27 '21 22:10 AgeManning

The proposed change would be to not only store the message-id from peers, but also the peers that have sent us this exact message. Then when we propagate/publish, we check that a peer hasn't already sent us this exact message

wow, I've naively/blindly just assumed this even though I've never seen it written in spec or discussed in implementations.

I thought you meant to use past messages to imply what to do with future messages (which seems unsafe). What you describe seems safe, natural, and insane (?) not to do.

djrtwo avatar Oct 27 '21 23:10 djrtwo

Yeah, I think it hasn't been done because it wasn't clear how much this would save us vs the overhead of storing every peer for every duplicate.

I think in our case, with all the duplicates we have, there would be a decent amount of bandwidth saving. However we'd have to get all the implementations to implement it for it to take effect at a network level :). I'll add it into rust-gossipsub now.

AgeManning avatar Oct 28 '21 00:10 AgeManning

wow, I've naively/blindly just assumed this even though I've never seen it written in spec or discussed in implementations.

This is only relevant when there's a large enough delay between receiving the message and stamping it "ok" for further distribution - for networks that don't validate messages like we do, or where validation is synchronous, it isn't relevant. The cost of it is memory usage - instead of having a single "seen" cache, you end up with one per mesh peer.

I suspect the greatest benefit might be on subnet and aggregate channels where there are bursts of small data packets with fairly high validation time (vs transmission time).

What's nice about it is that it works well to "naturally" alleviate processing in times of stress - when the client / network is less loaded, (almost) no messages will be dropped by this feature, but under load, it kicks in and filters out some redundancy - the more the load, the more it filters.

arnetheduck avatar Oct 28 '21 07:10 arnetheduck

This is only relevant when there's a large enough delay between receiving the message and stamping it "ok" for further distribution - for networks that don't validate messages like we do, or where validation is synchronous, it isn't relevant. The cost of it is memory usage - instead of having a single "seen" cache, you end up with one per mesh peer.

What if we use a counter that bumps on every new message received, giving each message an id that grows from one message to the next? Plus a couple of dictionaries that map 1) counter id to message id and 2) counter id to peer id. Then send messages to peers whose counter id is less than the counter id of the message. Though, it would require sending messages in strict order to be sure that all X messages, starting with counter id N-X and ending with N, have been sent to a peer.

mkalinin avatar Oct 28 '21 08:10 mkalinin

This is only relevant when there's a large enough delay between receiving the message and stamping it "ok" for further distribution - for networks that don't validate messages like we do, or where validation is synchronous, it isn't relevant. The cost of it is memory usage - instead of having a single "seen" cache, you end up with one per mesh peer.

I think this is relevant even without message validation (although message validation probably does make it worse - in effect it just increases message latency). In my small example of an interconnected mesh of size 4, regardless of message validation, on the second hop we could always avoid sending back a duplicate with this method (provided nodes aren't sending to each other in-flight; it's a race condition as to which one we would filter). I think with more complex meshes, where messages bounce around in a triangle a bit more, we'd get more savings.

What if use a counter that bumps every new message received and gives a message an id that grows from one message to another. And a couple of dictionaries that maps 1) counter id on the message id 2) counter id on the peer id. And send messages to peers which counter id is less than a counter id of a message. Though, it would require sending messages in a strict order to be sure that all X messages starting with counter id N-X ending with N have been sent to a peer

This is an interesting idea. I guess it would save us some RAM compared to naively storing the peers per message. It may also be difficult to send in strict order, because we do the message validation asynchronously and some objects can take longer to validate than others.

AgeManning avatar Oct 28 '21 23:10 AgeManning

Message Amplification We have set D to be 8 in the specs which naturally gives us relatively high message amplification. On the block topic (the simplest to reason about), we see on average about 8 duplicates (as would naively be expected) and therefore a bandwidth amplification of about 8x. The other topics are less straightforward.

We can significantly reduce our node's bandwidth by reducing the mesh sizes. The impact of this at a network-level is non-trivial. As we flood-publish, a client's default peer count plays a large role in how many hops or how fast a message propagates as we publish to every peer we know who is connected on that topic. Therefore, it may be possible to have higher peer counts, and lower mesh sizes whilst maintaining a similar number of hops to reach the entire network.

In any case our mesh sizes appear to be larger than needed and we may want to lower them (incrementally on testnets).

On the topic of message amplification, has there been any thought given to tuning the gossipsub D parameters per topic rather than globally? Some topics naturally have many more duplicates simply because message volumes are much smaller for that topic. Parameters that would work well for beacon blocks might be a negative if applied to aggregates or any other noisier channel.

Tuning it per topic would allow us to tune the parameters to the expected message rates in a topic, rather than applying smaller mesh sizes across the board. I understand that this will be a non-trivial change, implementation-wise, to all the gossipsub libraries, but since we are looking at ways to reduce message amplification, this is one direction to go in.

nisdas avatar Oct 29 '21 03:10 nisdas

Some topics naturally have much more duplicates

How is message volume related to duplication?

arnetheduck avatar Oct 30 '21 17:10 arnetheduck

How is message volume related to duplication?

Duplication might be the wrong word, but more IHAVE messages relaying the same message ids.

nisdas avatar Oct 31 '21 16:10 nisdas

The proposed change would be to not only store the message-id from peers, but also the peers that have sent us this exact message. Then when we propagate/publish, we check that a peer hasn't already sent us this exact message. If a peer has sent us this exact message, its a waste for us to send it to them.

Another naive way could be to append to the message the prefixes (first 2 bytes, maybe) of the last few overlays the message has travelled through, and skip retransmitting to similar overlays. This method can still have duplicates, but tight triangular duplications can be avoided. How much the amplification is reduced depends on how many prefixes we choose to send along with the messageID.

Of course, this has the disadvantage that peers will know, to some extent, which direction the message came from.
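
One possible reading of that, sketched under heavy assumptions (hypothetical field and function names; a real version would need to carry the prefix list alongside the message somehow):

```rust
type PeerId = [u8; 32]; // illustrative fixed-size peer identity

/// A message carrying the 2-byte id prefixes of the last few overlays
/// (relays) it passed through, as suggested above.
struct RelayedMessage {
    data: Vec<u8>,
    relay_prefixes: Vec<[u8; 2]>, // bounded, e.g. the last 3 hops
}

/// Skip peers whose id prefix matches one of the recorded hops. Duplicates
/// can still occur (prefixes collide, and only a few hops are recorded), but
/// tight triangular re-sends are avoided - at the privacy cost noted above.
fn should_send(msg: &RelayedMessage, peer: &PeerId) -> bool {
    let prefix = [peer[0], peer[1]];
    !msg.relay_prefixes.contains(&prefix)
}
```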

jmozah avatar Nov 08 '21 06:11 jmozah

I think this is relevant even without message validation

How so? Presumably, without validation you receive a message and rebroadcast it to everyone without delay - i.e. there's no "gap" where nodes can send you messages and thus signal that they have it already - unless one is introduced artificially.

overhead of storing

Just thought a little bit about this, and the overhead should actually be negligible - we only need to store this information for messages that are in-validation - basically, the first time we see the message and send it off to validation, we register it for monitoring - then, when others send it as well, we record that, then when validation is done, we send it to anyone that wasn't in the monitoring record and remove the information completely. Subsequent messages will be discarded by the "seen" cache regardless and won't ever enter the validation pipeline so there's no overhead for them.
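
A hedged sketch of that narrower variant (made-up names; the point is that sender tracking only exists while a message is in validation, so the regular "seen" cache stays a plain set of ids):

```rust
use std::collections::{HashMap, HashSet};

type PeerId = u64;       // illustrative
type MessageId = String; // illustrative

/// Senders are only tracked while a message sits in the validation queue;
/// once validation completes the record is dropped, so the extra memory is
/// bounded by the number of in-flight messages.
#[derive(Default)]
struct InFlight {
    senders: HashMap<MessageId, HashSet<PeerId>>,
}

impl InFlight {
    /// Record a delivery of a message that is currently being validated.
    fn on_receive(&mut self, id: MessageId, from: PeerId) {
        self.senders.entry(id).or_default().insert(from);
    }

    /// Called when validation finishes: take (and forget) the set of peers
    /// that delivered the message while it was in validation, so forwarding
    /// can exclude them.
    fn on_validated(&mut self, id: &MessageId) -> HashSet<PeerId> {
        self.senders.remove(id).unwrap_or_default()
    }
}
```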

counter that bumps

This is an interesting idea, but it also introduces state that must be maintained across the two clients, which I think adds a complication that we should strive to avoid, exhausting other possibilities first.

arnetheduck avatar Nov 08 '21 07:11 arnetheduck

How so? Presumably, without validation you receive a message and rebroadcast it to everyone without delay - ie there's no "gap" where nodes can send you messages and thus signal that they have it already - unless one is introduced artificially.

Yep, you're right. It's only in the validation gap that this needs to be dealt with. Your suggestion is good. I've modified how my implementation stores the peers - shifted it to the mem-cache, and it's a bit cleaner and should reduce the memory footprint.

AgeManning avatar Nov 09 '21 04:11 AgeManning