
[Feature] Improve transaction propagation

Open vicsn opened this issue 2 months ago • 3 comments

🚀 Feature

A core feature of Narwhal is that transmissions may be included in multiple proposals.

image

The current setup of snarkOS errs heavily on the side of safety: clients and validators propagate every valid transmission they see to all of their peers, which costs compute and bandwidth and adds a lot of latency. The graph below shows that certificate generation slows down significantly under load. Previous measurements have shown that this slowdown is in turn caused by nodes waiting on transmission fetching.

Image

We shouldn't get rid of propagation entirely either. In a load test with 6000 transmissions on reference hardware and no propagation at all, all transmissions sometimes landed within a few rounds, but more often only around 5750 landed. This indicates that some older certificates get left behind under stress, so we need at least some propagation.

To improve network throughput, I propose the following:

  • We add a propagate: bool field to struct UnconfirmedTransaction.
  • When clients receive a transaction via /broadcast/transaction, they broadcast to all peers with propagate: true.
  • When clients receive a transaction via the P2P network, they broadcast to all peers with propagate: false, but only if propagate: true on the incoming message.
  • When validators receive a transaction via /broadcast/transaction, they broadcast to all validators with propagate: false.
  • When validators receive a transaction via the P2P network, they broadcast to all peers with propagate: false, but only if propagate: true on the incoming message. Receiving validators should immediately add the transmission into cache_transmissions so they don't have to fetch it.
  • In order to ensure all transmissions land, validators periodically (say every PRIMARY_PING_IN_MS) include transmissions from cache_transmissions that they have not yet seen in any proposal, certificate, or the ledger. We may want to tackle this last point only after https://github.com/ProvableHQ/snarkOS/issues/3961 is done. This can also be done based on the validator index.

The above is also applicable to solutions.
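The forwarding rules in the list above can be written down as a small decision function. This is only a sketch to make the rules concrete: `Role`, `Source`, `Audience`, `Forward`, and `forwarding_decision` are hypothetical names made up for illustration and are not existing snarkOS types or functions.

```rust
/// Role of the node making the decision (hypothetical, for illustration).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Role {
    Client,
    Validator,
}

/// How the transaction reached this node (hypothetical, for illustration).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Source {
    /// Submitted through the `/broadcast/transaction` endpoint.
    Rest,
    /// Received over the P2P network, carrying the sender's `propagate` flag.
    P2p { propagate: bool },
}

/// Which peers a forwarded transaction is sent to.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Audience {
    AllPeers,
    ValidatorsOnly,
}

/// A decision to forward, with the `propagate` flag to set on the outgoing message.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Forward {
    pub to: Audience,
    pub propagate: bool,
}

/// Returns `None` when the transaction should not be forwarded any further.
pub fn forwarding_decision(role: Role, source: Source) -> Option<Forward> {
    match (role, source) {
        // Clients that receive a submission start a two-hop broadcast.
        (Role::Client, Source::Rest) => Some(Forward { to: Audience::AllPeers, propagate: true }),
        // Validators that receive a submission send it to validators only, as a final hop.
        (Role::Validator, Source::Rest) => Some(Forward { to: Audience::ValidatorsOnly, propagate: false }),
        // Anything received over P2P gets exactly one more hop if the sender asked for it.
        (_, Source::P2p { propagate: true }) => Some(Forward { to: Audience::AllPeers, propagate: false }),
        // Otherwise propagation stops here.
        (_, Source::P2p { propagate: false }) => None,
    }
}
```

Under these rules a transaction travels at most two P2P hops from the node it was submitted to, which is why the approach depends on how well connected the routers are.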

Note that the above approach to optimistic broadcast only works when validators' routers are well connected. With large networks and peer limits of 21, that may not be the case, so we may also want to move transmission broadcasts to the Gateway.

vicsn avatar Oct 23 '25 09:10 vicsn

Update: the test in 6b0d0cee0 is very promising. The gaps between block generation have mostly disappeared:

Image

vicsn avatar Oct 28 '25 14:10 vicsn

The following case is problematic: Consider two clients (C1 and C2) and one validator V. C1 is connected only to C2, and V is connected to C2 (and other validators, but that is not important).

C1 <-> C2 <-> V

Now, someone submits a transaction to C1. C1 then forwards it to C2 with (propagate: true). At this point, C2 will not forward the transaction anymore, and it will never reach validators. A short-term "fix" would be for clients that are not connected to any validators to always propagate transactions.

A long-term fix is to implement a proper gossip protocol. Nodes periodically send an "inventory" message that contains hashes of all new transactions they received since the last inventory message (similar to inventory vectors in Bitcoin). Nodes send an inventory message after receiving a certain number of new unconfirmed transactions (e.g. 10), or after some timer expires (e.g. 50ms). Peers can then request a set of transactions from the node that sent the inventory message. This would avoid sending the full serialized transactions to all peers.
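The batching trigger described above (announce after a number of new transactions, or after a timer expires) could look roughly like the sketch below. `InventoryBatcher`, `TxId`, and the method names are hypothetical and not part of snarkOS; the thresholds are passed in by the caller, and deduplication and the request/response side of the protocol are left out.

```rust
use std::time::{Duration, Instant};

/// Hypothetical transaction identifier; a real node would use its own ID type.
pub type TxId = [u8; 32];

/// Collects IDs of newly received transactions and decides when to announce them.
pub struct InventoryBatcher {
    pending: Vec<TxId>,
    max_pending: usize,
    max_delay: Duration,
    /// Arrival time of the oldest ID that has not been announced yet.
    oldest: Option<Instant>,
}

impl InventoryBatcher {
    pub fn new(max_pending: usize, max_delay: Duration) -> Self {
        Self { pending: Vec::new(), max_pending, max_delay, oldest: None }
    }

    /// Records a new ID. Returns a batch to announce once the size threshold is reached.
    pub fn push(&mut self, id: TxId, now: Instant) -> Option<Vec<TxId>> {
        if self.pending.is_empty() {
            self.oldest = Some(now);
        }
        self.pending.push(id);
        if self.pending.len() >= self.max_pending {
            Some(self.flush())
        } else {
            None
        }
    }

    /// Called on a timer tick. Returns a batch once the oldest pending ID has waited long enough.
    pub fn poll(&mut self, now: Instant) -> Option<Vec<TxId>> {
        match self.oldest {
            Some(since) if now.duration_since(since) >= self.max_delay => Some(self.flush()),
            _ => None,
        }
    }

    /// Hands out all pending IDs and resets the timer.
    fn flush(&mut self) -> Vec<TxId> {
        self.oldest = None;
        std::mem::take(&mut self.pending)
    }
}
```

With the example values from the comment, a node would construct this as `InventoryBatcher::new(10, Duration::from_millis(50))`, send each returned batch to its peers as one inventory message, and serve the full transactions only on request.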

kaimast avatar Dec 13 '25 20:12 kaimast

> and it will never reach validators.

Yeah, you're right. I have a slightly different issue in mind, but the core of it is that my limited propagation does not work.

A client router does not validate signatures or anything when a peer advertises that it's a validator. So we shouldn't change behaviour based on this untrusted data.

Letting clients just always propagate should always work, and may already be what we do in the code today. If in the future we want to reduce bandwidth at the expense of latency we could propagate to X% of peers...
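One way to picture "propagate to X% of peers" is the sketch below. It is an assumption-laden illustration, not a proposal for the actual implementation: `select_fanout` is a hypothetical function, and it picks peers deterministically by hashing the transaction ID together with each peer, only so that the example needs nothing beyond the standard library. A real node would more likely sample peers at random.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Picks roughly `percent` percent of `peers` to forward a transaction to.
/// The choice depends on `tx_id`, so different transactions go to different subsets.
pub fn select_fanout<P: Hash + Clone>(tx_id: &[u8], peers: &[P], percent: u8) -> Vec<P> {
    let percent = usize::from(percent.min(100));
    // Round up so that any non-zero percentage reaches at least one peer.
    let count = (peers.len() * percent + 99) / 100;

    // Score every peer by hashing it together with the transaction ID.
    let mut scored: Vec<(u64, &P)> = peers
        .iter()
        .map(|peer| {
            let mut hasher = DefaultHasher::new();
            tx_id.hash(&mut hasher);
            peer.hash(&mut hasher);
            (hasher.finish(), peer)
        })
        .collect();

    // Keep the `count` peers with the lowest scores.
    scored.sort_by_key(|(score, _)| *score);
    scored.into_iter().take(count).map(|(_, peer)| peer.clone()).collect()
}
```

With `percent` set to 100 this degenerates to "always propagate to everyone", which is the behaviour described above as always working; lower values trade bandwidth for the chance that a transaction needs more hops to reach a validator.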

vicsn avatar Dec 14 '25 18:12 vicsn