aw icon indicating copy to clipboard operation
aw copied to clipboard

Incorrect usage of timeouts in gossip

Open rahulghangas opened this issue 3 years ago • 4 comments

While gosspping, two different timeouts are used - defined by context supplied by user - defined by value of field in the GossipOptions struct

The first one defies the total time for a round of gossipping, while the second one defines the timeout for sending each message. When a sync message is received, the gossipper tries to propagate the new message by re-gossipping it. However, the gossipper now has no notion of the timeout supplied by the user (and that particular context has probably gone out scope!). Currently we use the second timeout to a call to Gossip, but the gossipping might fail if the first attempt to message a peer fails (since that uses the same timeout)

rahulghangas avatar Feb 27 '21 21:02 rahulghangas

@rahulghangas Not sure I fully understand the issue/implication of the dual timeouts. Do you mind listing out a concrete example?

jazg avatar Mar 01 '21 05:03 jazg

The call to gossip has the following signature

func (g *Gossiper) Gossip(ctx context.Context, contentID []byte, subnet *id.Hash)

where the context provided usually has a timeout of, say 5-10 seconds. On the other hand, if you look at didReceiveSync

func (g *Gossiper) didReceiveSync(from id.Signatory, msg wire.Msg) {
...
...
...
ctx, cancel := context.WithTimeout(context.Background(), g.opts.Timeout)
defer cancel()

g.Gossip(ctx, msg.Data, &subnet)

after receiving the sync message, we regossip the message, but using the internal timeout. The internal timeout is supposed to be used to define the time limit of sending a single message, not a whole round of gossipping.

rahulghangas avatar Mar 01 '21 05:03 rahulghangas

@rahulghangas Got it. Could we send to the recipients in parallel inside the Gossip function to resolve this? I will check with @loongy to gauge the intended usage of the timeout.

jazg avatar Mar 01 '21 05:03 jazg

That could potentially solve the issue, and the call to gossip wouldn't require a context as input since we'll use the internal timeout to define a context and use it on all messages (in parallel)

rahulghangas avatar Mar 01 '21 05:03 rahulghangas