
Offloading messages for async validation

raulk opened this issue 6 years ago · 18 comments

From @arnetheduck (Nimbus, ETH 2.0 client):

@raulk we discussed topic validation in libp2p as a way to prevent bad information from spreading across the gossipsub network. From what I can tell, the block propagation filtering method in libp2p that you pointed me to is synchronous (https://github.com/libp2p/go-libp2p-pubsub/blob/bfd65a2f6b810c5b4ad2cfe6bb9cc792fd7a0171/floodsub_test.go#L360). This might not sit well with block validation, where we may want to hold off on gossiping a block until we've verified it against data that arrives later. How would you recommend we approach this?

here's the scenario in detail:

  • over gossip, we receive a block whose parent we're missing
  • worst case, this means we cannot yet tell if it's a good / useful block or not
  • we don't want the block to be gossiped further until we've recovered its parent and ensured that it's sane; once we know it's sane, we want to pass it on.

To summarise:

  1. Validation can be costly, and in some scenarios not feasible to perform synchronously.
  2. Is it feasible to consume the message, do validation offline, then republish it? How does that affect message caches and duplicate detection across the network (e.g. if we send the message to peers who had already seen it -- and possibly even propagated it because they had more complete data than us)? Do we generate a new message ID?
  3. What are the differences on the wire between publishing a message afresh and relaying a gossiped message?

In a nutshell: is it possible to offload a message from the pubsub router for async validation, then resume its gossiping conditionally?
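For context, the synchronous hook that exists today looks roughly like this (validator signature and import paths per current go-libp2p-pubsub; the topic name and the haveParent check are hypothetical application code):

import (
    "context"

    pubsub "github.com/libp2p/go-libp2p-pubsub"
    "github.com/libp2p/go-libp2p/core/peer"
)

// registerBlockValidator hooks a synchronous validator onto a topic.
// Returning false drops the message and stops further propagation; there is
// no way to answer "I don't know yet, ask me again later".
func registerBlockValidator(ps *pubsub.PubSub, haveParent func([]byte) bool) error {
    return ps.RegisterTopicValidator("blocks",
        func(ctx context.Context, from peer.ID, msg *pubsub.Message) bool {
            return haveParent(msg.Data)
        })
}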

raulk avatar Mar 18 '19 15:03 raulk

Validation is run asynchronously in a background goroutine.

vyzo avatar Mar 18 '19 16:03 vyzo

@vyzo the concern is not with blocking the gossip thread. The use case is that validation of message M is co-dependent on other messages M’ that could’ve arrived previously, but may not have. If they didn’t, the client can pull them from the network. That process can stretch validation of message M to seconds or more. All the while, Gossipsub has a 150ms validation timeout, and also a throttling mechanism.
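For reference, both of those knobs are per-validator options at registration time. A sketch, assuming ps is the *pubsub.PubSub instance and validateBlock the validator (the concurrency value is illustrative):

import (
    "time"

    pubsub "github.com/libp2p/go-libp2p-pubsub"
)

func registerWithLimits(ps *pubsub.PubSub, validateBlock pubsub.Validator) error {
    return ps.RegisterTopicValidator("blocks", validateBlock,
        pubsub.WithValidatorTimeout(150*time.Millisecond), // the default being discussed
        pubsub.WithValidatorConcurrency(32),               // cap on in-flight validations
    )
}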

Would you mind addressing the questions above so we can all gain more clarity on this scenario? Thanks.

raulk avatar Mar 18 '19 17:03 raulk

With the current implementation it's not possible. With quite a bit of work it may be possible.

vyzo avatar Mar 18 '19 17:03 vyzo

OK, so the validator would have to fail when it enters the non-deterministic scenario. We’d need a callback for failed validations, so that those messages can be processed separately.

Once we’re able to validate the message, we’d have to republish it. What’s the trade-off in terms of amplification and dedup? (It’s still the same message.)
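A sketch of that fail-and-park pattern on the application side (the orphans pool and haveParent check are hypothetical; pubsub.Validator is the real validator type):

import (
    "context"

    pubsub "github.com/libp2p/go-libp2p-pubsub"
    "github.com/libp2p/go-libp2p/core/peer"
)

// orphans is a hypothetical application-side pool of blocks awaiting a parent.
var orphans = make(chan []byte, 1024)

// blockValidator fails validation for orphan blocks but parks them so the
// application can fetch the parent and decide later.
func blockValidator(haveParent func([]byte) bool) pubsub.Validator {
    return func(ctx context.Context, from peer.ID, msg *pubsub.Message) bool {
        if haveParent(msg.Data) {
            return true // parent known: validate and propagate normally
        }
        select {
        case orphans <- msg.Data: // park for later processing
        default: // pool full: just drop
        }
        return false // do not propagate (yet)
    }
}

Note that republishing later through the normal publish path creates a fresh message with a new seqno and hence a new message ID, which is exactly the amplification and dedup question above.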

raulk avatar Mar 18 '19 17:03 raulk

It's a rather complex change to implement. The trade-off is that message propagation would be very slow, as a message wouldn't be forwarded until it could be validated.

vyzo avatar Mar 18 '19 18:03 vyzo

I think that trade-off is known and accepted. They basically want nodes to forward only messages whose correctness can be verified against past state (e.g. a block depends on its parent). Since these systems are asynchronous and eventually consistent, it's possible that gossiped messages arrive out of order. It's also possible that a gossiped message never arrives, correct?

That’s OK. I’m more worried about the extra amplification: the message cache could’ve slid past the entry by the time the message is republished, so it could traverse the entire network again. Gossipsub wouldn’t dedup it; they’d have to dedup in their own logic.

When you publish a message, can you force the original message ID?
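(An aside for later readers: versions of go-libp2p-pubsub released after this discussion approximate this via WithMessageIdFn, which derives the message ID from content instead of sender and seqno, so a republished payload hashes to the same ID and dedups within the seen window. A sketch, assuming that option is available in your version:)

import (
    "context"
    "crypto/sha256"
    "encoding/hex"

    pubsub "github.com/libp2p/go-libp2p-pubsub"
    pb "github.com/libp2p/go-libp2p-pubsub/pb"
    "github.com/libp2p/go-libp2p/core/host"
)

// newPubSub installs a content-addressed message ID function, so republishing
// the same payload produces the same message ID.
func newPubSub(ctx context.Context, h host.Host) (*pubsub.PubSub, error) {
    return pubsub.NewGossipSub(ctx, h,
        pubsub.WithMessageIdFn(func(m *pb.Message) string {
            sum := sha256.Sum256(m.Data)
            return hex.EncodeToString(sum[:])
        }),
    )
}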

raulk avatar Mar 18 '19 19:03 raulk

Re dedup, I don't think any sane eth2 client will rely on libp2p-level dedup - we have a block merkle root by which we identify payloads, both when requesting and when receiving them from the network - and this root is persistent across sessions.

I'd regard that part of the protocol as a nice-to-have optimization, nothing else. In fact, I find it hard to imagine an application that relies on once-only ordered delivery on top of a gossip setting and is correct at the same time.
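A sketch of that application-level dedup, keyed on a content root rather than the transport-level message ID (sha256 stands in for the block's merkle root; persist the set to survive restarts):

import (
    "crypto/sha256"
    "sync"
)

// seenSet dedups payloads by content root, independently of pubsub's
// message ID, so dedup survives republishing.
type seenSet struct {
    mu    sync.Mutex
    roots map[[32]byte]struct{}
}

func newSeenSet() *seenSet {
    return &seenSet{roots: make(map[[32]byte]struct{})}
}

// firstSeen reports whether this payload is new, recording it if so.
func (s *seenSet) firstSeen(block []byte) bool {
    root := sha256.Sum256(block) // stand-in for the block's merkle root
    s.mu.Lock()
    defer s.mu.Unlock()
    if _, ok := s.roots[root]; ok {
        return false
    }
    s.roots[root] = struct{}{}
    return true
}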

Perhaps the right thing to do here is simply not to broadcast the message again. It's kind of natural that broadcasts are ephemeral, and trying to get that behavior from a gossip network goes against its grain somewhat.

It does raise an interesting question: how would a sat-link connection with high latency affect the system? How is the cache timeout tuned? The problem can happen naturally, in the wild, as well.

arnetheduck avatar Mar 18 '19 19:03 arnetheduck

I’m talking about dedup insofar as controlling amplification is concerned, @arnetheduck. This is important to prevent cycling.

raulk avatar Mar 18 '19 20:03 raulk

(Of course apps should ensure idempotency when relying on pubsub.)

raulk avatar Mar 18 '19 21:03 raulk

I’m talking about dedup insofar as controlling amplification is concerned, @arnetheduck. This is important to prevent cycling.

Yeah, sorry for being unclear there: that's what I was alluding to with the sat-link question - how is the anti-cycling tuned with respect to high-latency links?

arnetheduck avatar Mar 18 '19 21:03 arnetheduck

Right now it's not adaptive. We should explore this case together ;-) @arnetheduck

raulk avatar Mar 21 '19 12:03 raulk

Copying over from the ethresearch/p2p Gitter thread:

Kevin Mai-Husan Chia @mhchia 12:07
We can use Validator to validate received content and return a boolean telling the pubsub whether to relay it or not. IMO in the simple cases the current structure is enough for our usage. However, in the situation pointed out in this discussion, later blocks might be received before the earlier blocks. The Validator runs for those "orphan blocks" will then block, and the Validators will time out. Even without the timeout, the number of concurrent Validators might grow too large.

Raúl Kripalani @raulk 12:11
@mhchia thanks for rescuing that thread! A change to make validation async would be welcome; it wouldn't be too difficult. There's already an abstraction for datastores, so you would inject the datastore into the pubsub router, have it persist messages it is unable to validate instantaneously, then spawn the validation job and report the result to the router later. We'd need some form of GC to drop persisted messages after a grace period if the validation result never arrives.
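A sketch of that datastore-backed flow (the type, method names, key scheme and grace period are hypothetical; Put/Delete signatures per current go-datastore):

import (
    "context"
    "time"

    ds "github.com/ipfs/go-datastore"
)

// pendingStore persists messages that could not be validated on the spot.
// A background loop should query the /pending/ prefix and delete entries
// older than the grace period if no verdict ever arrives.
type pendingStore struct {
    store ds.Datastore
    grace time.Duration // e.g. a few minutes; illustrative
}

// park offloads an unvalidated message from memory to the datastore.
func (p *pendingStore) park(ctx context.Context, id string, raw []byte) error {
    return p.store.Put(ctx, ds.NewKey("/pending/"+id), raw)
}

// resolve removes a message once its validation verdict has been reported.
func (p *pendingStore) resolve(ctx context.Context, id string) error {
    return p.store.Delete(ctx, ds.NewKey("/pending/"+id))
}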

raulk avatar Apr 03 '19 11:04 raulk

By popular demand, we need to take this up; see #172. I have a design in mind which I’ll post later, as I’m on mobile now.

raulk avatar Apr 11 '19 08:04 raulk

An async validator feature could look like this:

type AsyncValidationResult struct {
    Msg    *pubsub.Message // the message that was queued for validation
    Result error           // nil if the message passed validation
}

type AsyncValidator interface {
    // Queue enqueues a message for future validation. If the returned error is nil,
    // the implementation promises to validate the message and deliver the result
    // on the supplied channel at a later time.
    //
    // The async validator is responsible for offloading the message from memory
    // when appropriate. It can use a Datastore or some other medium for this.
    Queue(ctx context.Context, msg *pubsub.Message, resp chan<- AsyncValidationResult) error
}

We'd need to work out how offloading a message would impact message caches and sliding windows.
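To make the contract concrete, a toy implementation might look like this (hypothetical; nothing here exists in the library, and check stands in for the application's validation logic):

// memAsyncValidator resolves every queued message on its own goroutine.
type memAsyncValidator struct {
    check func(*pubsub.Message) error // the application's validation logic
}

func (v *memAsyncValidator) Queue(ctx context.Context, msg *pubsub.Message,
    resp chan<- AsyncValidationResult) error {
    go func() {
        res := AsyncValidationResult{Msg: msg, Result: v.check(msg)}
        select {
        case resp <- res:
        case <-ctx.Done(): // the router gave up waiting
        }
    }()
    return nil
}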

raulk avatar Apr 11 '19 10:04 raulk

The seen cache would be most severely impacted, as messages could be rebroadcast into the network well after the 120s cache duration. We need to consider the effects of this.

vyzo avatar Apr 11 '19 11:04 vyzo

In terms of structure, we can add an API for forwarding prepared messages (i.e. messages published by someone else, already signed). This way we can offload the message for async validation; when the validator has completed, it can forward the message using the new API.
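Hypothetically, such an API could be as small as this (a sketch only; no such interface exists in the library):

import (
    "context"

    pubsub "github.com/libp2p/go-libp2p-pubsub"
)

// MessageForwarder sketches the proposed API: re-inject a message exactly as
// received from the wire, preserving its signature, seqno and message ID, so
// peers that already saw it can still dedup.
type MessageForwarder interface {
    // ForwardMessage forwards a prepared (already published and signed)
    // message once out-of-band validation has completed.
    ForwardMessage(ctx context.Context, msg *pubsub.Message) error
}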

vyzo avatar Apr 11 '19 11:04 vyzo

#176 supports long-running validators in the simplest possible manner: it removes the default (short) timeout and allows validators to run arbitrarily long, without any need for API changes or complex contraptions.

vyzo avatar Apr 26 '19 08:04 vyzo

Note that you need to adjust the time cache duration accordingly.
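Concretely, the seen-message window is a package-level variable (120s by default at the time of writing) and must be raised before the router is constructed; a sketch with an illustrative value:

import (
    "time"

    pubsub "github.com/libp2p/go-libp2p-pubsub"
)

func init() {
    // Widen the seen cache so a message still in validation isn't treated
    // as new when peers re-send it minutes later.
    pubsub.TimeCacheDuration = 10 * time.Minute // illustrative value
}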

On the other hand there is still a use case for completely offline validators, which could take days to complete.

vyzo avatar Apr 26 '19 09:04 vyzo