iroh icon indicating copy to clipboard operation
iroh copied to clipboard

feat(iroh): Gossip client

Open rklaehn opened this issue 1 year ago • 9 comments

Description

This makes gossip available in iroh. Using iroh-gossip directly, while not horrible, is a bit verbose.

Breaking Changes

Not sure. It exports a few more things, and adds things. But in theory it should not modify existing things.

Notes & open questions

  • There can be some scenarios where this can cause trouble. E.g. when subscribing and then unsubscribing to a topic that is also used for doc sync.
  • Gossip messages are deduped it seems. So if you send the exact same message twice it will be dropped. If that's how it work we must document it.
  • I see lots of errors like the following:
2024-05-01T13:16:20.430596Z ERROR sync{me=kkpbi4eb5h6jcdjz}:sync{me=kkpbi4eb5h6jcdjz}: iroh::sync_engine::gossip: Failed to process gossip event namespace=atqlwoptbmnd724j err=Serde Deserialization Error

It seems that the sync engine thinks that it "owns" gossip, so it uses subscribe_all and then will deserialize everything, even messages that are not for it. How can we avoid that? Using an entirely different ALPN for normal gossip?

  • What should be the behaviour when a subscriber lags? Close it or just send Lagged and continue?

I think that relying on guaranteed delivery for gossip is not a good idea in any case. so maybe just sending a Lagged but then continuing is best? Forcing the client to resubscribe could be a bit annoying.

Change checklist

  • [ ] Self-review.
  • [ ] Documentation updates if relevant.
  • [ ] Tests if relevant.
  • [ ] All breaking changes documented.

rklaehn avatar May 01 '24 13:05 rklaehn

I think it is safe to just ignore any messages that are not for us. See https://github.com/n0-computer/iroh/pull/2258/commits/b88398e743935c5975571e8b4fcb54c929ddc5e8

Obviously you can mess with things by subscribing to a namespace that is currently being synced, but doing so accidentally is highly unlikely. And you can also mess with other stuff by using the low level blobs api, so 🤷 .

rklaehn avatar May 01 '24 14:05 rklaehn

It seems that due to the fact that the rpc protocol is public, we can not ever do an additional rpc request without breaking compat.

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.31.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field Iroh.gossip in /home/runner/work/iroh/iroh/iroh/src/client.rs:44

--- failure enum_variant_added: enum variant added on exhaustive enum ---

Description:
A publicly-visible enum without #[non_exhaustive] has a new variant.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#enum-variant-new
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.31.0/src/lints/enum_variant_added.ron

rklaehn avatar May 01 '24 15:05 rklaehn

we need to mark the rpc enums as non exhaustive, that should solve this

dignifiedquire avatar May 01 '24 20:05 dignifiedquire

longterm I wish the rpc types where all private anyway

dignifiedquire avatar May 01 '24 20:05 dignifiedquire

It seems that the sync engine thinks that it "owns" gossip, so it uses subscribe_all

@Frando can we change this? seems unfortunate to subscribe to everything by default

dignifiedquire avatar May 03 '24 10:05 dignifiedquire

It seems that the sync engine thinks that it "owns" gossip, so it uses subscribe_all

@Frando can we change this? seems unfortunate to subscribe to everything by default

I think we can, yes. What is our primitive of choice these days to merge a changing list of streams?

with futures and thus its merge combinators phased out in iroh - maybe https://docs.rs/futures-concurrency/latest/futures_concurrency/stream/trait.Merge.html ?

Frando avatar May 03 '24 12:05 Frando

  • https://docs.rs/futures-lite/latest/futures_lite/stream/trait.StreamExt.html#method.race

dignifiedquire avatar May 03 '24 13:05 dignifiedquire

I did not yet give the new gossip dispatcher module a real review, but my first thought is: What about we move the gossip_dispatcher from iroh to iroh-gossip? And make it the main interface. Because then we don't need broadcast channels at all anymore. And save one channel with its buffers. The sync engine would then use the gossip dispatcher, as would the RPC handlers. And if we get things right, less code - because the sync engine also dances around the different state (but your dispatcher has the nicer impl with the state enum).

Frando avatar May 03 '24 22:05 Frando

#2265 changes the sync engine usage of gossip to not use subscribe_all (based on this branch). However as outlined in the previous comment I think I'd prefer to move the new gossip_dispatcher module to iroh_gossip, and make both the node's gossip client and the node's sync engine use that dispatcher.

Frando avatar May 06 '24 08:05 Frando

@Frando so the basic idea of the gossip dispatcher is just to have a thing that wraps the gossip and does some management so that it is less fragile to use. It does the join/unjoin etc. for you and deals with laggy subscribers.

If you agree with / can live with the public interface (which is just the subscribe function), we could move something with that signature into iroh-gossip as a higher level interface and then I would use it here, and you could update docs to use it (or not...).

The interface is basically this:

pub fn subscribe(
    &self,
    msg: GossipSubscribeRequest,
    updates: UpdateStream,
) -> impl Stream<Item = RpcResult<GossipSubscribeResponse>> {

where UpdateStream is just a stream of either broadcast or broadcast_neighbours:

/// Send a gossip message
#[derive(Serialize, Deserialize, Debug)]
pub enum GossipSubscribeUpdate {
    /// Broadcast a message to all nodes in the swarm
    Broadcast(Bytes),
    /// Broadcast a message to all direct neighbors
    BroadcastNeighbors(Bytes),
}

There is a hardcoded constant somewhere about how much you are allowed to lag before you are thrown out. We could make that configurable as well.

rklaehn avatar Jun 05 '24 11:06 rklaehn

I think the API is fine. And still very much in favor of having only one and not 2 high level dispatch interfaces to gossip.

Frando avatar Jun 05 '24 14:06 Frando

Ok, I will clean up the dispatcher a bit and make sure it is independent of the rpc protocol, in preparation for moving it into iroh-gossip.

rklaehn avatar Jun 05 '24 14:06 rklaehn

Moved the dispatcher to iroh-gossip and moved the types into the dispatcher file.

I reexported some of the dispatcher types instead of making newtypes for them in the rpc protocol. Not 100% sure about this, but making so many single element newtypes is a drag...

One question is if the dispatcher should present a simplified interface for creation. Currently it takes a Gossip. But I don't really see how this can be made really nice. Take an endpoint and an options object?

rklaehn avatar Jun 05 '24 14:06 rklaehn

This is missing adapting the docs stuff to the new gossip dispatcher. Other than that I think it is good to go.

rklaehn avatar Jun 06 '24 09:06 rklaehn

Also at least a smoke test for the gossip client API would be good so that we don't break things.

Frando avatar Jun 07 '24 09:06 Frando

@Frando I have added a test. It seems to work, now that I have added the ability to tell the endpoint about node addrs (see https://github.com/n0-computer/iroh/pull/2433 ).

I think this kind of test is nice, because it forces us to make the client API sufficiently rich that you can do tests with it. Also it is much less likely to fluke out than a full cli type tests.

But - do you think this test for gossip will reliably work? I am connecting two nodes and then send a msg from one to the next. Can there be a situation where I send the message too early and it gets lost?

rklaehn avatar Jul 01 '24 12:07 rklaehn

But - do you think this test for gossip will reliably work? I am connecting two nodes and then send a msg from one to the next. Can there be a situation where I send the message too early and it gets lost?

Yes, Gossip::join is awaited here, and that waits for at least one neighbor to come up before completing.

So the remaining TODO is to merge the two dispatchers (in dispatcher.rs and net.rs) so that messages flow only over a single channel and not two. We should do this, but IMO it can also happen in a followup.

Frando avatar Jul 05 '24 06:07 Frando