radicle-link

Peer is flooded with new ingress streams

Open geigerzaehler opened this issue 4 years ago • 28 comments

When connecting to either setzling or sprout we see the following log line every 200ms

Jul 26 12:15:18.816  INFO accept{peer_id=hyb79bxwfz8mbmqarrok53fj9cnmmey6oujg1yfoysjjo1rmzojuhw}:
  disco:discovered{remote_id=hynkyndc6w3p8urucakobzna7sxwgcqny7xxtw88dtx3pkf7m3nrzc}:
  incoming{remote_id=hynkyndc6w3p8urucakobzna7sxwgcqny7xxtw88dtx3pkf7m3nrzc remote_addr=34.141.175.140:12345}:
  librad::net::protocol::io::streams: new ingress stream

We also see the following log line for a while, until it goes away (possibly after replicating a first project)

Jul 26 12:07:56.240  WARN radicle_daemon::peer::run_state::running_waiting_room:
  waiting room error error=MissingUrn(Urn { id: Oid(1183cf230ec31058fa955e69b4172086161641de), path: Some(RefLike("rad/id")) })

This happens even if I start Upstream with a clean slate and I don’t replicate any projects.

I’ve noticed that once my peer is connected to multiple peers, I see this log line for every peer. This puts significant pressure on the CPU, resulting in 50% utilization. This is a high-severity issue for Upstream users, since the app and their machines become unresponsive.

Steps to reproduce

  1. Start the most recent version (0.2.9 or master) of Upstream from the command line with a fresh profile using RAD_HOME
  2. Create a profile
  3. Look at the command-line output

geigerzaehler avatar Jul 26 '21 10:07 geigerzaehler

Is it the same remote_addr always?

kim avatar Jul 27 '21 14:07 kim

Is it the same remote_addr always?

Yes it is. Once I’m connected to more peers this happens with every peer. Also the Oid for the waiting room warning is always the same.

geigerzaehler avatar Jul 27 '21 15:07 geigerzaehler

I don’t know of anything which would trigger packets at a 200ms interval, except perhaps an amplification effect of something malfunctioning with the “waiting room”.

kim avatar Jul 27 '21 17:07 kim

@geigerzaehler If you could conjure up debug logs from the seed that correspond to the Upstream peer, that would help with narrowing down the interaction that is causing this.

xla avatar Jul 27 '21 18:07 xla

I’ve noticed that once my peer is connected to multiple peers, I see this log line for every peer. This puts significant pressure on the CPU, resulting in 50% utilization. This is a high-severity issue for Upstream users, since the app and their machines become unresponsive.

Does this then rule out a potential issue in the seed node/library?

cloudhead avatar Jul 27 '21 21:07 cloudhead

@geigerzaehler If you could conjure up debug logs from the seed that correspond to the Upstream peer, that would help with narrowing down the interaction that is causing this.

Here’s the seed log just before the issue starts. Afterwards the log basically repeats itself.

seed.log

geigerzaehler avatar Jul 28 '21 10:07 geigerzaehler

A couple of things I noticed while investigating this that may contribute to the issue:

  • We persist the waiting room whenever its state changes. This may happen quite frequently, for example on every Have message received by the protocol, even the uninteresting ones. Persistence is probably not cheap.
  • Are the calls to kv for persistence blocking on IO? If so, this might be a problem, because we don’t spawn them on a separate thread.
  • We’re using tokio::task::spawn_blocking instead of unblocking throughout the daemon codebase
  • daemon::peer::announcement::run reads from and writes to the store, but the calls are not synchronized. It is possible that two calls of run() execute concurrently and interfere with each other.
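On the first point, one way to make frequent persistence cheaper is to coalesce writes: only persist when the state is dirty, and at most once per interval. A minimal, purely illustrative sketch (the `Debounced` type and its threshold are hypothetical, not daemon code):

```rust
use std::time::{Duration, Instant};

/// Hypothetical write-coalescing guard for waiting-room persistence:
/// instead of writing on every state change (e.g. every `Have`), write
/// only when dirty and at most once per `min_interval`.
struct Debounced {
    min_interval: Duration,
    last_write: Option<Instant>,
    dirty: bool,
}

impl Debounced {
    fn new(min_interval: Duration) -> Self {
        Self { min_interval, last_write: None, dirty: false }
    }

    /// Record that the in-memory state changed.
    fn mark_dirty(&mut self) {
        self.dirty = true;
    }

    /// Returns true if a write should happen now, and records it.
    fn should_persist(&mut self, now: Instant) -> bool {
        let due = match self.last_write {
            None => true,
            Some(t) => now.duration_since(t) >= self.min_interval,
        };
        if self.dirty && due {
            self.dirty = false;
            self.last_write = Some(now);
            true
        } else {
            false
        }
    }
}
```

The trade-off is that a crash can lose up to one interval's worth of waiting-room updates, which may be acceptable for state that is rebuilt from gossip anyway.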

geigerzaehler avatar Jul 28 '21 16:07 geigerzaehler

Seems like https://github.com/radicle-dev/radicle-client-services/issues/5 is related, and that one doesn't involve a waiting room or any complex behavior. It seems like peers are establishing large numbers of connections to other peers, something we've been seeing on the seed node for a while.

It could either be malicious behavior, or some kind of feedback loop.

In any case it might help to limit the number of ingress streams per peer ID, as a mitigation.
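A minimal sketch of such a per-peer cap, assuming a hypothetical `IngressLimiter` consulted by the accept loop (the peer identifier is simplified to a string here; this is not librad API):

```rust
use std::collections::HashMap;

/// Stand-in for librad's peer identifier.
type PeerId = String;

/// Hypothetical per-peer cap on concurrently open ingress streams.
struct IngressLimiter {
    max_per_peer: usize,
    open: HashMap<PeerId, usize>,
}

impl IngressLimiter {
    fn new(max_per_peer: usize) -> Self {
        Self { max_per_peer, open: HashMap::new() }
    }

    /// Returns true if a new stream from `peer` is allowed, recording it.
    fn try_accept(&mut self, peer: &PeerId) -> bool {
        let count = self.open.entry(peer.clone()).or_insert(0);
        if *count >= self.max_per_peer {
            false // over the cap: reject the stream
        } else {
            *count += 1;
            true
        }
    }

    /// Call when a stream from `peer` closes.
    fn release(&mut self, peer: &PeerId) {
        if let Some(count) = self.open.get_mut(peer) {
            *count = count.saturating_sub(1);
        }
    }
}
```

A real implementation would also need to evict entries for departed peers, but the counting logic is the whole idea.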

cloudhead avatar Jul 29 '21 08:07 cloudhead

Something I find weird about this log is that a peer keeps advertising the same Have repeatedly:

    24 Jul 28 10:00:52.008 DEBUG accept{peer_id=hybz9gfgtd9d4pd14a6r66j5hz6f77fed4jdu7pana4fxaxbt369kg}:incoming:incoming{remote_id=hyd7cb39wws6jz5wmifzwtjn8u57siw6knipqttk6nhc657r1fozaw remote_addr=188.120.235.3:12345}:tock{tock=AttemptSend { to: GenericPeerInfo { peer_id: hyd7cb39wws6jz5wmifzwtjn8u57siw6knipqttk6nhc657r1fozaw, advertised_info: PeerAdvertisement { listen_addrs: Within { inner: [188.120.235.3:12345], min: 0, max: 16 }, capabilities: {} }, seen_addrs: Within { inner: [188.120.235.3:12345], min: 0, max: 16 } }, message: Membership(Neighbour { info: PeerAdvertisement { listen_addrs: Within { inner: [10.156.15.198:12345, 127.0.0.1:12345], min: 0, max: 16 }, capabilities: {} }, need_friends: None }) }}:incoming{remote_id=hyd7cb39wws6jz5wmifzwtjn8u57siw6knipqttk6nhc657r1fozaw remote_addr=188.120.235.3:12345}:tock{tock=SendConnected { to: hybdj8d47ew35471nu4gs7pco8p9xib1iksdmtygy43ghyu5ypaumg, message: Gossip(Have { origin: GenericPeerInfo { peer_id: hydd5idud5sbtkdmb8cmuprpf3rsj4ajper3j4738f9efw7nunpgxe, advertised_info: PeerAdvertisement { listen_addrs: Within { inner: [127.0.0.1:37109, 192.168.4.41:37109], min: 0, max: 16 }, capabilities: {} }, seen_addrs: Within { inner: [], min: 0, max: 16 } }, val: Payload { urn: Urn { id: Oid(1183cf230ec31058fa955e69b4172086161641de), path: Some(RefLike("rad/id")) }, rev: Some(Git(b804589da7f790d471c8c1870d880f7ac6b471d9)), origin: None } }) }}: librad::net::protocol::tick: tock

Doing rg b804589da7f790d471c8c1870d880f7ac6b471d9 seed.log | wc will bring up 64 occurrences. They also happen very close to each other.

FintanH avatar Jul 29 '21 09:07 FintanH

In any case it might help to limit the amount of ingress streams per-peerid, as a mitigation.

But radicle-link recently added rate limiting, so this is even more curious :thinking:

FintanH avatar Jul 29 '21 09:07 FintanH

Rate limiting is of limited use due to the stream-per-message model. That was a bad idea, and I'm changing it. At the end of the day, a peer flooding the network cannot really be defended against, unless we auto-ban them permanently.

kim avatar Jul 29 '21 09:07 kim

Any update on this? This issue often makes Upstream unusable :(

geigerzaehler avatar Aug 16 '21 14:08 geigerzaehler

Any update on this? This issue often makes Upstream unusable :(

I thought the last thing we talked about was Upstream updating librad to this commit https://github.com/radicle-dev/radicle-link/commit/cb91ccdd07d08276112c876a670941f9be8f18ef and seeing how that might improve usage?

FintanH avatar Aug 16 '21 14:08 FintanH

Also heard from @cloudhead on matrix that it was sprout misbehaving. Did no further action follow this statement?

kim avatar Aug 16 '21 15:08 kim

I thought the last thing we talked about was Upstream updating librad to this commit cb91ccd and seeing how that might improve usage?

That didn’t help unfortunately. We’re still seeing this issue.

Also heard from @cloudhead on matrix that it was sprout misbehaving.

I was also seeing this issue with setzling. And this was after I restarted it and was the only peer connected to it.

geigerzaehler avatar Aug 17 '21 08:08 geigerzaehler

I was also seeing this issue with setzling. And this was after I restarted it and was the only peer connected to it.

And there was no sign of some peer repeatedly making requests?

FintanH avatar Aug 17 '21 08:08 FintanH

 I thought the last thing we talked about was Upstream updating
 librad to this commit cb91ccd and seeing how that might improve
 usage?

That didn’t help unfortunately. We’re still seeing this issue.

Yeah no, the effect of this can only kick in when most peers are running on that version. Only then can we tighten the flow control limits to drop packets from a logorrheic peer on the floor.

 Also heard from @cloudhead on matrix that it was sprout
 misbehaving.

I was also seeing this issue with setzling. And this was after I restarted it and was the only peer connected to it.

So, what are we seeing exactly? That seed nodes are flooding, or that upstream nodes are flooding as soon as they are connected to a seed? Is it perhaps that the seed should check if the project is already tracked here [0], because the call to track is idempotent?
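Such a guard could look roughly like this; `Tracker` and `track_if_new` are hypothetical stand-ins for the seed's tracking store, not actual librad API. Even though the underlying track call is idempotent, checking first skips its side effects (and any follow-on gossip) for already-tracked projects:

```rust
use std::collections::HashSet;

/// Hypothetical tracking store, keyed by (urn, peer_id).
struct Tracker {
    tracked: HashSet<(String, String)>,
}

impl Tracker {
    fn new() -> Self {
        Self { tracked: HashSet::new() }
    }

    fn is_tracked(&self, urn: &str, peer: &str) -> bool {
        self.tracked.contains(&(urn.to_string(), peer.to_string()))
    }

    /// Track the pair only if it is new; returns true if state changed.
    fn track_if_new(&mut self, urn: &str, peer: &str) -> bool {
        if self.is_tracked(urn, peer) {
            return false; // already tracked: skip the (idempotent) call
        }
        self.tracked.insert((urn.to_string(), peer.to_string()))
    }
}
```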

kim avatar Aug 17 '21 08:08 kim

I’m not able to reproduce this issue anymore if I’m the only one connected to a seed node. But I still see this issue if other peers are connected. After adding some logging statements I see the following when I connect to sprout.

Aug 18 16:54:25.173  INFO librad::net::peer::storage: gossip received, has: Payload { urn: Urn { id: Oid(8cc143b5b061d9882726564702017b5f94bb0f1a), path: Some(RefLike("rad/id")) }, rev: Some(Git(f2949476d30de0ee650598ba16aee4a4601157da)), origin: None }, provider: hyde43uaaqz5bnh8kpqfwiwx7inxzdmmtiy1ndtf6zazfrcny3ossr
    at /home/thomas/code/radicle-link/librad/src/net/peer/storage.rs:189
    in librad::net::peer::storage::put with has: Payload { urn: Urn { id: Oid(8cc143b5b061d9882726564702017b5f94bb0f1a), path: Some(RefLike("rad/id")) }, rev: Some(Git(f2949476d30de0ee650598ba16aee4a4601157da)), origin: None }
    in librad::net::protocol::broadcast::apply with remote_id: hynkyndc6w3p8urucakobzna7sxwgcqny7xxtw88dtx3pkf7m3nrzc, Have { origin: GenericPeerInfo { peer_id: hyde43uaaqz5bnh8kpqfwiwx7inxzdmmtiy1ndtf6zazfrcny3ossr, advertised_info: PeerAdvertisement { listen_addrs: Within { inner: [127.0.0.1:57843, 192.168.1.108:57843], min: 0, max: 16 }, capabilities: {} }, seen_addrs: Within { inner: [], min: 0, max: 16 } }, val: Payload { urn: Urn { id: Oid(8cc143b5b061d9882726564702017b5f94bb0f1a), path: Some(RefLike("rad/id")) }, rev: Some(Git(f2949476d30de0ee650598ba16aee4a4601157da)), origin: None } }
    in librad::net::protocol::io::streams::incoming with remote_id: hynkyndc6w3p8urucakobzna7sxwgcqny7xxtw88dtx3pkf7m3nrzc, remote_addr: 35.187.60.160:12345
    in librad::net::protocol::io::discovered with remote_id: hynkyndc6w3p8urucakobzna7sxwgcqny7xxtw88dtx3pkf7m3nrzc
    in librad::net::protocol::accept::disco
    in librad::net::protocol::accept with peer_id: hyb9fda6y8tuwqn73wfnh1qf4m9cau714o5xf14t7wgwbkdhs138ay

This log line is repeated every couple of milliseconds. As you can see, the message arrives from the seed but originates from another node.

Could one way to limit the impact of this be to have the waiting room not react to uninteresting messages?

geigerzaehler avatar Aug 18 '21 15:08 geigerzaehler

Could one way to limit the impact of this be to have the waiting room not react to uninteresting messages?

The thing is that if you've requested an Urn, you're not guaranteed to know which PeerId it'll come from, so that counts as "uninteresting" in the gossip world. However, it's interesting to the waiting room, because it found a (possibly) valid peer to replicate from.

Why do you think this could be the waiting room anyway? :thinking: Maybe I'm missing something you see in the logs there? :eyes:

FintanH avatar Aug 18 '21 15:08 FintanH

Why do you think this could be the waiting room anyway?

I’m not saying the waiting room causes the issue. But it seems odd to me that the messages put so much pressure on the CPU. I was just thinking about limiting the impact of messages that don’t have an effect.

geigerzaehler avatar Aug 18 '21 16:08 geigerzaehler

So, what are we seeing exactly? That seed nodes are flooding, or that upstream nodes are flooding as soon as they are connected to a seed? Is it perhaps that the seed should check if the project is already tracked here [0], because the call to track is idempotent?

If anything this is probably the source of some unwanted traffic and we should definitely fix it.

cloudhead avatar Aug 19 '21 07:08 cloudhead

I must say that this waiting room logic escapes me mostly. It is probably the case that the bandaid of screaming "have" into the forest should just be removed -- unfortunately, #653 is still stalled, because replication v3 is more important right now.

In the spirit of graceful centralisation, I could imagine the following workaround: since most clients (Upstream clients for sure) bootstrap through a "central" seed, we could have some kind of no-forward flag on wants/haves, which simply causes the seed to respond if it has the wanted data, but not forward the message.
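A sketch of what such an optional flag could look like; the struct and field names are illustrative, not librad's actual wire format. Because the field is an `Option` that defaults to `None`, messages from older clients keep today's forwarding behavior, which is what makes the change backwards compatible:

```rust
/// Illustrative gossip payload with the proposed optional flag.
#[derive(Debug, Clone)]
struct Have {
    urn: String,
    rev: Option<String>,
    /// `None` means the sender predates the flag: forward as usual.
    /// `Some(true)` asks the seed to answer but not rebroadcast.
    no_forward: Option<bool>,
}

/// Decision a seed would make before rebroadcasting a message.
fn should_forward(msg: &Have) -> bool {
    !msg.no_forward.unwrap_or(false)
}
```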

kim avatar Aug 19 '21 08:08 kim

That's kind of interesting - is this a soft-fork?

cloudhead avatar Aug 19 '21 09:08 cloudhead

is this a soft-fork?

I'm not sure what you mean by that.

kim avatar Aug 19 '21 10:08 kim

Hehe I mean is it a backwards compatible change?

cloudhead avatar Aug 19 '21 10:08 cloudhead

Yeah sure. It would be an optional field, which should just be ignored by older clients.

kim avatar Aug 19 '21 11:08 kim

Don’t we have some rate limiting in place that should prevent a peer being flooded by messages from another peer?

geigerzaehler avatar Aug 19 '21 13:08 geigerzaehler

Don’t we have some rate limiting in place that should prevent a peer being flooded by messages from another peer?

Yes we do, and I saw in one of the earlier log captures you posted that it kicks in. Might be worth playing with the thresholds.
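For reference, the thresholds in question are essentially the parameters of a token bucket (capacity, i.e. tolerated burst size, and refill rate). A self-contained sketch of that model, not librad's actual limiter:

```rust
use std::time::Instant;

/// Minimal token-bucket rate limiter: allows bursts up to `capacity`
/// messages, then sustains `refill_per_sec` messages per second.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last: Instant,
}

impl TokenBucket {
    fn new(capacity: f64, refill_per_sec: f64) -> Self {
        Self { capacity, tokens: capacity, refill_per_sec, last: Instant::now() }
    }

    /// Returns true if one message may pass now.
    fn allow(&mut self) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last).as_secs_f64();
        self.last = now;
        // Refill proportionally to elapsed time, capped at capacity.
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

Tuning means trading off how large a legitimate burst (e.g. initial sync) must be tolerated against how quickly a flooding peer is cut off.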

kim avatar Aug 19 '21 13:08 kim