nimbus-eth1 icon indicating copy to clipboard operation
nimbus-eth1 copied to clipboard

uTP / Portal stream: Investigate security issues & mitigations in implementation

Open kdeme opened this issue 2 years ago • 2 comments

e.g.:

  • Limiting amount of open connections (total, per incoming/outgoing, per offer/accept or findcontent/content flows)
  • Lingering connections, due to perhaps missing timeouts, or partial timeouts (per part / content item read on the socket)
  • etc.

kdeme avatar May 12 '22 15:05 kdeme

This issue has become a bit more pressing due to some OOMs of our nodes in the fleet.

Visible for example here: https://metrics.status.im/d/iWQQPuPnkadsf/nimbus-fluffy-dashboard?orgId=1&var-instance=metal-01.ih-eu-mda1.nimbus.fluffy&var-container=nimbus-fluffy-mainnet-master-01&from=1693655722341&to=1693773093911 Or here: https://metrics.status.im/d/iWQQPuPnkadsf/nimbus-fluffy-dashboard?orgId=1&var-instance=metal-01.ih-eu-mda1.nimbus.fluffy&var-container=nimbus-fluffy-mainnet-master-02&from=1693499124066&to=1693614110733

Current theory is that this is occurring when a Bridge is injecting a lot of content and a lot of offers (probably outgoing ones) are adding up.

It is know that the current system of AsyncQueues is not exactly properly designed to avoid this to happen.

Some probable issues:

  1. contentQueue in a PortalStream is ~~unlimited. Might want to add a limit on this~~ limited, but will await when limit is reached, and this await occurs after all the content has been read over uTP stream. To avoid a build up there drop incoming sockets when that limit is reached. Or better, avoid accepting the offer in the first place.
  2. The offerQueue in PortalProtocol does have a limit. It is actually set to the same amount as the amount of workers popping off the queue. The possible issue is that several running NeighborhoodGossips end up blocking because of the offerQueue reaching its limit. Especially as for each incoming offer/accept/content cycle, potentially 8 new offers are being done.
  3. Seems like quite some copies of the content items and content keys are being done in NeighborhoodGossip. This is due to the different way of how the stream passes along the data and how they are passed to the offerQueue. This combined with 2. makes it much worse.
  4. This is a bit more of a pure assumption, but it is possible (likely?) that quite a few duplicate offers are being accepted (and possibly gossiped) at the same time. Assuming here that the same offers come from different peers around the same time. We don't avoid accepting this as long as we haven't received and verified the data. This should be verified first if it is really an issue.

kdeme avatar Sep 08 '23 12:09 kdeme

  1. is done in https://github.com/status-im/nimbus-eth1/pull/1753 and appears to work. So this we will already merge.

  2. is done in https://github.com/status-im/nimbus-eth1/pull/1739, but has not been proven to get rid of the memory build up on its own, so closing for now. Might want to add a similar version of this still.

kdeme avatar Sep 16 '23 14:09 kdeme