nimbus-eth1
uTP / Portal stream: Investigate security issues & mitigations in implementation
e.g.:
- Limiting amount of open connections (total, per incoming/outgoing, per offer/accept or findcontent/content flows)
- Lingering connections, e.g. due to missing or only partial timeouts (per part / content item read on the socket)
- etc.
This issue has become a bit more pressing due to some OOMs of our nodes in the fleet.
Visible for example here:
- https://metrics.status.im/d/iWQQPuPnkadsf/nimbus-fluffy-dashboard?orgId=1&var-instance=metal-01.ih-eu-mda1.nimbus.fluffy&var-container=nimbus-fluffy-mainnet-master-01&from=1693655722341&to=1693773093911
- https://metrics.status.im/d/iWQQPuPnkadsf/nimbus-fluffy-dashboard?orgId=1&var-instance=metal-01.ih-eu-mda1.nimbus.fluffy&var-container=nimbus-fluffy-mainnet-master-02&from=1693499124066&to=1693614110733
The current theory is that this occurs when a bridge is injecting a lot of content and a lot of offers (probably outgoing ones) are piling up. It is known that the current system of AsyncQueues is not designed to prevent this from happening.
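As a rough sketch of why awaiting on a full queue can build up memory (illustrative Python asyncio, not the actual Nim/chronos code; the `producer` helper and queue size are made up): every producer that blocks inside `put()` keeps its already-read payload alive, so memory grows with the number of blocked producers rather than with the queue limit.

```python
import asyncio

async def demo() -> int:
    # A bounded queue, loosely analogous to an AsyncQueue in the
    # Portal stream (queue size 1 here for illustration).
    queue: asyncio.Queue = asyncio.Queue(maxsize=1)

    async def producer(payload: bytes) -> None:
        # The payload is fully allocated ("read over uTP") *before*
        # we try to enqueue it; awaiting put() on a full queue keeps
        # it pinned in memory for as long as we are suspended.
        await queue.put(payload)

    # Nobody consumes the queue, so all but one producer block inside
    # put(), each holding on to its payload.
    tasks = [asyncio.create_task(producer(b"x" * 1024)) for _ in range(100)]
    await asyncio.sleep(0)  # let every task run up to the blocking put()
    blocked = sum(1 for t in tasks if not t.done())
    for t in tasks:
        t.cancel()
    return blocked

blocked = asyncio.run(demo())
print(blocked)  # 99: all producers except the first are stuck in put()
```

Rejecting (or not accepting) work up front, instead of awaiting, is what bounds memory here.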
Some probable issues:

1. `contentQueue` in a `PortalStream` is ~~unlimited. Might want to add a limit on this~~ limited, but it will await when the limit is reached, and this await occurs only after all the content has been read over the uTP stream. To avoid a build-up there, drop incoming sockets when that limit is reached. Or better, avoid accepting the offer in the first place.
2. The `offerQueue` in `PortalProtocol` does have a limit. It is actually set to the same amount as the number of workers popping off the queue. The possible issue is that several running `NeighborhoodGossip`s end up blocking because the `offerQueue` reaches its limit, especially as each incoming offer/accept/content cycle potentially results in 8 new offers.
3. Quite some copies of the content items and content keys seem to be made in `NeighborhoodGossip`, due to the different ways the stream passes along the data and how the data is handed to the `offerQueue`. Combined with 2. this makes things much worse.
4. This is more of a pure assumption, but it is possible (likely?) that quite a few duplicate offers are being accepted (and possibly gossiped) at the same time, assuming the same offers arrive from different peers around the same time: we don't avoid accepting them as long as we haven't received and verified the data. Whether this is really an issue should be verified first.
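One possible shape of the "avoid accepting the offer in the first place" idea (a sketch only; `try_accept_offer` is a hypothetical helper, not the fluffy API): check queue capacity before any uTP read happens and refuse the offer when full, rather than reading the full content and then blocking on `put`.

```python
import asyncio

def try_accept_offer(queue: asyncio.Queue, content_key: bytes) -> bool:
    # Hypothetical helper: decide *before* reading anything over uTP
    # whether there is room for the result. If the content queue is
    # already full, refuse the offer (or drop the incoming socket)
    # instead of reading the payload and then blocking on put().
    if queue.full():
        return False
    queue.put_nowait(content_key)  # reserve the slot right away
    return True

content_queue: asyncio.Queue = asyncio.Queue(maxsize=2)
accepted = [try_accept_offer(content_queue, bytes([i])) for i in range(4)]
print(accepted)  # [True, True, False, False]
```

With this shape, backpressure is signalled to the offering peer up front and no memory is spent on content we cannot queue anyway.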
Status of the points above:

1. Is done in https://github.com/status-im/nimbus-eth1/pull/1753 and appears to work, so we will already merge this.
2. Is done in https://github.com/status-im/nimbus-eth1/pull/1739, but it has not been proven to get rid of the memory build-up on its own, so closing that one for now. We might still want to add a similar version of it.
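If the duplicate-offer assumption above turns out to be real, one mitigation sketch (hypothetical names, illustrative only, not existing fluffy code) is to track content keys that are in flight but not yet validated, and refuse duplicate offers for them:

```python
# Hypothetical in-flight tracking: refuse a second offer for a content
# key that is already being downloaded but has not been validated yet.
in_flight: set = set()

def should_accept(content_key: bytes) -> bool:
    if content_key in in_flight:
        return False  # duplicate offer for data already being fetched
    in_flight.add(content_key)
    return True

def offer_done(content_key: bytes) -> None:
    # Call this on validation success *and* failure, so a failed
    # transfer cannot block the key forever.
    in_flight.discard(content_key)

first = should_accept(b"\x01")   # new key, accept
second = should_accept(b"\x01")  # duplicate while in flight, refuse
offer_done(b"\x01")
third = should_accept(b"\x01")   # accepted again after completion
print(first, second, third)  # True False True
```

Whether the bookkeeping is worth it depends on how often the same offer actually arrives from multiple peers in the same window, which is exactly what should be measured first.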