
How to set limits properly

Open b00f opened this issue 2 years ago • 4 comments

Regarding the resource manager limits, we have defined the limits as follows:

maxConns := conf.MaxConns // default is 16
minConns := conf.MinConns // default is 8
limit := lp2prcmgr.DefaultLimits // lp2prcmgr: go-libp2p's resource-manager package

// logScale is a helper defined in our codebase (see the link below).
limit.SystemBaseLimit.ConnsInbound = logScale(maxConns)
limit.SystemBaseLimit.Conns = logScale(2 * maxConns)
limit.SystemBaseLimit.StreamsInbound = logScale(maxConns)
limit.SystemBaseLimit.Streams = logScale(2 * maxConns)

limit.ServiceLimitIncrease.ConnsInbound = logScale(minConns)
limit.ServiceLimitIncrease.Conns = logScale(2 * minConns)
limit.ServiceLimitIncrease.StreamsInbound = logScale(minConns)
limit.ServiceLimitIncrease.Streams = logScale(2 * minConns)

limit.TransientBaseLimit.ConnsInbound = logScale(maxConns / 2)
limit.TransientBaseLimit.Conns = logScale(2 * maxConns / 2)
limit.TransientBaseLimit.StreamsInbound = logScale(maxConns / 2)
limit.TransientBaseLimit.Streams = logScale(2 * maxConns / 2)

limit.TransientLimitIncrease.ConnsInbound = logScale(minConns / 2)
limit.TransientLimitIncrease.Conns = logScale(2 * minConns / 2)
limit.TransientLimitIncrease.StreamsInbound = logScale(minConns / 2)
limit.TransientLimitIncrease.Streams = logScale(2 * minConns / 2)

By default, the minimum is set at 8 and the maximum at 16 connections. This means each node only needs to maintain connections with 8 to 16 other nodes.
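
For reference, this is roughly how the scaled limits get wired into the host. It is a simplified sketch rather than our exact code (the full implementation is linked below), and the function name is illustrative:

package main

import (
	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/host"
	lp2prcmgr "github.com/libp2p/go-libp2p/p2p/host/resource-manager"
)

func newHost(limit lp2prcmgr.ScalingLimitConfig) (host.Host, error) {
	// AutoScale adjusts the partial limits to this machine's memory and
	// file-descriptor budget; NewFixedLimiter then freezes the result.
	rm, err := lp2prcmgr.NewResourceManager(lp2prcmgr.NewFixedLimiter(limit.AutoScale()))
	if err != nil {
		return nil, err
	}
	return libp2p.New(libp2p.ResourceManager(rm))
}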

Consider users who are running the node on their personal computers; some also run the node on a VPS with only 1 or 2 cores and 2 GB of RAM.

So far, the syncing process and networking work smoothly, especially for the consensus messages; we don't have any issues there. We have more than 400 computers in our network. However, there are some strange logs in our system that I want to discuss with you:

  1. Failed to open stream: this makes us worried
stream-xxxx: transient: cannot reserve stream: resource limit exceeded
  2. Failed to identify: we see many failed protocol negotiations; once one fails, we close the connection to the node.
INFO net/identify identify/id.go:427 failed to negotiate identify protocol with peer {"peer": "12D3Koo...", "error": "Application error 0x0 (local)"}
WARN net/identify identify/id.go:399 failed to identify 12D3Koo...: Application error 0x0 (local)
  3. Information from the connection manager: probably not important, but worth mentioning here
INFO connmgr connmgr/connmgr.go:490 open connection count above limit, but too many are in the grace period

Implementation is available here: https://github.com/pactus-project/pactus/tree/main/network

Thanks in advance for your help


This image was reported by one of the community members: [image]

b00f avatar Nov 02 '23 16:11 b00f

We enabled metrics to monitor libp2p better. You can see our monitoring dashboard here.

amirvalhalla avatar Nov 03 '23 17:11 amirvalhalla

If you create your own DHT network, there is no need to change the limits, and everything will work faster. It's very easy to do.
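
For example, a rough sketch using go-libp2p-kad-dht (the "/myapp" prefix is illustrative):

package main

import (
	"context"

	dht "github.com/libp2p/go-libp2p-kad-dht"
	"github.com/libp2p/go-libp2p/core/host"
)

func newAppDHT(ctx context.Context, h host.Host) (*dht.IpfsDHT, error) {
	// A custom protocol prefix keeps nodes on their own DHT instead of the
	// public one; only peers speaking the same prefix answer queries.
	return dht.New(ctx, h, dht.ProtocolPrefix("/myapp"))
}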

master255 avatar Nov 05 '23 23:11 master255

Failed to open stream: this makes us worried

stream-xxxx: transient: cannot reserve stream: resource limit exceeded

This happens for new streams that are pending protocol negotiation via multistream. These are the limits set by the limit.TransientBaseLimit.* config values. In your case, if you have more than 8 streams pending multistream negotiation, you'll trigger this issue. It will help to enable metrics to debug why this is happening.
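
On recent go-libp2p versions, the resource manager ships its own Prometheus metrics and a stats trace reporter; a rough sketch of wiring them up (the function name is illustrative):

package main

import (
	libp2pnetwork "github.com/libp2p/go-libp2p/core/network"
	rcmgr "github.com/libp2p/go-libp2p/p2p/host/resource-manager"
	"github.com/prometheus/client_golang/prometheus"
)

func newResourceManagerWithMetrics(limiter rcmgr.Limiter) (libp2pnetwork.ResourceManager, error) {
	// Register the resource manager's Prometheus collectors.
	rcmgr.MustRegisterWith(prometheus.DefaultRegisterer)

	// The stats trace reporter feeds scope events (including blocked
	// reservations like the one in the log above) into those metrics.
	str, err := rcmgr.NewStatsTraceReporter()
	if err != nil {
		return nil, err
	}
	return rcmgr.NewResourceManager(limiter, rcmgr.WithTraceReporter(str))
}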

INFO connmgr connmgr/connmgr.go:490 open connection count above limit, but too many are in the grace period

The total connection count is over connmgr.LowWaterMark, but the connection manager cannot trim the new connections because they have been around for less than the grace period. This is not a problem in itself, especially if your resource manager limits are set correctly.
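
For reference, the watermarks and the grace period come from the connection manager options; a minimal sketch with illustrative numbers:

package main

import (
	"time"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/p2p/net/connmgr"
)

func connManagerOption() (libp2p.Option, error) {
	// Trim back towards 8 connections once we exceed 16, but never touch
	// connections younger than the grace period; hence the log line above.
	cm, err := connmgr.NewConnManager(8, 16, connmgr.WithGracePeriod(time.Minute))
	if err != nil {
		return nil, err
	}
	return libp2p.ConnectionManager(cm), nil
}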

INFO net/identify identify/id.go:427 failed to negotiate identify protocol with peer {"peer": "12D3Koo...", "error": "Application error 0x0 (local)"}
WARN net/identify identify/id.go:399 failed to identify 12D3Koo...: Application error 0x0 (local)

This happens when the peer closes the connection before you can run identify on it. Again, I'm not sure why this happens without some understanding of your code; it will help to enable metrics to at least understand why. One theory is that one side drops the new stream because it has exceeded its transient limits, while the other side was about to negotiate identify on that stream, so it prints this log line.

If it's possible, can you run the nodes with debug logs and metrics enabled?

sukunrt avatar Nov 14 '23 09:11 sukunrt

@sukunrt,

Thank you for your invaluable assistance in addressing the issues in this thread.

We have implemented a connection gater that prevents new connections from being opened once the limit is reached, and this has significantly reduced the number of connection-related errors reported by users. However, we are still encountering some issues, particularly with connections to peers where protocol negotiation never completes.
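
The gater is roughly this shape (a simplified sketch, not our exact code; the bookkeeping that keeps the counter up to date is omitted):

package main

import (
	"sync/atomic"

	"github.com/libp2p/go-libp2p/core/connmgr"
	"github.com/libp2p/go-libp2p/core/control"
	"github.com/libp2p/go-libp2p/core/network"
	"github.com/libp2p/go-libp2p/core/peer"
	ma "github.com/multiformats/go-multiaddr"
)

// limitGater refuses new dials and inbound connections once the live
// connection count reaches maxConns; keeping `current` up to date (e.g.
// from a network.Notifiee) is omitted here.
type limitGater struct {
	maxConns int64
	current  atomic.Int64
}

var _ connmgr.ConnectionGater = (*limitGater)(nil)

func (g *limitGater) InterceptPeerDial(peer.ID) bool { return g.current.Load() < g.maxConns }

func (g *limitGater) InterceptAddrDial(peer.ID, ma.Multiaddr) bool { return true }

func (g *limitGater) InterceptAccept(network.ConnMultiaddrs) bool {
	return g.current.Load() < g.maxConns
}

func (g *limitGater) InterceptSecured(network.Direction, peer.ID, network.ConnMultiaddrs) bool {
	return true
}

func (g *limitGater) InterceptUpgraded(network.Conn) (bool, control.DisconnectReason) {
	return true, 0
}

It is installed on the host with libp2p.ConnectionGater(&limitGater{maxConns: 16}).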

Learning the supported protocols is very important for us, as it allows each node to handshake with its neighbors (using streams) in order to start the syncing process. The number of connections without a supported-protocols exchange is significant and appears to be abnormal.

Do you have any idea what leads nodes to open connections without exchanging supported protocols? This information would be really helpful in our efforts to resolve these issues.
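
For context, this is roughly how we learn a peer's supported protocols, by waiting for identify to complete (a simplified sketch, not our exact code):

package main

import (
	"github.com/libp2p/go-libp2p/core/event"
	"github.com/libp2p/go-libp2p/core/host"
)

func watchIdentify(h host.Host) error {
	// This event fires once identify has completed and the peer's supported
	// protocols are in the peerstore; peers that never reach this point are
	// the ones we are asking about.
	sub, err := h.EventBus().Subscribe(new(event.EvtPeerIdentificationCompleted))
	if err != nil {
		return err
	}
	go func() {
		defer sub.Close()
		for e := range sub.Out() {
			evt := e.(event.EvtPeerIdentificationCompleted)
			protos, _ := h.Peerstore().GetProtocols(evt.Peer)
			_ = protos // decide here whether to open the sync handshake stream
		}
	}()
	return nil
}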

By the way, I believe @amirvalhalla has sent a link to a specific metric here.

b00f avatar Nov 16 '23 13:11 b00f