
Peer count slowly decreases to 0

mask-pp opened this issue 1 year ago • 9 comments

Description

Please provide a brief description of the issue.

The peer count of the beacon_node will slowly decrease to 0 on the Holesky network.

Please provide your Lighthouse and Rust version. Are you building from stable or unstable, which commit?

sigp/lighthouse:v5.3.0

Describe the present behaviour of the application, with regards to this issue.

Issue behavior: once the peer count drops below a certain threshold (about 96), the count slowly decreases with no chance of increasing again.

geth(v1.13.15) cmd:

```
geth --holesky \
  --datadir /data/holesky-node-full \
  --metrics \
  --metrics.addr "0.0.0.0" \
  --http \
  --http.addr "0.0.0.0" \
  --http.vhosts "*" \
  --http.corsdomain "*" \
  --http.api eth,net,web3,txpool \
  --ws \
  --ws.addr "0.0.0.0" \
  --ws.origins "*" \
  --ws.api eth,net,web3,txpool \
  --authrpc.addr "0.0.0.0" \
  --authrpc.vhosts "*" \
  --authrpc.jwtsecret /etc/jwt/secret.hex \
  --nat extip:$EXT_IP \
  --allow-insecure-unlock \
  --v5disc
```

beacon(v5.3.0) cmd:

```
lighthouse beacon \
  --network holesky \
  --datadir /data/lighthouse \
  --execution-endpoint http://localhost:8551 \
  --execution-jwt /etc/jwt/secret.hex \
  --checkpoint-sync-url https://checkpoint-sync.holesky.ethpandaops.io \
  --disable-deposit-contract-sync \
  --http-address 0.0.0.0 \
  --http
```

How should the application behave?

The normal peer count on the Holesky network is about 100, and the node should automatically recover its peer count when it drops too low.

Please describe the steps required to resolve this issue, if known.

mask-pp avatar Sep 11 '24 01:09 mask-pp

I think we are going to need some logs to diagnose this.

The described behaviour is similar to a node that loses an internet connection.

Debug logs can be found in the /data/lighthouse/holesky/beacon_node/logs directory. Pasting those here, or DM'ing me on discord (@AgeManning) will help us figure out the issue.

AgeManning avatar Sep 11 '24 09:09 AgeManning

Thanks! Since the debug log is too large, I need to extract the useful information and then send it to you.

mask-pp avatar Sep 11 '24 23:09 mask-pp

@mask-pp The logs compress well. It's best if you can compress them and send the whole file, as it is all potentially relevant.
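
For example, a minimal sketch assuming the debug log directory mentioned above:

```
tar -czf lighthouse-logs.tar.gz /data/lighthouse/holesky/beacon_node/logs
```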

Something else you could check would be your time sync. Make sure you've got NTP running and that `sudo timedatectl status` shows you're synced. You could also try Chrony.
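
A minimal check, assuming systemd-timesyncd or chrony is in use:

```
sudo timedatectl status   # look for "System clock synchronized: yes"
chronyc tracking          # if using chrony
```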

michaelsproul avatar Sep 11 '24 23:09 michaelsproul

@AgeManning Hi friend, I have sent you the log-fetching command privately. Hope these logs are helpful in solving the issue.

mask-pp avatar Sep 12 '24 06:09 mask-pp

Are you on VPS or running Lighthouse locally?

chong-he avatar Sep 12 '24 07:09 chong-he

Running in k8s

mask-pp avatar Sep 12 '24 07:09 mask-pp

Linking a similar issue here: https://github.com/sigp/lighthouse/issues/5271

chong-he avatar Sep 12 '24 08:09 chong-he

I have been through these logs.

The logs show "Socket Updated" (you can grep through the logs for this).

This log indicates that discovery is changing its contactable IP/port based on what other nodes see as the source address in the packets they receive. It starts out with a port of 9000 (which is usually correct); when it changes to some other random port, Lighthouse can no longer discover peers, because other nodes will not respond if the ENR has invalid settings.
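
For example, a quick way to check this yourself (a sketch, assuming the default beacon.log file in the log directory mentioned above):

```
grep "Socket Updated" /data/lighthouse/holesky/beacon_node/logs/beacon.log
```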

It typically means that the router or gateway is sending traffic to other peers on ports other than 9000. This could be because of a symmetric NAT, for example. On home routers this usually means the ports are not forwarded correctly. Setting up a UDP port forward should make the router move traffic in and out through the same external port; if it uses other random ports, the ENR can be updated incorrectly.

I've seen this happen a few times, and there are some changes to discovery we can make that might improve this situation. I'll make some PRs.

The immediate solution is to verify why traffic is being sent out on different random ports and to double check the NAT configuration for the UDP discovery traffic.
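
One possible way to pin things down (a sketch, assuming the external IP and forwarded UDP port are known) is to fix the ENR values and stop discovery from overwriting them:

```
# relevant flags only; combine with the original beacon command above
# $EXT_IP is the node's public IP (assumption: UDP 9000 is forwarded to this host)
lighthouse beacon --network holesky \
  --port 9000 \
  --enr-address "$EXT_IP" \
  --enr-udp-port 9000 \
  --disable-enr-auto-update
```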

AgeManning avatar Sep 15 '24 23:09 AgeManning

For reference, I'm suggesting the following discv5 change: https://github.com/sigp/discv5/pull/265

This will resolve the issue once we make a discv5 release and pull it into Lighthouse. It should also resolve a bunch of other related issues.

There is a downside, however. This change will allow Lighthouse nodes to find and maintain peers, but a misconfigured NAT will be harder to identify, because Lighthouse will (partially) work. The result will be that inbound peers will not join, because the ENR will not be advertised.

We have a metric (in Grafana, in the network dashboard) that tells you whether the NAT is correctly open. The HTTP API endpoint lighthouse/nat should also indicate whether the NAT is configured correctly.
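
For example, a minimal sketch assuming the default HTTP API port of 5052:

```
curl -s http://localhost:5052/lighthouse/nat
```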

AgeManning avatar Sep 19 '24 01:09 AgeManning

Hi, I was having this exact problem for almost a year with Prysm and Geth on my Windows 10 machine, after it had worked perfectly for 3 years since genesis. Eventually Geth could not sync anymore, so I switched to Nethermind, which works perfectly. But I was still dropping peers to 0 with Prysm. I finally had enough and switched to Lighthouse, and it's been great, although my attestation performance only hovers between 50% and 90%. My peer count does not seem to go above 18 no matter what I try, and it seems to average around 10. The good thing is it at least maintains peers, and when it does (occasionally) drop to 0, it eventually recovers by itself.

I'm confused about what this "downside" is and whether I'm having that problem / misconfiguration with my setup.

thanks

defeedme avatar Mar 05 '25 08:03 defeedme

Windows is a tricky platform to run a node on. As a first step, did you set up your time sync correctly? See: https://ethdocker.com/Support/Windows/
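
A minimal check on Windows, assuming the built-in Windows Time service (the second command needs an administrator prompt):

```
w32tm /query /status
w32tm /resync
```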

chong-he avatar Mar 05 '25 10:03 chong-he

Yes, that's the first thing I did, and I use net time.

defeedme avatar Mar 05 '25 11:03 defeedme

It is worth having a read of this blog post, which might help with NAT configuration (usually a primary cause of low peer counts): https://blog.sigmaprime.io/lighthouse-nat.html
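
As a quick local sanity check (a sketch, assuming the default discovery port of 9000), confirm the node is actually listening on the UDP port before digging into the router configuration:

```
# Linux:
ss -uln | grep 9000

# Windows (cmd):
netstat -an | findstr 9000
```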

AgeManning avatar Mar 05 '25 22:03 AgeManning

Closing as resolved/stale. Feel free to reopen if you're still having issues.

michaelsproul avatar Jul 29 '25 23:07 michaelsproul