
Lotus sync issue: libp2p 0.31.1 to 0.33.2 regression

Open Stebalien opened this issue 1 year ago • 23 comments

We've seen reports of a chain-sync regression between lotus 1.25 and 1.26. Notably:

  1. We updated go-libp2p from v0.31.1 to v0.33.2.
  2. I've seen reports of peers failing to resume sync after transient network issues.
  3. Users are reporting "low" peer counts.

We're not entirely sure what's going on, but I'm starting an issue here so we can track things.

Stebalien avatar Apr 11 '24 20:04 Stebalien

My first guess, given (2), is https://github.com/libp2p/specs/issues/573#issuecomment-2050421876. This is unconfirmed, but high on my list.

  • [ ] Test: does disabling tcp reuseport fix this?
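
One way to run that test without a code change, assuming the node honors go-libp2p's reuseport environment variable (verify the variable name against your go-libp2p version, and substitute however you actually start the node for `lotus daemon`):

```
# LIBP2P_TCP_REUSEPORT is the env var go-libp2p's TCP transport checks
# before enabling SO_REUSEPORT; setting it to false disables port reuse.
LIBP2P_TCP_REUSEPORT=false lotus daemon
```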

Stebalien avatar Apr 11 '24 20:04 Stebalien

My second guess is https://github.com/libp2p/go-libp2p/issues/2650. This wouldn't be the fault of libp2p, but TLS may be more impacted by the GFW? That seems unlikely...

Stebalien avatar Apr 11 '24 20:04 Stebalien

My third guess is something related to QUIC changes.

Stebalien avatar Apr 11 '24 20:04 Stebalien

Have you been able to repro 2 or 3 locally?

  • For the GFW theory, we could try connecting to peers over both TLS and Noise and seeing if there's a difference.
  • Can you run lotus 1.26 on the older version of go-libp2p and see if you still see any errors?
  • Is the transient network issue something that would affect my connectivity to everyone or only a subset of peers? e.g. is my internet down or is my connection to a subset down?
  • For a typical well behaved node, what's the breakdown in connection types? (TCP+TLS, QUIC, TCP+Noise). For a node seeing this regression, what is its breakdown?
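
To make that last question concrete, here is a sketch of the tally. The `connState` struct and the sample data are stand-ins; on a live node the (transport, security) pairs would come from the per-connection state that go-libp2p reports for each open connection.

```go
package main

import "fmt"

// connState is a stand-in for the per-connection (transport, security)
// state a go-libp2p host can report; the sample data below is made up.
type connState struct {
	Transport string // e.g. "tcp" or "quic-v1"
	Security  string // e.g. "tls" or "noise"; empty for QUIC, which has TLS built in
}

// tally counts connections per transport(+security) combination.
func tally(conns []connState) map[string]int {
	counts := make(map[string]int)
	for _, c := range conns {
		key := c.Transport
		if c.Security != "" {
			key += "+" + c.Security
		}
		counts[key]++
	}
	return counts
}

func main() {
	sample := []connState{
		{Transport: "tcp", Security: "tls"},
		{Transport: "tcp", Security: "noise"},
		{Transport: "quic-v1"},
		{Transport: "quic-v1"},
	}
	fmt.Println(tally(sample)) // map[quic-v1:2 tcp+noise:1 tcp+tls:1]
}
```

Comparing this breakdown between a healthy node and a regressed one would show whether one transport/security combination is disproportionately missing.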

MarcoPolo avatar Apr 15 '24 17:04 MarcoPolo

I can't repro this at the moment, unfortunately (not at home, node down). But I'll do some more digging later this week.

Stebalien avatar Apr 15 '24 17:04 Stebalien

Ok, I got one confirmation that disabling reuseport seems to fix the issue and one report that it makes no difference.

Stebalien avatar Apr 18 '24 16:04 Stebalien

Ok, that confirmation appeared to be a fluke. This doesn't appear to have been the issue.

Stebalien avatar Apr 22 '24 14:04 Stebalien

From eyeballing the commits, I can see that the major changes apart from WebRTC are:

  • upgraded QUIC
  • implemented Happy Eyeballs for TCP
  • removed multistream simultaneous connect

Can we test this with an only QUIC node and an only TCP node to see if it's a problem with QUIC or TCP?

sukunrt avatar Apr 25 '24 08:04 sukunrt

I'll try. Unfortunately, the issue is hard to reproduce and tends to happen in production (hard to get people to run random patches). Right now we're waiting on goroutine dumps hoping to get a bit of an idea about what might be stuck (e.g., may not be libp2p).
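
For anyone collecting those dumps: any Go binary prints all goroutine stacks to stderr on SIGQUIT, and if the node exposes net/http/pprof you can pull a dump over HTTP. The port and path below are illustrative assumptions; check how your lotus build exposes its API/pprof endpoint.

```
# Dump all goroutine stacks to the process's stderr (works for any Go program).
kill -QUIT "$(pidof lotus)"

# Or, if a pprof HTTP endpoint is exposed (port/path are an assumption):
curl 'http://127.0.0.1:1234/debug/pprof/goroutine?debug=2' > goroutines.txt
```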

Stebalien avatar Apr 25 '24 14:04 Stebalien

It might be the silently broken PX -- see https://github.com/libp2p/go-libp2p-pubsub/pull/555

vyzo avatar Apr 25 '24 15:04 vyzo

I am almost certain this is the culprit as the bootstrap really relies on it.

vyzo avatar Apr 25 '24 15:04 vyzo

AH.. that would definitely explain it.

Stebalien avatar Apr 25 '24 15:04 Stebalien

I thought that could be it as well, but I was thrown off by the premise that this wasn't an issue in v0.31.1.

PX broke after this change: https://github.com/libp2p/go-libp2p/pull/2325 which was included in the v0.28.0 release. So v0.31.1 should have the same PX issue.

MarcoPolo avatar Apr 25 '24 17:04 MarcoPolo

I can't imagine what else it could be. Was there a recent "mandatory release" where everyone upgraded to the more recent libp2p?

vyzo avatar Apr 25 '24 17:04 vyzo

Users are reporting "low" peer counts.

Are these low peer counts low peers in your gossipsub mesh or low number of peers we are actually connected to?

MarcoPolo avatar Apr 25 '24 18:04 MarcoPolo

Do we know if these nodes are running both QUIC and TCP? If yes, it's unlikely that the problem is with either transport and is probably at a layer above the go-libp2p transports?

sukunrt avatar Apr 25 '24 18:04 sukunrt

Are these low peer counts low peers in your gossipsub mesh or low number of peers we are actually connected to?

Just chiming in here from the Lotus side: it's the number of peers we are connected to. After upgrading to 0.33.2, the count is around:

lotus info
Network: mainnet
Peers to: [publish messages 105] [publish blocks 106]

On the previous version (0.33.1), it was stable in the 200 range.

rjan90 avatar May 03 '24 11:05 rjan90

Just chiming in here from the Lotus side: it's the number of peers we are connected to. After upgrading to 0.33.2, the count is around:

lotus info
Network: mainnet
Peers to: [publish messages 105] [publish blocks 106]

On the previous version (0.33.1), it was stable in the 200 range.

I think these are the number of peers in your gossipsub topic mesh, which is a subset of the peers you are actually connected to. Could you find the number of peers you are connected to, and compare that between versions?
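
A quick way to get both numbers side by side (the exact commands are an assumption about the lotus CLI; `lotus net peers` lists raw libp2p connections, while the `Peers to:` line in `lotus info` reflects gossipsub mesh sizes):

```
# Raw libp2p connection count (what the connection manager bounds):
lotus net peers | wc -l

# Gossipsub mesh sizes (the "publish messages/blocks" numbers):
lotus info | grep 'Peers to'
```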

MarcoPolo avatar May 03 '24 16:05 MarcoPolo

Did the situation improve after gossip-sub v0.11 and go-libp2p v0.34?

sukunrt avatar Jun 20 '24 16:06 sukunrt

We'll likely need to wait for the network to upgrade (~August) to see results.

Stebalien avatar Jun 22 '24 20:06 Stebalien

I have a user with a large ipfs-cluster (>1000 peers) complaining of issues that are consistent with pubsub propagation failures, and the issue happens in both go-libp2p v0.33.2 + go-libp2p-pubsub v0.10.0 and go-libp2p v0.35.1 + go-libp2p-pubsub v0.11.0. I cannot 100% say it is the same issue as Lotus, but "low peer counts" is a symptom and it is still happening, apparently.

How confident are we that it was fixed?

hsanjuan avatar Jun 25 '24 21:06 hsanjuan

We're confident that we fixed an issue, but there may be others. My initial thought was https://github.com/libp2p/go-libp2p/issues/2764#issuecomment-2050450383, but if that cluster uses QUIC it shouldn't be affected by that.

Stebalien avatar Jun 25 '24 21:06 Stebalien

Did the situation improve after gossip-sub v0.11 and go-libp2p v0.34?

So it has improved since upgrading to these versions, and the peer count is now hovering more stably around 300 with the same machine:

lotus info
Network: mainnet
StartTime: 452h37m58s (started at 2024-06-11 15:28:26 +0200 CEST)
Chain: [sync ok] [basefee 100 aFIL] [epoch 4047852]
Peers to: [publish messages 308] [publish blocks 318]

As Steven notes, the real test will be the network upgrade in August, as that is when most of these issues get surfaced, with people upgrading and reconnecting to the network.

rjan90 avatar Jun 30 '24 10:06 rjan90

We're confident that we fixed an issue, but there may be others. My initial thought was #2764 (comment), but if that cluster uses QUIC it shouldn't be affected by that.

Good news: it seems that the issue I described was a user configuration error in the end (very low limits in connection manager).

hsanjuan avatar Jul 01 '24 12:07 hsanjuan

Now that Filecoin mainnet has upgraded to NV23, and with that a very large percentage of nodes have probably updated to the go-libp2p v0.35.4 release, I'm seeing a significantly larger number of connected peers. It is 5x higher than the number of peers I was connected to with the same machine in May:

lotus info
Network: mainnet
StartTime: 222h51m57s (started at 2024-07-28 10:30:12 +0200 CEST)
Chain: [sync ok] [basefee 100 aFIL] [epoch 4155044]
Peers to: [publish messages 473] [publish blocks 490]

I think we can close this issue now, and open more narrowly scoped issues if we encounter other problems.

rjan90 avatar Aug 06 '24 15:08 rjan90