go-bitswap icon indicating copy to clipboard operation
go-bitswap copied to clipboard

Failure to fetch content from connected nodes

Open vdamle opened this issue 4 years ago • 8 comments

Environment

IPFS Version:

go-ipfs version: 0.9.1-dc2715a
Repo version: 11
System version: amd64/linux
Golang version: go1.15.2

We are running a private IPFS cluster with 2 or more nodes (docker/k8s)

Description

As an example, we've got a network with 2 nodes - node1 and node2. After startup, we observe that node2 cannot fetch content from node1 (a bootstrap peer) because it is not registered as a Partner. We see on node1 that node2 is registered as a Peer/Partner. I tried adding some additonal debug logs and best I could tell, the steps to register a BitSwapNetwork Receiver to receive events from the network was not complete before node2 received a "connected" notification from node1. It seems to me like this is a race condition and we see this occur 90% of the time, when we spin up nodes (ref: https://github.com/ipfs/go-bitswap/blob/0e81f7c1fef795a8d34cdd494fa2d8fdc045d199/bitswap.go#L215 ). As a throw of dice, I tried to move up the call before any options/decision engine is setup, this did not make a whole lot of difference to the outcome.

Workaround

  • Execute ipfs swarm peer disconnect <address>
  • Execute ipfs swarm peer connect <address>

We can see that the PeerConnected handler is invoked from Connected, which adds the Peer to the peer list. After this, WANT_HAVE is sent as expected to the peer and content is retrieved.

PS: Originally, I created the following issue in go-ipfs, but after further investigation, it feels like go-bitswap is the right place to report this.

I'd love to hear any suggestions/thoughts for how this can be addressed since I have a limited view of the architecture/intent for how connections are supposed to work. I have attached logs and output of commands from both nodes.

ipfs-config-node1.txt ipfs-config-node2.txt ipfs-id-node1.txt ipfs-id-node2.txt ipfs-node1.log ipfs-node2.log

vdamle avatar Aug 30 '21 20:08 vdamle

Thank you for submitting your first issue to this repository! A maintainer will be here shortly to triage and review. In the meantime, please double-check that you have provided all the necessary information to make this process easy! Any information that can help save additional round trips is useful! We currently aim to give initial feedback within two business days. If this does not happen, feel free to leave a comment. Please keep an eye on how this issue will be labeled, as labels give an overview of priorities, assignments and additional actions requested by the maintainers:

  • "Priority" labels will show how urgent this is for the team.
  • "Status" labels will show if this is ready to be worked on, blocked, or in progress.
  • "Need" labels will indicate if additional input or analysis is required.

Finally, remember to use https://discuss.ipfs.io if you just need general support.

welcome[bot] avatar Aug 30 '21 20:08 welcome[bot]

We usually handle this by setting up the node as follows:

  1. Instantiate the libp2p node, but don't start listening on addresses.
  2. Instantiate any services that need libp2p.
  3. Finally, start listening.

However, ideally bitswap would add all existing peers on startup. If you want to write a quick patch to do that, I'd be happy to take a look.

Stebalien avatar Sep 09 '21 11:09 Stebalien

@vdamle : is this a patch you're interested in contributing?

BigLep avatar Sep 24 '21 15:09 BigLep

Sorry, I missed the update @BigLep / @Stebalien - thanks for the input!

I'd love to contribute a patch.

@Stebalien : Here's what I'm thinking:

  • By existing peers, I assume you refer to the list of bootstrap peers?

  • Assuming the above, pass the Config.Bootstrap peer list as an Option before the call to bitswap.New() by adding the peer addresses in the bootstrap list to https://github.com/ipfs/go-bitswap/blob/958b1630052082b819a554572e21c1afbf64ce5d/bitswap.go#L299

  • Asynchronously connect to the peer list right before starting the workers https://github.com/ipfs/go-bitswap/blob/958b1630052082b819a554572e21c1afbf64ce5d/bitswap.go#L281

Please let me know if I'm on the right track. In the meantime, I will try to make the above changes and test in my private network deployment.

vdamle avatar Sep 24 '21 20:09 vdamle

So, I tried to implement it and realized this just isn't going to work, or at least not "well" because your libp2p host may have received connections before it's listening on the bitswap protocol.

The correct fix is to:

  1. Wait for "identify" and "protocol changed" events on the libp2p event bus instead of connect events. That way, two peers will only "see" each other once they start listening on the bitswap protocol, not when they first connect.
  2. On start, iterate through all peers and check to see if they speak the bitswap protocol. If so, simulate an event as in step 1.

But this needs to be done carefully to make sure it's somewhat atomic.

Stebalien avatar Sep 26 '21 14:09 Stebalien

@vdamle: do you have a better idea of what a patch would need to look like given @Stebalien 's feedback?

BigLep avatar Oct 01 '21 15:10 BigLep

@vdamle : is this something you'd be to contribute?

2022-01-07 conversation: @aschmahmann ran into this as well in 202112 while working on Testground/integration testing. Not fixing this means tests have to work around this issue.

BigLep avatar Jan 07 '22 16:01 BigLep

@BigLep @Stebalien I took a stab at this in https://github.com/ipfs/go-bitswap/pull/590. Am I on the right track?

peterargue avatar Oct 11 '22 22:10 peterargue

Oops, seems like we needed more information for this issue, please comment with more details or this issue will be closed in 7 days.

github-actions[bot] avatar Oct 18 '22 00:10 github-actions[bot]

what additional information is needed?

peterargue avatar Oct 18 '22 03:10 peterargue

@peterargue : apologies for the confusion. It looks like we need to do some bot adjustments for when the ball is back in the maintainer's court. We'll need maintainers to look at this (next triage is scheduled for 2022-10-20).

BigLep avatar Oct 18 '22 03:10 BigLep

@BigLep no worries and thanks for the update

peterargue avatar Oct 18 '22 22:10 peterargue

@BigLep any idea when the team will have some time to take a look at this PR? FWIW, we've been running with these changes on a live network for the last 4 weeks with great results. All of the correct peers are added on startup and updated as expected.

peterargue avatar Nov 24 '22 23:11 peterargue

Hi @peterargue : I personally haven't been at the most recent triage meetings to give more insight.

@guseggert : has this come up during weekly triage? Do we have any initial feedback here?

BigLep avatar Nov 27 '22 21:11 BigLep