go-libp2p icon indicating copy to clipboard operation
go-libp2p copied to clipboard

Issues with identify: identify failed to complete

Open vyzo opened this issue 1 year ago • 11 comments

The problem: when I disconnect and reconnect to a peer, the latter finds it impossible to open a stream, with the error indication that identified failed to complete. This happens consistently in our node, and I can trigger it reliably; so there is some bug related to identify.

Relevant logs:

2024-09-27T12:48:24.680Z	DEBUG	basichost	basic/basic_host.go:483	negotiated: /ipfs/id/push/1.0.0 (took 27.52µs)
2024-09-27T12:48:24.680Z	DEBUG	net/identify	identify/id.go:543	/ipfs/id/push/1.0.0 received message from 12D3KooWEuVEZ26PwP7f8mYxuQ97ChiqLvXL5JeDfmXmqN4ds3YT /ip4/45.139.212.29/udp/17846/quic-v1
2024-09-27T12:48:30.475Z	DEBUG	basichost	basic/basic_host.go:483	negotiated: actor/root/messages/0.0.1 (took 48.66µs)
2024-09-27T12:48:30.479Z	DEBUG	actor	actor/dispatch.go:205	dispatching message from {BAAREICLTJJ63SCP5MUFVGX7GUM42QL2I5E46I355CASAEDIL55VX4SHTA====== did:key:z6MkjYP7Q8m862LdqNtExNMcH1BzgAkYTVp6QC5gHesktKbu {12D3KooWEuVEZ26PwP7f8mYxuQ97ChiqLvXL5JeDfmXmqN4ds3YT root}} to /public/hello
2024-09-27T12:48:30.479Z	DEBUG	node	node/public.go:59	hello from 12D3KooWEuVEZ26PwP7f8mYxuQ97ChiqLvXL5JeDfmXmqN4ds3YT
2024-09-27T12:48:30.479Z	DEBUG	swarm2	swarm/swarm.go:475	[12D3KooWFcvZkrs1LTQCdGEVPYi7diCL76Z9f1LcgcMMspujKWTj] opening stream to peer [12D3KooWEuVEZ26PwP7f8mYxuQ97ChiqLvXL5JeDfmXmqN4ds3YT]
2024-09-27T12:48:33.450Z	WARN	libp2p	libp2p/libp2p.go:380	send: failed to open stream to peer 12D3KooWEuVEZ26PwP7f8mYxuQ97ChiqLvXL5JeDfmXmqN4ds3YT: identify failed to complete: context deadline exceeded
Version Information
	github.com/libp2p/go-libp2p v0.36.4

vyzo avatar Sep 27 '24 12:09 vyzo

I have worked around it by manually making streams and skipping the identify wait, but there certainly seem to be a bug in identify here.

vyzo avatar Sep 27 '24 13:09 vyzo

Is there any more info you can provide? A trivial connect -> disconnect -> connect test doesn't reproduce this for me.

when I disconnect and reconnect to a peer, the latter finds it impossible to open a stream

Do you mean when you open a stream to the peer after reconnecting, the error is identify failed to complete?

I see DEBUG log about NewStream failing with id failed to complete Do you get any DEBUG logs from the identify package which are relevant to the peer?

sukunrt avatar Oct 07 '24 13:10 sukunrt

I am trying to understand the problem better, hipefully i can get you a good log package to diagnose this.

But yes, we fail to open streams because identify fails to complete.

vyzo avatar Oct 07 '24 13:10 vyzo

We observed similar behavior when event bus subscriptions were not read fast enough on our side. A client connects and initiates identify; a server processes a new connection in the swarm and blocks never reaching start and thus not processing identify streams. JFYI

Wondertan avatar Oct 07 '24 14:10 Wondertan

when I disconnect and reconnect to a peer, the latter finds it impossible to open a stream

I'm not sure I parse this sentence correctly. I'm understanding it as meaning:

  1. Peer A is connected to peer B.
  2. Peer A disconnects to peer B; then reconnects to peer B.
  3. Peer B attempts to open a stream to peer A.
  4. Peer B fails to open a stream.

Is that understanding correct? Or is it instead:

  1. Peer A is connected to peer B.
  2. Peer A disconnects to peer B; then reconnects to peer B.
  3. Peer A attempts to open a stream to peer B.
  4. Peer A fails to open a stream.

The issue kind of sounds like we aren't picking the best conn, and then we try to identify on it.

MarcoPolo avatar Oct 09 '24 00:10 MarcoPolo

It is the first scenario.

vyzo avatar Oct 09 '24 06:10 vyzo

ah okay. That makes sense. The issue is probably that Peer B doesn't realize that the "best connection" it picked is actually closed/disconnected. So it times out on waiting for that connection to Identify (it never will).

We can be smarter here and interrupt with an even better connection if a new one appears.

MarcoPolo avatar Oct 09 '24 16:10 MarcoPolo

Maybe we should have a collective identify completion channel for the peer, and not one for each conn.

vyzo avatar Oct 09 '24 16:10 vyzo

This relates to #2355 and attaching protocol information to connections instead of peers.

Stebalien avatar Oct 18 '24 23:10 Stebalien

fyi, I have a branch I'm working on that should solve this issue and improve the best connection logic. The basic idea is to create a small new service that subscribes to Identify events and fulfills request to return the best connection for a peer that supports a given protocol and other criteria (e.g. is it a limited connection? Prefer a connection with existing streams). I'll get it pushed soon after the next go-libp2p release.

MarcoPolo avatar Oct 21 '24 18:10 MarcoPolo

These comments together:

The problem: when I disconnect and reconnect to a peer, the latter finds it impossible to open a stream, with the error indication that identified failed to complete. This happens consistently in our node, and I can trigger it reliably; so there is some bug related to identify.

and

I have worked around it by manually making streams and skipping the identify wait

Makes me rule out that this is only an issue with a dropped connection, like what I mentioned. Since in that case the workaround of manually making streams should also not work.

My current theory is that this is related to the eventbus being stalled as @Wondertan points out. It would be good to revisit https://github.com/libp2p/go-libp2p/issues/2361. The main argument against it was "We already have metrics, we don't need metrics AND logs." Looking at this again now, I still think having logs in addition would be nice, since setuping grafana and prometheus is non-trivial (and a big ask of our users), and issues like these should be easier to debug.

I'll make a PR to add this logging. I'll tag vyzo to try it and see if that is indeed their issue.

MarcoPolo avatar Nov 07 '24 21:11 MarcoPolo