substrate icon indicating copy to clipboard operation
substrate copied to clipboard

Peering mismatch

Open tomaka opened this issue 3 years ago • 3 comments

When connecting a local Substrate node to a bootnode, the local Substrate node thinks that it has successfully peered with the bootnode, while in reality the bootnode has kicked out the Substrate node.

This has been investigated on Westend, but I've originally seen it happen on Polkadot.

The bootnode logs:

image

Here is the local Substrate node logs:

2022-10-06 13:49:56.411 TRACE tokio-runtime-worker sub-libp2p: PSM => Connect(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, SetId(0)): Starting to connect    
2022-10-06 13:49:56.411 TRACE tokio-runtime-worker sub-libp2p: Libp2p <= Dial 12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K    
2022-10-06 13:49:56.413 TRACE tokio-runtime-worker sub-libp2p: Addresses of PeerId("12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K"): ["/dns/3.westend.paritytech.net/tcp/30333", "/dns/3.westend.paritytech.net/tcp/30334/ws", "/dns/3.westend.paritytech.net/tcp/30333", "/dns/3.westend.paritytech.net/tcp/30334/ws", "/dns/3.westend.paritytech.net/tcp/30333", "/dns/3.westend.paritytech.net/tcp/30334/ws"]    
2022-10-06 13:49:56.413 TRACE tokio-runtime-worker sub-libp2p: Libp2p => Dialing(PeerId("12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K"))    
2022-10-06 13:49:56.414 TRACE tokio-runtime-worker sub-libp2p: Libp2p => Dial failure for PeerId("12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K")    
2022-10-06 13:49:56.414 TRACE tokio-runtime-worker sub-libp2p: PSM <= Dropped(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, SetId(0))    
2022-10-06 13:49:56.414 TRACE tokio-runtime-worker sub-libp2p: PSM => Connect(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, SetId(0)): Will start to connect at until Instant { tv_sec: 7908, tv_nsec: 768666620 }    
2022-10-06 13:49:56.576 TRACE tokio-runtime-worker sub-libp2p: Libp2p => Connected(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, SetId(0), Dialer { address: "/dns/3.westend.paritytech.net/tcp/30333/p2p/12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K" }): Connection was requested by PSM.    
2022-10-06 13:49:56.576 TRACE tokio-runtime-worker sub-libp2p: Handler(PeerId("12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K"), ConnectionId(3)) <= Open(SetId(0))    
2022-10-06 13:49:56.576 TRACE tokio-runtime-worker sub-libp2p: Libp2p => Connected(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, SetId(1), Dialer { address: "/dns/3.westend.paritytech.net/tcp/30333/p2p/12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K" }, ConnectionId(3)): Not requested by PSM, disabling.    
2022-10-06 13:49:56.576 TRACE tokio-runtime-worker sub-libp2p: Libp2p => Connected(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, SetId(2), Dialer { address: "/dns/3.westend.paritytech.net/tcp/30333/p2p/12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K" }, ConnectionId(3)): Not requested by PSM, disabling.    
2022-10-06 13:49:56.576 TRACE tokio-runtime-worker sub-libp2p: Libp2p => Connected(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, SetId(3), Dialer { address: "/dns/3.westend.paritytech.net/tcp/30333/p2p/12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K" }, ConnectionId(3)): Not requested by PSM, disabling.    
2022-10-06 13:49:56.576 TRACE tokio-runtime-worker sub-libp2p: Libp2p => Connected(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, SetId(4), Dialer { address: "/dns/3.westend.paritytech.net/tcp/30333/p2p/12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K" }, ConnectionId(3)): Not requested by PSM, disabling.    
2022-10-06 13:49:56.576 DEBUG tokio-runtime-worker sub-libp2p: Libp2p => Connected(PeerId("12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K")) with errors: []    
2022-10-06 13:49:56.621 TRACE tokio-runtime-worker sub-libp2p: Ping time with PeerId("12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K"): 45.006854ms    
2022-10-06 13:49:56.665 TRACE tokio-runtime-worker sub-libp2p: Handler(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, ConnectionId(3)) => OpenResultOk(SetId(0))    
2022-10-06 13:49:56.665 TRACE tokio-runtime-worker sub-libp2p: External API <= Open(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, SetId(0))    
2022-10-06 13:49:56.665 DEBUG tokio-runtime-worker sync: New peer with best hash 0x861c…bc6c (12771923).    
2022-10-06 13:49:56.665 DEBUG tokio-runtime-worker sync: Connected 12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K    
2022-10-06 13:49:56.665 TRACE tokio-runtime-worker sub-libp2p: PSM => Connect(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, SetId(1)): Enabling connections.    
2022-10-06 13:49:56.665 TRACE tokio-runtime-worker sub-libp2p: Handler(PeerId("12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K"), ConnectionId(3)) <= Open(SetId(1))    
2022-10-06 13:49:56.665 TRACE tokio-runtime-worker sub-libp2p: PSM => Connect(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, SetId(2)): Enabling connections.    
2022-10-06 13:49:56.665 TRACE tokio-runtime-worker sub-libp2p: Handler(PeerId("12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K"), ConnectionId(3)) <= Open(SetId(2))    
2022-10-06 13:49:56.716 TRACE tokio-runtime-worker sub-libp2p: Handler(PeerId("12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K"), ConnectionId(3)) => OpenResultErr(SetId(1))    
2022-10-06 13:49:56.716 TRACE tokio-runtime-worker sub-libp2p: PSM <= Dropped(PeerId("12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K"))    
2022-10-06 13:49:56.716 TRACE tokio-runtime-worker sub-libp2p: PSM => Connect(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, SetId(1)): But peer is backed-off until Instant { tv_sec: 7912, tv_nsec: 70919904 }    
2022-10-06 13:49:56.760 TRACE tokio-runtime-worker sub-libp2p: Handler(PeerId("12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K"), ConnectionId(3)) => OpenResultErr(SetId(2))    
2022-10-06 13:49:56.760 TRACE tokio-runtime-worker sub-libp2p: PSM <= Dropped(PeerId("12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K"))    
2022-10-06 13:49:56.761 TRACE tokio-runtime-worker sub-libp2p: PSM => Connect(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, SetId(2)): But peer is backed-off until Instant { tv_sec: 7912, tv_nsec: 115314318 }    

(note: the logs are obviously truncated, but nothing relevant happens for several seconds afterwards)

As you can see, there's no sync: Disconnected 12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K like expected. The sync state machine thinks that we're still peered with the node. And this is confirmed by the fact that the local Substrate node then sends block requests to the bootnode.

The difference on the lower level is that the bootnode closes the block announces substream (SetId(0)), but the local Substrate node doesn't seem to pick up this closing.

tomaka avatar Oct 06 '22 11:10 tomaka

Another maybe important piece of information is that I was originally debugging this between smoldot and the bootnode. The bootnode refuses smoldot (just like it refuses Substrate), and closes the substream, and smoldot picks up the substream closing.

So I would lean towards the bug being on the receiving side, where the substream is actually closed by the bootnode but Substrate doesn't detect it.

tomaka avatar Oct 06 '22 12:10 tomaka

CC @altonen @dmitry-markin something for you to look into. :)

bkchr avatar Oct 06 '22 12:10 bkchr

To debug, I would suggest to start two Substrate nodes, on one of them replace this with if true, connect the nodes together, and check whether what happens in the NotificationsInSubstream and NotificationsOutOpen (in client/network/src/protocol/notifications/upgrade/notifications.rs) of the two nodes matches.

There are normally 3 notification substreams being opened, so if you do println-debugging should probably print the main_protocol_name that we originally pass, as it can be used to match the substreams on both sides.

tomaka avatar Oct 06 '22 13:10 tomaka