
Finality Lagging issues in Kusama archive & full nodes

Imod7 opened this issue 2 years ago

Is there an existing issue?

  • [X] I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • [X] This is not a support question.

Description of bug

The Integrations Team at Parity was contacted by a team who is running KSM nodes; they reported sync issues as soon as they upgraded to v0.9.37.

A small summary of the issues they faced & the solution they found:

  • While running their KSM nodes on v0.9.36, everything was very stable and there were no sync issues.
  • As soon as they upgraded to v0.9.37, their peer count dropped significantly.
  • They also faced finality lagging: the best block number was always much larger than the finalized block number.
  • The finalized block number was only being updated every few minutes.
  • In previous versions these two numbers were updated almost synchronously.
  • They added -l afg=trace,sync=debug flags to get more detailed logs and restarted the node.
  • After the restart, the number of peers increased (before the restart only < 3, after > 10).
  • The restart helped for some period of time, but the number of peers dropped again, finally to only 0 or 1.
  • Finalized blocks were again falling behind.
  • They downgraded to v0.9.36 and it now looks like there is no lagging.

Their Setup

  • They run Archive & Full Nodes
  • Regarding their infrastructure, storage and memory/CPU resources are dynamically provisioned and scale on demand, so system resources do not seem to be the cause.

Flags

Two examples of the flags they are using when running their nodes:

/usr/bin/polkadot --name $NAME_TEST --chain=kusama --database=RocksDb --rpc-cors all --pruning=100000 --prometheus-external --ws-external --rpc-external --wasm-execution Compiled -d /home/kusama/data

and:

--name kusamanetwork --chain=kusama --ws-external --rpc-external --rpc-cors=all --pruning archive

Lagging from Log

One output they shared from their logs that shows the sync issue / lagging:

Idle (4 peers), best: #16372008 (0x17d2…107b), finalized #16371948 (0x6855…620d)

Errors from Logs

After adding -l afg=trace,sync=debug and restarting, they shared their logs, so I am pasting here some of the errors found there:

2023-01-30 10:28:01.569 ERROR tokio-runtime-worker sync: 💔 Called on_block_justification with a bad peer ID
2023-01-30 10:28:02.518 DEBUG tokio-runtime-worker sync: New peer with unknown best hash 0xc94a…812f (16415165), searching for common ancestor.
2023-01-30 10:28:02.518 DEBUG tokio-runtime-worker sync: Connected 12D3KooWHFkuQi1...
2023-01-30 10:28:03.166 DEBUG tokio-runtime-worker sync: Propagating transactions
2023-01-30 10:28:03.813  INFO tokio-runtime-worker substrate: 💤 Idle (1 peers), best: #16415164 (0xd460…70e1), finalized #16415054 (0x3714…0c4d), ⬇ 38.1kiB/s ⬆ 2.0kiB/s
2023-01-30 10:28:04.517 DEBUG tokio-runtime-worker sync: Reannouncing block 0xc94a9a2018e27e320683c6c1bf... is_best: true
2023-01-30 10:28:04.517  INFO tokio-runtime-worker substrate: ✨ Imported #16415165 (0xc94a…812f)
2023-01-30 10:28:04.517 DEBUG tokio-runtime-worker sync: New best block imported 0xc94a9a2018e27e320683c6...
2023-01-30 10:28:04.517 DEBUG tokio-runtime-worker sync: 12D3KooWHF... disconnected
2023-01-30 10:28:04.975 ERROR tokio-runtime-worker sync: 💔 Called on_block_justification with a bad peer ID
2023-01-30 10:28:06.066 DEBUG tokio-runtime-worker sync: Propagating transactions

...

2023-01-30 10:07:25.064  WARN tokio-runtime-worker sync: 💔 Error importing block 0x74bdb741923407487154...: block has an unknown parent
2023-01-30 10:07:25.064 DEBUG tokio-runtime-worker sync: Restarted with 16414958 (0xe9c8…2882)
2023-01-30 10:07:25.064 DEBUG tokio-runtime-worker sync: New peer with known best hash 0x87f2…c82e (16414957).
2023-01-30 10:07:25.064 DEBUG tokio-runtime-worker sync: New peer with unknown best hash 0x2cf8…74d3 (16414958), searching for common ancestor.

If needed, we can share the full list of logs internally. cc @IkerAlus @CurlyBracketEffect @Imod7

Similar Past Issues?

I am adding here some past issues I found that look similar:

Related Current Issue?

There is one issue that was brought to my attention by @gabreal (devops) and it looks like it is related: polkadot-sync-0 + kusama-sync-0 lagging behind #26

Steps to reproduce

  • Upgrade a KSM node from v.0.9.36 to v.0.9.37

Imod7 avatar Feb 02 '23 12:02 Imod7

New peer with unknown best hash 0xc94a…812f (16415165), searching for common ancestor.

There's likely a bug in ancestry search as indicated by https://github.com/paritytech/polkadot-sdk/issues/528

error importing block 0x74bdb741923407487154...: block has an unknown parent

this is interesting because it's not clear to me why the remote would choose to provide us a block with an unknown parent. would be good to see debug logs for this case.

melekes avatar Feb 17 '23 12:02 melekes

this is interesting because it's not clear to me why the remote would choose to provide us a block with an unknown parent. would be good to see debug logs for this case.

The remote is not deciding which blocks it is providing to us; we are actively requesting blocks from the remote. It could either be a bug in sync, where we drain the received blocks incorrectly and did not import the parent before, or the parent failed to import. Or we requested something for which we don't have the parent.

bkchr avatar Feb 17 '23 12:02 bkchr

This is probably related to all the other syncing-related issues we've seen recently. We're working on a couple of fixes.

altonen avatar Feb 23 '23 08:02 altonen

  • After the restart, the number of peers increased (before the restart only < 3, after > 10).
  • Restart helped for some period of time but the number of peers dropped again, finally only 0 or 1.

I have a similar observation on our node.

Old behavior (v0.9.36): if the p2p port is closed by the firewall or you are running behind NAT, the node works properly. The sync peer number is equal to the value set in --out-peers=N. New behavior (v0.9.37 and later): if the p2p port is closed, the peer number jumps between 0 and 10. A node can stall for 10-15 min and then keep syncing.

For example, here is the node which had an incorrect --listen-addr flag (wrong port) and --in-peers=150 --out-peers=150:

Old behavior (v0.9.36): sync peers = 150-160

New behavior (v0.9.37): sync peers jumping between 0-10 (see screenshot from 2023-03-06 19-12-22).

Fix: after the --listen-addr flag was fixed, the sync peers grew to 300 (see screenshot of the fix moment).
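For reference, a minimal sketch of what a corrected invocation could look like, assuming the default P2P port 30333 (the interface, port and peer counts here are illustrative, not the exact flags of that node):

    # listen on all interfaces on the default P2P port 30333
    polkadot --chain=kusama \
      --listen-addr=/ip4/0.0.0.0/tcp/30333 \
      --in-peers=150 --out-peers=150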

BulatSaif avatar Mar 06 '23 17:03 BulatSaif

@BulatSaif thank you this is very valuable information

I wonder if this has something to do with libp2p, because it was updated for v0.9.37.

altonen avatar Mar 06 '23 18:03 altonen

@BulatSaif thank you this is very valuable information

I wonder if this has something to do with libp2p, because it was updated for v0.9.37.

@altonen maybe we should test if downgrading fixes it and then find out what the difference is.

bkchr avatar Mar 16 '23 17:03 bkchr

@BulatSaif thank you this is very valuable information. I wonder if this has something to do with libp2p, because it was updated for v0.9.37.

@altonen maybe we should test if downgrading fixes it and then find out what the difference is.

I've been prioritizing the CPU usage issue, since adjusting peer ratios should've bought us some time with this "node keeps losing peers" issue, and because I believe we can deploy a fix for the CPU issue relatively soon that would take some of the load off of NetworkWorker. But I will get back to this issue soon. I hope it's something we did in Substrate and not something that changed in libp2p.

altonen avatar Mar 16 '23 17:03 altonen

The same team reported low peer issues this morning after upgrading their KSM nodes to v0.9.41. Here are a few logs that they shared:

2023-04-19 04:04:30 💤 Idle (1 peers), best: #17545073 (0x1f8f…fa20), finalized #17544946 (0x4388…d267), ⬇ 7.3kiB/s ⬆ 2.3kiB/s
2023-04-19 04:04:35 💤 Idle (1 peers), best: #17545073 (0x1f8f…fa20), finalized #17544946 (0x4388…d267), ⬇ 1.3kiB/s ⬆ 1.7kiB/s
2023-04-19 04:04:37 ✨ Imported #17545074 (0x0455…6689)
2023-04-19 04:04:40 💤 Idle (1 peers), best: #17545074 (0x0455…6689), finalized #17544946 (0x4388…d267), ⬇ 21.8kiB/s ⬆ 1.5kiB/s
2023-04-19 04:04:40 ✨ Imported #17545075 (0x0b0d…4427)
2023-04-19 04:04:45 💤 Idle (0 peers), best: #17545075 (0x0b0d…4427), finalized #17544946 (0x4388…d267), ⬇ 6.2kiB/s ⬆ 1.4kiB/s
2023-04-19 04:04:50 💤 Idle (0 peers), best: #17545075 (0x0b0d…4427), finalized #17544946 (0x4388…d267), ⬇ 0.9kiB/s ⬆ 3.7kiB/s
2023-04-19 04:04:52 ✨ Imported #17545076 (0x6821…5488)
2023-04-19 04:04:52 ✨ Imported #17545077 (0xb928…4823)

I will share more detailed logs as soon as/if they are provided.

One interesting fact was that these issues appear only in the US and the Tokyo region. However, their nodes running in the EU (Ireland) are stable enough and they have at least 9-15 peers connected at any given time.

Imod7 avatar Apr 19 '23 08:04 Imod7

There will be a PSA related to these low peer counts with the next release. Basically, if the team is experiencing this and they're running the nodes behind a NAT, it would be a good idea to test opening the P2P port for inbound connections. The reason why the node is only able to have 0-1 peers is that the public nodes reachable to it are fully occupied; that is not an issue with the node implementation but with the network having too many nodes. If they have the P2P port open for inbound connections and they're still experiencing this, then the issue is something else. In that case, -lsync,sub-libp2p=trace logs are needed.
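As a hedged illustration of such a test (the host is a placeholder, and 30333 is only the default P2P port, which may differ if overridden), inbound reachability can be checked from a machine outside the NAT:

    # run from a host outside the firewall/NAT; replace NODE_PUBLIC_IP with the node's public address
    nc -vz NODE_PUBLIC_IP 30333
    # a successful TCP connection suggests the port is reachable; a timeout suggests it is still blocked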

altonen avatar Apr 19 '23 08:04 altonen

Thank you @altonen for the quick feedback! I forwarded your suggestions to the three teams who are experiencing the same/similar issue (low peer count/sync issue in v0.9.41).

  1. The first team cannot make any of the changes you suggest due to strict internal security restrictions. We are waiting for their detailed logs after they add the trace flags when running their node.

  2. I have not had any feedback from the second team on whether they will implement your suggestions. However, they shared their flags and logs earlier, so I am adding them here in case it is helpful for you.

    "--chain=kusama", "--no-prometheus", "--no-telemetry", "--unsafe-rpc-external", "--unsafe-ws-external", "--rpc-methods=unsafe", "--rpc-cors=all", "--pruning=archive", "--in-peers=200", "--out-peers=50", "--max-parallel-downloads=30", "-lsync,sub-libp2p=trace"
    

    Their sync still gets stuck at a certain height and then takes a long time to continue synchronizing.

    Their logs after adding trace flags ksm1.log

  3. The third team will try your suggestion now so I will keep you updated on the impact that this may have.

Imod7 avatar Apr 19 '23 10:04 Imod7

If they can't open the inbound port, even temporarily, to rule out that the problem is an inability to accept inbound connections, there's very little we can do about this. If they can provide at least trace logs, there might be something there.

I looked at the trace logs and they seem to have the same problem: --out-peers 50 (which, btw, please don't do; the default values should be kept and --in-peers should be increased), but they still very often have between 0-5 peers. After v0.9.42 is released, it may drop even further because SyncingEngine will keep evicting idle peers.
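For illustration only, a sketch of what that advice amounts to (not their exact command; the --in-peers value is arbitrary, and --out-peers is deliberately left at its default by omitting the flag):

    # keep --out-peers at its default; raise only the inbound limit
    polkadot --chain=kusama --in-peers=200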

I'm awaiting a response from the third team.

altonen avatar Apr 19 '23 11:04 altonen

The first team also shared logs after adding the -l afg=trace,sync=debug flags: ksm-v0.9.41-low-peers.log.tar.gz

  • There is this error: ERROR tokio-runtime-worker sync: 💔 Called on_block_justification with a peer ID of an unknown peer, which I think you are already aware of since it was also mentioned in issue #6228

  • and the same warning mentioned in the description: WARN tokio-runtime-worker sync: 💔 Error importing block 0x7026bac89f13cf68074ef15......: block has an unknown parent

Imod7 avatar Apr 20 '23 09:04 Imod7

The first issue should be fixed by this and it will be included in the next release.

In the logs there is this pattern that repeats constantly:

2023-04-20 03:01:43.409  INFO substrate: 💤 Idle (4 peers), best: #17558784 (0xca4e…dae8), finalized #17558755 (0xe754…58bb), ⬇ 2.0kiB/s ⬆ 2.0kiB/s
2023-04-20 03:01:44.393 DEBUG sync: New peer with unknown best hash 0x1641…7b9b (17558785), searching for common ancestor.
2023-04-20 03:01:44.393 DEBUG sync: Connected 12D3KooWGBnyipqbontcHFyXP4Z991T1iG6fPiDrPZ9UrcWmwusY
2023-04-20 03:01:45.288 DEBUG sync: New peer with unknown best hash 0x1641…7b9b (17558785), searching for common ancestor.
2023-04-20 03:01:45.288 DEBUG sync: Connected 12D3KooWQnHNy9p4Ydrd9BkE7oHfVgHf1taAWwWLPMkNgUvq2fdc
2023-04-20 03:01:45.688  INFO substrate: ✨ Imported #17558785 (0x1641…7b9b)
2023-04-20 03:01:45.688 DEBUG sync: Reannouncing block 0x16417327dc9dfa207a035124dd6190f851f005e2f43a7080addb12745a937b9b is_best: true
2023-04-20 03:01:45.688 DEBUG sync: New best block imported 0x16417327dc9dfa207a035124dd6190f851f005e2f43a7080addb12745a937b9b/#17558785
2023-04-20 03:01:45.689 DEBUG sync: 12D3KooWDUJZ7TpuPuCTxGfp2Y5gxApnJQS88p2pvZmcZQPd6NiH disconnected
2023-04-20 03:01:45.689 DEBUG sync: 12D3KooWBsJKGJFuv83ixryzMsUS53A8JzEVeTA8PGi4U6T2dnif disconnected
2023-04-20 03:01:45.689 DEBUG sync: 12D3KooWQnHNy9p4Ydrd9BkE7oHfVgHf1taAWwWLPMkNgUvq2fdc disconnected
2023-04-20 03:01:45.689 DEBUG sync: 12D3KooWEpotdFJjvs5tLopnJoM6YpyeKJgSPodJykS1DFHKhU6X disconnected
2023-04-20 03:01:45.689 DEBUG sync: 12D3KooWGBnyipqbontcHFyXP4Z991T1iG6fPiDrPZ9UrcWmwusY disconnected
2023-04-20 03:01:45.689 DEBUG sync: 12D3KooWFj2ndawdYyk2spc42Y2arYwb2TUoHLHFAsKuHRzWXwoJ disconnected

The node imports a block, announces it, and soon after peers start disconnecting. This is because right now the SyncingEngine falsely believes it has paired with these peers when in fact it has been rejected, most likely because the remote peers' full-node slots are already occupied and the outbound substream is, from the local node's point of view, closed. This is not detected until the node actually sends something: it tries to announce the new best block to its peers, notices that the substream is closed, and removes the peers.

I believe the underlying problem to be with fully occupied slots. The fact that it's observable in Japan and the US but not in Europe is interesting and I'll ask DevOps if we could have access to a VM in these problem regions.

altonen avatar Apr 20 '23 09:04 altonen

ksm-low-peers.log parity/polkadot:v0.9.42, deployed in AWS ap-northeast-1 Tokyo.

Hey team, we're still having the same sync issue after we opened the 30333 port. The logs with -l afg=trace,sync=debug flags are above.

Full command: --chain kusama --unsafe-rpc-external --unsafe-ws-external --rpc-cors=all --pruning=100000 -l afg=trace,sync=debug

johnsonco avatar May 16 '23 09:05 johnsonco

@johnsonco

can you run the node with -lsync,sub-libp2p=trace? You're running v0.9.42?

altonen avatar May 16 '23 09:05 altonen

@johnsonco reported that their node is syncing properly now. It required approximately 12 hours after they opened the p2p port to sync correctly. @altonen I shared the logs (with -lsync,sub-libp2p=trace flags) of the team in this google drive folder

Imod7 avatar May 17 '23 09:05 Imod7

It is quite interesting: the node is getting block announcements and its best block is increasing, but the finalized block is not. There is another bug report opened today that is very similar to this: https://github.com/paritytech/polkadot-sdk/issues/632

There's a suspicious pattern in the logs: the node keeps exchanging some message(s) over the GRANDPA substream hundreds of times a second:

{10:54:22.835 sub-libp2p: Handler(PeerId(\"12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2\")) \u003c= Sync notification
{10:54:22.835 sub-libp2p: External API =\u003e Notification(PeerId(\"12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2\"), OnHeap(\"/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/grandpa/1\"), 150 bytes)
{10:54:22.835 sub-libp2p: Handler(PeerId(\"12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2\")) \u003c= Sync notification
{10:54:22.835 sub-libp2p: External API =\u003e Notification(PeerId(\"12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2\"), OnHeap(\"/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/grandpa/1\"), 150 bytes)
{10:54:22.835 sub-libp2p: Handler(PeerId(\"12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2\")) \u003c= Sync notification
{10:54:22.835 sub-libp2p: External API =\u003e Notification(PeerId(\"12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2\"), OnHeap(\"/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/grandpa/1\"), 150 bytes)
{10:54:22.835 sub-libp2p: Handler(PeerId(\"12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2\")) \u003c= Sync notification
{10:54:22.835 sub-libp2p: External API =\u003e Notification(PeerId(\"12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2\"), OnHeap(\"/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/grandpa/1\"), 150 bytes)
{10:54:22.835 sub-libp2p: Handler(PeerId(\"12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2\")) \u003c= Sync notification
{10:54:22.836 sub-libp2p: External API =\u003e Notification(PeerId(\"12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2\"), OnHeap(\"/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/grandpa/1\"), 150 bytes)
{10:54:22.836 sub-libp2p: Handler(PeerId(\"12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2\")) \u003c= Sync notification
{10:54:22.845 sub-libp2p: Handler(ConnectionId(180)) =\u003e Notification(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2), 150 bytes)
{10:54:22.845 sub-libp2p: External API \u003c= Message(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2))
{10:54:22.845 sub-libp2p: Handler(ConnectionId(180)) =\u003e Notification(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2), 150 bytes)
{10:54:22.845 sub-libp2p: External API \u003c= Message(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2))
{10:54:22.845 sub-libp2p: Handler(ConnectionId(180)) =\u003e Notification(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2), 150 bytes)
{10:54:22.845 sub-libp2p: External API \u003c= Message(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2))
{10:54:22.845 sub-libp2p: Handler(ConnectionId(180)) =\u003e Notification(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2), 150 bytes)
{10:54:22.845 sub-libp2p: External API \u003c= Message(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2))
{10:54:22.845 sub-libp2p: Handler(ConnectionId(180)) =\u003e Notification(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2), 150 bytes)
{10:54:22.845 sub-libp2p: External API \u003c= Message(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2)
{10:54:22.845 sub-libp2p: Handler(ConnectionId(180)) =\u003e Notification(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2), 150 bytes)
{10:54:22.845 sub-libp2p: External API \u003c= Message(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2))
{10:54:22.845 sub-libp2p: Handler(ConnectionId(180)) =\u003e Notification(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2), 150 bytes)
{10:54:22.845 sub-libp2p: External API \u003c= Message(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2))

So it could be that it's actually getting finality notifications but they're buried under whatever this flood of other messages is. Then at some point 12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2 disconnects, which could be either a normal disconnection or the local node getting banned if it advertised duplicate messages.

@andresilva do you have any thoughts on this?

altonen avatar May 17 '23 11:05 altonen

The grandpa log target was changed from afg, so I'd need some logs with -lgrandpa=trace to be able to figure out what's happening there.
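For reference, a minimal sketch of an invocation combining the log targets requested in this thread so far (the levels are copied from the comments above; adjust them as needed):

    polkadot --chain=kusama -lgrandpa=trace,sync=debug,sub-libp2p=trace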

andresilva avatar May 17 '23 11:05 andresilva

The team shared new logs (with -lsync,sub-libp2p=trace and -lgrandpa=trace flags). The new file ksm-logs2.tar.gz is uploaded in the same google drive folder.

Please note that the new logs are from after the issue was solved, so the node no longer had the peer issue. However, I still requested the file in case it shows any interesting pattern/weird behaviour that is not causing any issues now but might affect the peer count again in the future.

Imod7 avatar May 18 '23 10:05 Imod7

Indeed these logs don't show anything abnormal, but that's expected since everything is working properly there. It could still be helpful to keep running with those log targets in case the issue happens again.

So it could be that it's actually getting finality notifications but they're buried under whatever this flood of other messages is. Then at some point 12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2 disconnects, which could be either a normal disconnection or the local node getting banned if it advertised duplicate messages.

@andresilva do you have any thoughts on this?

I have confirmed that these 150 byte messages are indeed GRANDPA votes. Since we didn't have grandpa=trace logs, I can't tell whether or not those are duplicate / stale votes that some peer is spamming us with; if they were, then the disconnection could be legitimately triggered by us (due to peer reputation getting too low).

andresilva avatar May 18 '23 11:05 andresilva

Team, we are facing the same issue. How do we open the p2p port?

bhargavtheertham-cb avatar May 18 '23 21:05 bhargavtheertham-cb

The P2P port by default should be 30333. You have to manually open it in the firewall/NAT settings to allow inbound connections.

An example that I have at hand is from the ansible scripts I used to set up my validator: p2p_port: 30333 and ufw allow {{ p2p_port }}/tcp. However, this is because I use the ufw firewall; in your case it depends on what setup you have.
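Roughly, assuming a ufw-based setup like mine, it boils down to the following (30333 is the default P2P port; adjust if you override it):

    # allow inbound TCP connections on the node's P2P port
    sudo ufw allow 30333/tcp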

Imod7 avatar May 19 '23 06:05 Imod7

Since the node is working fine, I'm not sure how much interesting information there is about a possible issue, but sadly these logs don't help me debug this further. GRANDPA just says it received a message, but I don't know what that message is or whether it's expected behavior.

07:07:40.150 sub-libp2p: Handler(PeerId(\"12D3KooWKdovqTGfjXRdtoSG4QY5ctGQprUsuReYxyujgE16ZcQa\")) \u003c= Sync notification
07:07:40.150 sub-libp2p: Handler(ConnectionId(108)) =\u003e Notification(12D3KooWC4iEzYnBiQFjZ5GRY84Wc6s7BytAyYc3zky3mUf7yTAt, SetId(2), 150 bytes
07:07:40.150 sub-libp2p: External API \u003c= Message(12D3KooWC4iEzYnBiQFjZ5GRY84Wc6s7BytAyYc3zky3mUf7yTAt, SetId(2))
07:07:40.151 sub-libp2p: Handler(ConnectionId(108)) =\u003e Notification(12D3KooWC4iEzYnBiQFjZ5GRY84Wc6s7BytAyYc3zky3mUf7yTAt, SetId(2), 150 bytes)
07:07:40.151 sub-libp2p: External API =\u003e Notification(PeerId(\"12D3KooWKdovqTGfjXRdtoSG4QY5ctGQprUsuReYxyujgE16ZcQa\"), OnHeap(\"/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/grandpa/1\"), 150 bytes)
07:07:40.151 sub-libp2p: Handler(PeerId(\"12D3KooWKdovqTGfjXRdtoSG4QY5ctGQprUsuReYxyujgE16ZcQa\")) \u003c= Sync notification
07:07:40.151 sub-libp2p: External API =\u003e Notification(PeerId(\"12D3KooWKdovqTGfjXRdtoSG4QY5ctGQprUsuReYxyujgE16ZcQa\"), OnHeap(\"/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/grandpa/1\"), 150 bytes)
07:07:40.151 sub-libp2p: Handler(PeerId(\"12D3KooWKdovqTGfjXRdtoSG4QY5ctGQprUsuReYxyujgE16ZcQa\")) \u003c= Sync notification
07:07:40.151 sub-libp2p: External API =\u003e Notification(PeerId(\"12D3KooWKdovqTGfjXRdtoSG4QY5ctGQprUsuReYxyujgE16ZcQa\"), OnHeap(\"/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/grandpa/1\"), 150 bytes)
07:07:40.151 sub-libp2p: Handler(PeerId(\"12D3KooWKdovqTGfjXRdtoSG4QY5ctGQprUsuReYxyujgE16ZcQa\")) \u003c= Sync notification
07:07:40.151 grandpa: Polling round 559, state = State { prevote_ghost: None, finalized: None, estimate: None, completable: false }, step = Some(Prevoted)
07:07:40.151 grandpa: Round 559: Got incoming message
07:07:40.151 grandpa: Round 559: Got incoming message
07:07:40.151 grandpa: Round 559: Got incoming message
07:07:40.151 grandpa: Round 559: Got incoming message
07:07:40.151 grandpa: Round 559: Got incoming message
07:07:40.151 grandpa: Round 559: Got incoming message
07:07:40.151 grandpa: Round 559: Got incoming message
07:07:40.151 grandpa: Round 559: Got incoming message
07:07:40.151 grandpa: Round 559: Got incoming message
07:07:40.151 grandpa: Round 559: Got incoming message
07:07:40.151 grandpa: Round 559: Got incoming message
07:07:40.151 sub-libp2p: External API =\u003e Notification(PeerId(\"12D3KooWKdovqTGfjXRdtoSG4QY5ctGQprUsuReYxyujgE16ZcQa\"), OnHeap(\"/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/grandpa/1\"), 150 bytes)
07:07:40.151 sub-libp2p: Handler(PeerId(\"12D3KooWKdovqTGfjXRdtoSG4QY5ctGQprUsuReYxyujgE16ZcQa\")) \u003c= Sync notification
07:07:40.151 sub-libp2p: External API =\u003e Notification(PeerId(\"12D3KooWKdovqTGfjXRdtoSG4QY5ctGQprUsuReYxyujgE16ZcQa\"), OnHeap(\"/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/grandpa/1\"), 150 bytes)
07:07:40.151 sub-libp2p: Handler(PeerId(\"12D3KooWKdovqTGfjXRdtoSG4QY5ctGQprUsuReYxyujgE16ZcQa\")) \u003c= Sync notification
07:07:40.151 sub-libp2p: External API \u003c= Message(12D3KooWC4iEzYnBiQFjZ5GRY84Wc6s7BytAyYc3zky3mUf7yTAt, SetId(2))
07:07:40.151 sub-libp2p: External API =\u003e Notification(PeerId(\"12D3KooWKdovqTGfjXRdtoSG4QY5ctGQprUsuReYxyujgE16ZcQa\"), OnHeap(\"/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/grandpa/1\"), 150 bytes)
07:07:40.151 sub-libp2p: Handler(PeerId(\"12D3KooWKdovqTGfjXRdtoSG4QY5ctGQprUsuReYxyujgE16ZcQa\")) \u003c= Sync notification
07:07:40.152 sub-libp2p: Handler(ConnectionId(108)) =\u003e Notification(12D3KooWC4iEzYnBiQFjZ5GRY84Wc6s7BytAyYc3zky3mUf7yTAt, SetId(2), 150 bytes)
07:07:40.152 sub-libp2p: External API \u003c= Message(12D3KooWC4iEzYnBiQFjZ5GRY84Wc6s7BytAyYc3zky3mUf7yTAt, SetId(2))

altonen avatar May 21 '23 08:05 altonen

The team of @bhargavtheertham-cb is also experiencing the peer/sync issue.

Summary of their Setup & Changes

  • They were setting up a new set of polkadot & kusama nodes in Kubernetes and noticed low peers. Node location: US.
  • They have other polkadot nodes not running in K8s and those nodes do not have the peer problem. The configuration between the two setups is the same, except that the nodes not running in K8s have more CPU/memory.
  • In K8s they are starting with 5 cores and 25 GB of RAM, but neither is being fully utilized as of our last communication.
  • They are running the nodes with the following commands:
 --rpc-port 19933 --rpc-external --ws-external --rpc-cors all --no-telemetry --pruning archive --chain=polkadot --in-peers 50 --out-peers 50

--rpc-port 19933 --rpc-external --ws-external --rpc-cors all --no-telemetry --pruning archive --chain=kusama --in-peers 50 --out-peers 50

  • I suggested that they change the values of in and out peers as per the suggestion here and open the p2p port (see the sketch after this list).
  • They reported that they opened the p2p port, 30333.
  • The issue of low peers is persisting.
  • Their logs (with -lgrandpa=trace and -lsync,sub-libp2p=trace) were uploaded.
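For concreteness, a hedged sketch of what the adjusted kusama command from above could look like under that suggestion (the --in-peers value is illustrative, and --out-peers is left at its default by omitting the flag):

    # same flags as their kusama node, with outbound peers left at the default and inbound raised
    --rpc-port 19933 --rpc-external --ws-external --rpc-cors all --no-telemetry --pruning archive --chain=kusama --in-peers 100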

Imod7 avatar May 21 '23 09:05 Imod7

I looked at the logs and the node is only establishing outbound connections, i.e., the role is always Dialer, meaning there are seemingly no inbound connections to the node. The problem is probably in the Kubernetes configuration; please ensure that the nodes are able to accept inbound connections. You can test this relatively easily by spinning up the cluster and then, from the outside, trying to connect to one of the nodes running inside of it.
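A hedged sketch of such a check, assuming the P2P port is exposed through a Kubernetes service (the namespace, service and addresses below are placeholders, not their actual setup):

    # inside the cluster: find how (and on which NodePort) the P2P port is exposed
    kubectl get svc -n kusama
    # from a machine outside the cluster: try to reach the advertised address/port
    nc -vz NODE_EXTERNAL_IP NODE_PORT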

altonen avatar May 21 '23 09:05 altonen

@Imod7 which helm chart are they using to set up a node in Kubernetes? If it is this paritytech/helm-charts, try to set:

node:
  perNodeServices:
    relayP2pService:
      enabled: true
    setPublicAddressToExternalIp:
      enabled: true

This will proxy port 30333 as a Kubernetes NodePort and set --public-addr to the correct value:

--public-addr=/ip4/${EXTERNAL_IP}/tcp/${RELAY_CHAIN_P2P_PORT}

BulatSaif avatar May 21 '23 10:05 BulatSaif

@altonen, upon deeper investigation, we are noticing the same finality lag and syncing issues in our non-K8s env also. We noticed that our deploy times (restore the latest snapshot and catch up to tip) are now taking several days, whereas a month or two ago they consistently completed in 30 mins or so. In the non-K8s env, we didn't have 30333 open and it had worked fine for a couple of years. Is opening port 30333 a new requirement? Also curious how opening the port helps keep stable peers.

bhargavtheertham-cb avatar May 21 '23 18:05 bhargavtheertham-cb

Are you able to pinpoint a release when the deploy times started to degrade?

I'm not sure if we're talking about the same thing, but opening the P2P port is not a requirement. However, if you have a low number of peers and that negatively affects the node, then opening the port should help with that.

altonen avatar May 22 '23 07:05 altonen

As discussed in https://github.com/paritytech/polkadot/issues/7259 I'm providing the startup logs and first few minutes of grandpa trace.

The node is generating approx 100k lines of log per minute with -lgrandpa=trace.

The following is a graylog export (just in case you wonder about the format) https://drive.google.com/file/d/1-Cx-lKDQsuuLrapIKdIilOS4OxO-9DBm/view?usp=sharing

SimonKraus avatar May 22 '23 21:05 SimonKraus

@altonen the issue seems to have started with 0.9.41. We are still not seeing any improvement in our nodes' stability / availability. We are using our own helm chart and not the one above. We have opened up the P2P port but it doesn't seem to help. We validated that the port is open:

telnet xx.xx.xx.xx 30333
Trying xx.xx.xx.xx ..
Connected to ipxx.xx.xx.xx ec2.internal.
Escape character is '^]'.

We had zero issues with peering before 0.9.41 and the nodes were very stable. Are other customers having the same issue with their nodes? This is becoming very critical for us.

bhargavtheertham-cb avatar May 27 '23 20:05 bhargavtheertham-cb