polkadot icon indicating copy to clipboard operation
polkadot copied to clipboard

Kusama parachain validating: FetchPoV(NetworkError(NotConnected)) error

Open ScepticMatt opened this issue 3 years ago • 9 comments

During Paravalidating, some of our Kusama validators get the following error. Some have this error more often, some less often, but the average amount of era points is way below it should be. Any help in figuring out the cause for this would be highly appreciated.

Example error log

2022-09-12 12:27:31 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0x0d59283be909c1982af6291556e96bb3b6a933010bb476126b67e39070ca7ad4 authority_id=Public(ec6cbf9eeec4912d6c1c04e389a>

The nodes were synced using a RocksDB db from another of our own Validators that runs in "shadow mode", i.e. not in the active set and without a session key. Some validators were using a snapshot while is was only 30 minutes old, some a few days past.

I have confirmed that the P2P port is not blocked by the firewall:

nc -p 12345 -vz <ip> 30333 Connection to <ip> 30333 port [tcp] succeeded!

command line options:

 --chain kusama \
 --base-path "<directory>" \
 --validator \
 --database=RocksDb \
 --pruning 1000 \
 --telemetry-url 'wss://telemetry.polkadot.io/submit/ 1' \
 --telemetry-url 'wss://telemetry-backend.w3f.community/submit 1' \
 --port <port> --ws-port <ws-port> --rpc-port <rpc-port> --name "<name>"

Outside of polkadot and prometheus, nothing is running on the machines.

Environment info

Dedicated Server AMD Ryzen 9 5950X 16-Core Processor (Hyperthreading disabled) 128GiB System Memory (4 x 32GiB DIMM DDR4 3200 MHz) I210 Gigabit Network Connection 2 x SAMSUNG MZQL23T8HCLS-00A07 Ubuntu 22.04.1 LTS Linux version 5.15.0-47-generic (buildd@lcy02-amd64-060) (gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #51-Ubuntu SMP Thu Aug 11 07:51:15 UTC 2022

ScepticMatt avatar Sep 13 '22 13:09 ScepticMatt

CC @eskimor

bkchr avatar Sep 14 '22 21:09 bkchr

Can you confirm that the firewall allows outgoing connections to be initiated by the process?

rphmeier avatar Sep 14 '22 21:09 rphmeier

Can you confirm that the firewall allows outgoing connections to be initiated by the process?

nc -p 12345 -vz "snip" 30335 Connection to "snip" 30335 port [tcp/*] succeeded!

ScepticMatt avatar Sep 15 '22 11:09 ScepticMatt

We found out that the networkIDs of some of our nodes were the same due to using a snapshot database. We have corrected this issue, but the error still appears a day later on the logs of some of our nodes after the change. Era points are trending higher after the fix, but still to early to call.

ScepticMatt avatar Sep 15 '22 11:09 ScepticMatt

I can confirm that one of my Kusama validators had the same error while paravalidating. The session went fine otherwise but this error came up about 10 times. Port is also open according to nc.

MathCryptoDoc avatar Sep 16 '22 13:09 MathCryptoDoc

Can confirm that this error still appears on the logs quite often, but it does not affect our era points.

Example log output

2022-09-19 09:46:25 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0xc0e13cef06c5aa3d4d16847de29a8ab28ab3cc4472a77cfb5b23ad48e0cbaa61 authority_id=Public(8eda772a122b73fcd5db4ce06b53a8f5b1832ec36bbd2ec64ef0f1eae74de802 (FodDSFMb...))
2022-09-19 09:46:25 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0xc0e13cef06c5aa3d4d16847de29a8ab28ab3cc4472a77cfb5b23ad48e0cbaa61 authority_id=Public(9c7c4e290e1d13bc4807ed09f5c85c39c8e02d411b831bbe0961e7a5e4e8ea36 (G7VuybEH...))
2022-09-19 09:46:37 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0x0e05d39e3331504f49ff43ffa988d636c5ec85214bce9f3c8530d1556d636266 authority_id=Public(8eda772a122b73fcd5db4ce06b53a8f5b1832ec36bbd2ec64ef0f1eae74de802 (FodDSFMb...))
2022-09-19 09:46:57 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0x24508be9e2a76df6ada760d2f0aa9eb004a079adb324f152edeba5833c3947c7 authority_id=Public(8eda772a122b73fcd5db4ce06b53a8f5b1832ec36bbd2ec64ef0f1eae74de802 (FodDSFMb...))
2022-09-19 09:46:57 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0x24508be9e2a76df6ada760d2f0aa9eb004a079adb324f152edeba5833c3947c7 authority_id=Public(9c7c4e290e1d13bc4807ed09f5c85c39c8e02d411b831bbe0961e7a5e4e8ea36 (G7VuybEH...))
2022-09-19 09:49:14 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0x08d72b662622d5023b0f2e18ad7e34479bcc37e0cd844548fc62e3d88b360cdd authority_id=Public(9c7c4e290e1d13bc4807ed09f5c85c39c8e02d411b831bbe0961e7a5e4e8ea36 (G7VuybEH...))
2022-09-19 09:56:49 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0x245d116a09e8ccc5ff0f8385d7a051afdf45e70c8173ff8d038267d3efc477ff authority_id=Public(8eda772a122b73fcd5db4ce06b53a8f5b1832ec36bbd2ec64ef0f1eae74de802 (FodDSFMb...))
2022-09-19 09:56:49 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0x245d116a09e8ccc5ff0f8385d7a051afdf45e70c8173ff8d038267d3efc477ff authority_id=Public(9c7c4e290e1d13bc4807ed09f5c85c39c8e02d411b831bbe0961e7a5e4e8ea36 (G7VuybEH...))
2022-09-19 09:57:01 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0x9f747e7b78a6b6b24839fdeb76a8bfdb8c9240f84ca2aef42da6f4829ff01e8c authority_id=Public(8eda772a122b73fcd5db4ce06b53a8f5b1832ec36bbd2ec64ef0f1eae74de802 (FodDSFMb...))
2022-09-19 09:57:01 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0x9f747e7b78a6b6b24839fdeb76a8bfdb8c9240f84ca2aef42da6f4829ff01e8c authority_id=Public(9c7c4e290e1d13bc4807ed09f5c85c39c8e02d411b831bbe0961e7a5e4e8ea36 (G7VuybEH...))
2022-09-19 10:17:24 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0x4ed9de25dfff1eff73ee3c8e802340ea095ad42ea1d28e9147f05762316cb2fe authority_id=Public(8eda772a122b73fcd5db4ce06b53a8f5b1832ec36bbd2ec64ef0f1eae74de802 (FodDSFMb...))
2022-09-19 10:17:24 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0x4ed9de25dfff1eff73ee3c8e802340ea095ad42ea1d28e9147f05762316cb2fe authority_id=Public(9c7c4e290e1d13bc4807ed09f5c85c39c8e02d411b831bbe0961e7a5e4e8ea36 (G7VuybEH...))

ScepticMatt avatar Sep 19 '22 13:09 ScepticMatt

Hi @ScepticMatt ! Thanks for the details - from the snippet it looks as this only happens with two particular validators. Can you confirm this or is it just in this particular selection. I just checked our validators - no such entries.

eskimor avatar Sep 19 '22 14:09 eskimor

Hi @ScepticMatt ! Thanks for the details - from the snippet it looks as this only happens with two particular validators. Can you confirm this or is it just in this particular selection. I just checked our validators - no such entries.

No, this error sporadically appears on all our Validators with different authority_ids. We allow incoming connections from any port to the P2P port, and all outgoing connections. Here is an snipped from a Polkadot validator:

2022-09-12 18:22:06 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0x0d4324d54c3ffcada5c77f8d3c1b617f0fb85aebc835051502c5caa3d11c0cd9 authority_id=Public(f8bf480b21d5fcfdf7c0ed21b4968982c3b4798fe04b9e9462fe2bd8fb6c511b (16d9hUbM...))
2022-09-12 18:22:06 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0x0d4324d54c3ffcada5c77f8d3c1b617f0fb85aebc835051502c5caa3d11c0cd9 authority_id=Public(2ed25c5dd13af6134be006113dd5d261facffcb00b848f2a043e552e183cbd20 (124PhBwp...))
2022-09-12 18:23:32 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0x30e1633c4450574e644800167513d2bd57ff713266375114a69cbd5650c58038 authority_id=Public(f8bf480b21d5fcfdf7c0ed21b4968982c3b4798fe04b9e9462fe2bd8fb6c511b (16d9hUbM...))
2022-09-12 18:23:43 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0xe68123fda49971d5661793e1a48ecacbbc50ffee5c8f10a3766d1ef877d9ec31 authority_id=Public(f8bf480b21d5fcfdf7c0ed21b4968982c3b4798fe04b9e9462fe2bd8fb6c511b (16d9hUbM...))
2022-09-15 11:10:15 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0x8fd43d8ff41bc3393dcb9f28d8836651f9179d2d19f1c8e932fbba437ec07197 authority_id=Public(384967eb1b66db385ef3ccb98a75c82173e1c86cd62415832bc5d8e591b1712f (12GoVHVH...))
2022-09-19 09:56:01 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0x6d9f4cd5102888db793ad347f6f61b13f148ca57ac4ed4030f054e1c91d0bce4 authority_id=Public(60d3a60f115c98e714b9ca80bb1bfbe16704fb841cbfb0bedad90d8ee5106d21 (13BxUF8j...))
2022-09-12 18:17:01 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0x5556ef69c19ebece5d04bb3ff15cdecaeebd5df845ad7633ffad2fda7ff74de3 authority_id=Public(f8bf480b21d5fcfdf7c0ed21b4968982c3b4798fe04b9e9462fe2bd8fb6c511b (16d9hUbM...))
2022-09-12 18:18:13 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0xe1225b3418b8a5e7e20d236656c5c796e5fa196dbe84d563070916134790eeb5 authority_id=Public(f8bf480b21d5fcfdf7c0ed21b4968982c3b4798fe04b9e9462fe2bd8fb6c511b (16d9hUbM...))
2022-09-12 18:18:13 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0xe1225b3418b8a5e7e20d236656c5c796e5fa196dbe84d563070916134790eeb5 authority_id=Public(2ed25c5dd13af6134be006113dd5d261facffcb00b848f2a043e552e183cbd20 (124PhBwp...))
2022-09-12 18:21:42 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0x83afad24f429c6fb78d8fac2c3d413355565e5a9fa4f36aba562c36e9868fb34 authority_id=Public(f8bf480b21d5fcfdf7c0ed21b4968982c3b4798fe04b9e9462fe2bd8fb6c511b (16d9hUbM...))
2022-09-12 18:21:42 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0x83afad24f429c6fb78d8fac2c3d413355565e5a9fa4f36aba562c36e9868fb34 authority_id=Public(2ed25c5dd13af6134be006113dd5d261facffcb00b848f2a043e552e183cbd20 (124PhBwp...))
2022-09-12 18:22:06 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0x0d4324d54c3ffcada5c77f8d3c1b617f0fb85aebc835051502c5caa3d11c0cd9 authority_id=Public(f8bf480b21d5fcfdf7c0ed21b4968982c3b4798fe04b9e9462fe2bd8fb6c511b (16d9hUbM...))
2022-09-12 18:22:06 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0x0d4324d54c3ffcada5c77f8d3c1b617f0fb85aebc835051502c5caa3d11c0cd9 authority_id=Public(2ed25c5dd13af6134be006113dd5d261facffcb00b848f2a043e552e183cbd20 (124PhBwp...))
2022-09-12 18:23:32 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0x30e1633c4450574e644800167513d2bd57ff713266375114a69cbd5650c58038 authority_id=Public(f8bf480b21d5fcfdf7c0ed21b4968982c3b4798fe04b9e9462fe2bd8fb6c511b (16d9hUbM...))
2022-09-12 18:23:43 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0xe68123fda49971d5661793e1a48ecacbbc50ffee5c8f10a3766d1ef877d9ec31 authority_id=Public(f8bf480b21d5fcfdf7c0ed21b4968982c3b4798fe04b9e9462fe2bd8fb6c511b (16d9hUbM...))
2022-09-15 11:10:15 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0x8fd43d8ff41bc3393dcb9f28d8836651f9179d2d19f1c8e932fbba437ec07197 authority_id=Public(384967eb1b66db385ef3ccb98a75c82173e1c86cd62415832bc5d8e591b1712f (12GoVHVH...))
2022-09-19 09:56:01 fetch_pov_job err=FetchPoV(NetworkError(NotConnected)) pov_hash=0x6d9f4cd5102888db793ad347f6f61b13f148ca57ac4ed4030f054e1c91d0bce4 authority_id=Public(60d3a60f115c98e714b9ca80bb1bfbe16704fb841cbfb0bedad90d8ee5106d21 (13BxUF8j...))

ScepticMatt avatar Sep 20 '22 10:09 ScepticMatt

Interesting. Can you please check the following metrics on those validators:

  • polkadot_parachain_desired_peer_count
  • polkadot_parachain_peer_count

Something is off. Conceptually each validator should be connected to each other validator. This might not be true at startup for example ... are those errors close to a restart or all over the place?

eskimor avatar Sep 21 '22 08:09 eskimor

polkadot_parachain_desired_peer_count: 1009 polkadot_parachain_peer_count: 1041

errors are all over the place. edit: an a kusama validator node

ScepticMatt avatar Sep 23 '22 11:09 ScepticMatt

Thanks! Suggests, the error might be misleading. :thinking: Apart from having those messages, any issues?

eskimor avatar Sep 23 '22 12:09 eskimor

Ok, I can now see those errors on our validator as well. But they happen very rarely (hours in between) and can very easily be explained by validators restarting/updating.

eskimor avatar Sep 23 '22 12:09 eskimor

Thanks! Suggests, the error might be misleading. thinking Apart from having those messages, any issues?

not at the moment, no

ScepticMatt avatar Sep 23 '22 13:09 ScepticMatt

Looking at your logs again, it is indeed quite likely that this is just nodes restarting/updating. As there are no other issues, I am closing - feel free to re-open, if you think this needs more work.

eskimor avatar Sep 23 '22 19:09 eskimor