polkadot Fix spurious runtime not supported errors

A couple of times we get messages on versi that certain runtime requests are not supported. Which is not expected, as the deployed runtime does support them:



2022-08-16 14:46:09 | 2022-08-16 12:46:09.709  WARN tokio-runtime-worker parachain::dispute-coordinator: error=Runtime(RuntimeRequest(NotSupported { runtime_api_name: "candidate_events" }))
-- | --
  |

Full logs: https://grafana.parity-mgmt.parity.io/goto/7HD0ZPm4k?orgId=1

This should be investigated as working runtime calls are crucial for disputes to work correctly. Related issue #5882

TODOs:

[x] Have https://github.com/paritytech/substrate/pull/12707 merged in substrate.
[ ] Show better error message instead of the misleading parachain::dispute-coordinator: error=Runtime(RuntimeRequest(NotSupported)) https://github.com/paritytech/polkadot/pull/6288
[ ] Cancel any pending work on ActiveLeavesUpdates/Reorgs in the following subsystems:
- [ ] dispute-coordinator
- [ ] dispute-distribution
- [ ] candidate-backing
- [ ] bitfield-distribution

Aug 16 '22 12:08 eskimor

@tdimitrov I hope your runtime (subsystem) learnings come in handy here :wink:

Aug 16 '22 12:08 eskimor

A bit more context. The error occurs because the runtime version can't be queried:

2022-08-16T16:37:26+03:00	2022-08-16 13:37:26.215  WARN tokio-runtime-worker parachain::runtime-api: cannot query the runtime API version: UnknownBlock: State already discarded for BlockId::Hash(0x07450734a72f7ce4178d9ae40eb437924cc8b7ce2a3c474721d155dd7b80865f)

Logs: https://grafana.parity-mgmt.parity.io/goto/QQ6AWHZVz?orgId=1

Aug 23 '22 12:08 tdimitrov

Uh the usual 'state discarded' problem. The question is, why would the state be discarded :thinking:

Aug 24 '22 14:08 eskimor

why would the state be discarded

Yes. I think this is the main problem here. The runtime issue is a side effect. I just had a quick look at the issue yesterday, I'll continue investigating.

Aug 24 '22 14:08 tdimitrov

Potentially related (closing and moving discussion here, to avoid clutter): #5735 - please re-open, if they end up being different issues for some reason.

Aug 24 '22 15:08 eskimor

The question is, why would the state be discarded thinking

Is the queried block a fork that is behind the finalized block? If yes, this is your reason why it is already discarded.

Aug 24 '22 21:08 bkchr

@tdimitrov Good point by @bkchr - checking whether this can indeed be the case. I am hoping we are not building by default on such forks, as that would be wasted effort - but it could happen after the fact:

We start building on a fork
Another fork gets finalized
State is discarded before we access it

Especially for approval checking and disputes this is not too unlikely to happen. Checking subsystems, whether they treat finalized signals properly might be a good first thing to do.

Even if finalized signals are handled properly, this could still happen as we will only handle the signal once we finished our current chunk of work, so we will only learn about the signal once we already ran into the error.

Aug 25 '22 07:08 eskimor

One solution could be to delay pruning by one slot. In any case the above error is misleading and should reflect what is actually happening.

Aug 25 '22 07:08 eskimor

It think this is exactly what happens:

We start building on a fork

Another fork gets finalized

State is discarded before we access it

I traced one of the errors. There were two candidates for the same block number. One of them got finalized while process_active_leaves_update in dispute-coordinator was running for the other one.

@eskimor the delay of 2 ~~secs~~ mins between block import and the error in process_active_leaves_update is normal, right?

2022-08-16 09:32:54.299  INFO tokio-runtime-worker substrate: ✨ Imported #1046270 (0x6733…13f4)

2022-08-16 09:34:54.847  WARN tokio-runtime-worker parachain::runtime-api: cannot query the runtime API version: UnknownBlock: State already discarded for BlockId::Hash(0x673383f344cf4b67c8055df35b044908c7aa88de0f1f5e5d322ad304ecd413f4)

Sep 05 '22 12:09 tdimitrov

This looks like two minutes to me :thinking:

Sep 05 '22 17:09 eskimor

It is two minutes indeed... isn't it too much?

Sep 05 '22 18:09 tdimitrov

Yes, the most likely reason is that those messages are unrelated. Otherwise we just found a crucial bug.

Sep 06 '22 14:09 eskimor

Yes, the most likely reason is that those messages are unrelated. Otherwise we just found a crucial bug.

The node was restarted between them. Maybe this is the reason.

I checked the other errors - there are quite a lot of cases where the error occurs after restart. I'll focus on the rest for now.

Sep 06 '22 17:09 tdimitrov

A small update for this one.

After investigating various occurrences of the error above I think the reason really is that the node tries to do a runtime query for pruned block (as @bkchr suggested). I found two separate events which preceded the error.

The first is node restart which I think doesn't need any further investigation.

The second is this error:

2022-10-20 12:46:09 | 2022-10-20 09:46:09.868 DEBUG tokio-runtime-worker parachain::statement-distribution: our view out-of-sync with active heads; head not found relay_parent=0x32f5…ad80
-- | --

I assume the problem for this is a connectivity but this is just a speculation.

Besides the dispute-coordinator the following subsystems are also affected;

candidate-backing:

2022-10-20 15:02:36 | 2022-10-20 12:02:36.062  WARN tokio-runtime-worker parachain::candidate-backing: Failed to fetch runtime API data for job err=NotSupported { runtime_api_name: "validators" }
-- | --

bitfield-distribution:

2022-10-20 15:02:36 | 2022-10-20 12:02:36.062  WARN tokio-runtime-worker parachain::bitfield-distribution: Failed to fetch basics from runtime API relay_parent=0xd3dc1f4174c90c5cc809c0b8e6ae5e3d544469f1cae9261c42c3754ae5d4a94b err=NotSupported { runtime_api_name: "validators" }
-- | --

dispute-distribution:

2022-10-20 15:02:36 | 2022-10-20 12:02:36.072  WARN tokio-runtime-worker parachain::dispute-distribution: error=Sender(Runtime(RuntimeRequest(NotSupported { runtime_api_name: "session_index_for_child" }))) ctx="on FromOrchestra"
-- | --

I think at this point makes no sense to investigate more occurrences of the error. The focus should be:

[ ] Figure out why our view out-of-sync with active heads error occurs. Is it really a connectivity problem?

Avoid working on stale heads in:

[ ] dispute-coordinator
[ ] dispute-distribution
[ ] candidate-backing
[ ] bitfield-distribution

Additionally (from comments below):

[ ] Show better error message instead of misleading parachain::dispute-coordinator: error=Runtime(RuntimeRequest(NotSupported

Oct 25 '22 12:10 tdimitrov

Also we should really fix the error. NotSupported is slightly misleading. Thanks Tsveto!

Oct 27 '22 14:10 eskimor