Fix spurious runtime not supported errors
A couple of times we get messages on versi that certain runtime requests are not supported. Which is not expected, as the deployed runtime does support them:
2022-08-16 14:46:09 | 2022-08-16 12:46:09.709 WARN tokio-runtime-worker parachain::dispute-coordinator: error=Runtime(RuntimeRequest(NotSupported { runtime_api_name: "candidate_events" }))
-- | --
|
Full logs: https://grafana.parity-mgmt.parity.io/goto/7HD0ZPm4k?orgId=1
This should be investigated as working runtime calls are crucial for disputes to work correctly. Related issue #5882
TODOs:
- [x] Have https://github.com/paritytech/substrate/pull/12707 merged in substrate.
- [ ] Show better error message instead of the misleading
parachain::dispute-coordinator: error=Runtime(RuntimeRequest(NotSupported))https://github.com/paritytech/polkadot/pull/6288 - [ ] Cancel any pending work on ActiveLeavesUpdates/Reorgs in the following subsystems:
- [ ] dispute-coordinator
- [ ] dispute-distribution
- [ ] candidate-backing
- [ ] bitfield-distribution
@tdimitrov I hope your runtime (subsystem) learnings come in handy here :wink:
A bit more context. The error occurs because the runtime version can't be queried:
2022-08-16T16:37:26+03:00 2022-08-16 13:37:26.215 WARN tokio-runtime-worker parachain::runtime-api: cannot query the runtime API version: UnknownBlock: State already discarded for BlockId::Hash(0x07450734a72f7ce4178d9ae40eb437924cc8b7ce2a3c474721d155dd7b80865f)
Logs: https://grafana.parity-mgmt.parity.io/goto/QQ6AWHZVz?orgId=1
Uh the usual 'state discarded' problem. The question is, why would the state be discarded :thinking:
why would the state be discarded
Yes. I think this is the main problem here. The runtime issue is a side effect. I just had a quick look at the issue yesterday, I'll continue investigating.
Potentially related (closing and moving discussion here, to avoid clutter): #5735 - please re-open, if they end up being different issues for some reason.
The question is, why would the state be discarded thinking
Is the queried block a fork that is behind the finalized block? If yes, this is your reason why it is already discarded.
@tdimitrov Good point by @bkchr - checking whether this can indeed be the case. I am hoping we are not building by default on such forks, as that would be wasted effort - but it could happen after the fact:
- We start building on a fork
- Another fork gets finalized
- State is discarded before we access it
Especially for approval checking and disputes this is not too unlikely to happen. Checking subsystems, whether they treat finalized signals properly might be a good first thing to do.
Even if finalized signals are handled properly, this could still happen as we will only handle the signal once we finished our current chunk of work, so we will only learn about the signal once we already ran into the error.
One solution could be to delay pruning by one slot. In any case the above error is misleading and should reflect what is actually happening.
It think this is exactly what happens:
- We start building on a fork
- Another fork gets finalized
- State is discarded before we access it
I traced one of the errors. There were two candidates for the same block number. One of them got finalized while process_active_leaves_update in dispute-coordinator was running for the other one.
@eskimor the delay of 2 ~~secs~~ mins between block import and the error in process_active_leaves_update is normal, right?
2022-08-16 09:32:54.299 INFO tokio-runtime-worker substrate: ✨ Imported #1046270 (0x6733…13f4)
2022-08-16 09:34:54.847 WARN tokio-runtime-worker parachain::runtime-api: cannot query the runtime API version: UnknownBlock: State already discarded for BlockId::Hash(0x673383f344cf4b67c8055df35b044908c7aa88de0f1f5e5d322ad304ecd413f4)
This looks like two minutes to me :thinking:
It is two minutes indeed... isn't it too much?
Yes, the most likely reason is that those messages are unrelated. Otherwise we just found a crucial bug.
Yes, the most likely reason is that those messages are unrelated. Otherwise we just found a crucial bug.
The node was restarted between them. Maybe this is the reason.
I checked the other errors - there are quite a lot of cases where the error occurs after restart. I'll focus on the rest for now.
A small update for this one.
After investigating various occurrences of the error above I think the reason really is that the node tries to do a runtime query for pruned block (as @bkchr suggested). I found two separate events which preceded the error.
The first is node restart which I think doesn't need any further investigation.
The second is this error:
2022-10-20 12:46:09 | 2022-10-20 09:46:09.868 DEBUG tokio-runtime-worker parachain::statement-distribution: our view out-of-sync with active heads; head not found relay_parent=0x32f5…ad80
-- | --
I assume the problem for this is a connectivity but this is just a speculation.
Besides the dispute-coordinator the following subsystems are also affected;
candidate-backing:
2022-10-20 15:02:36 | 2022-10-20 12:02:36.062 WARN tokio-runtime-worker parachain::candidate-backing: Failed to fetch runtime API data for job err=NotSupported { runtime_api_name: "validators" }
-- | --
bitfield-distribution:
2022-10-20 15:02:36 | 2022-10-20 12:02:36.062 WARN tokio-runtime-worker parachain::bitfield-distribution: Failed to fetch basics from runtime API relay_parent=0xd3dc1f4174c90c5cc809c0b8e6ae5e3d544469f1cae9261c42c3754ae5d4a94b err=NotSupported { runtime_api_name: "validators" }
-- | --
dispute-distribution:
2022-10-20 15:02:36 | 2022-10-20 12:02:36.072 WARN tokio-runtime-worker parachain::dispute-distribution: error=Sender(Runtime(RuntimeRequest(NotSupported { runtime_api_name: "session_index_for_child" }))) ctx="on FromOrchestra"
-- | --
I think at this point makes no sense to investigate more occurrences of the error. The focus should be:
- [ ] Figure out why
our view out-of-sync with active headserror occurs. Is it really a connectivity problem?
Avoid working on stale heads in:
- [ ]
dispute-coordinator - [ ]
dispute-distribution - [ ]
candidate-backing - [ ]
bitfield-distribution
Additionally (from comments below):
- [ ] Show better error message instead of misleading
parachain::dispute-coordinator: error=Runtime(RuntimeRequest(NotSupported
Also we should really fix the error. NotSupported is slightly misleading. Thanks Tsveto!