flow-go
Unable to retrieve events for certain block heights
🐞 Bug Report
Requests to retrieve events for certain block heights fail.
$ flow events get --start=68225795 --end=68225795 -n mainnet A.d0bcefdf1e67ea85.HWGaragePMV2.AirdropBurn
:x: Command Error: client: rpc error: code = ResourceExhausted desc = failed to retrieve events from execution nodes: 1 error occurred:
* rpc error: code = ResourceExhausted desc =
While this is related to the bug https://github.com/dapperlabs/flow-go/issues/6959, it points to a different issue.
Currently, EN1 is set to return a ResourceExhausted error when queried for events. However, the fact that the GetEvents call consistently fails indicates that the public access nodes always query only EN1. This would happen if the access node only got one execution receipt for the block and it was from EN1. Hence, the core issue here is that the access node is most likely missing execution receipts from the other execution nodes.
What is the severity of this bug?
important
Critical - Urgent: We can't do anything if this isn't actioned immediately (product doesn't function without this, it's blocking us or users, or it resolves a high severity security issue). Whole team should drop what they're doing and work on this.
Critical: We can't do anything if this isn't actioned immediately (product doesn't function without this, it's blocking us or users, or it resolves a high severity security issue). One person should look at this right now.
Important: We have to do this before we ship, but it can wait until the next sprint (product or feature won't function without it, but it's not blocking us or users right now). Team should do this in the next sprint.
Should have: It would be better if we do this before we ship, but it's OK if we don't (product functions without this, but it's a better user experience). Consider adding to a future sprint.
Could have: It really doesn't matter if we do this (product functions without this, impact to user is minimal).
Reproduction steps
Steps to reproduce the behaviour:
$ flow events get --start=68225796 --end=68225796 -n mainnet A.d0bcefdf1e67ea85.HWGaragePMV2.AirdropBurn
❌ Command Error: client: rpc error: code = ResourceExhausted desc = failed to retrieve events from execution nodes: 1 error occurred:
* rpc error: code = ResourceExhausted desc =
Expected behaviour
Events should be returned.
Workaround
Access nodes 7 and 8, run by the foundation, serve events locally and respond without an error for those block heights.
$ flow events get --start=68225795 --end=68225795 --host access-008.mainnet24.nodes.onflow.org:9000 A.d0bcefdf1e67ea85.HWGaragePMV2.AirdropBurn
Add any other context about the problem here.
This PR https://github.com/onflow/flow-go/pull/5764 fixes the issue of EN1 missing events. Once the fix for that is rolled out, the client should not receive an error, since EN1 will have all the events. However, the root cause of this issue would still persist and needs to be fixed.
The next step is to reproduce this against a single AN, then inspect the receipts for the block to see how many there are and which nodes they came from.
ANs index execution receipts in the ingestion engine here: https://github.com/onflow/flow-go/blob/512eb324ce87889b06c667568a510ef8a7a75644/engine/access/ingestion/engine.go#L435
Then choose an execution node based on receipts in storage here: https://github.com/onflow/flow-go/blob/512eb324ce87889b06c667568a510ef8a7a75644/engine/access/rpc/backend/backend.go#L461
It's possible for an AN to have only received a receipt from a single or even no ENs for a block. In this case, the AN should just try any EN.
I think we're running into a special case in this situation. If an AN is configured with a list of "preferred execution nodes", it will select one or more nodes from that list that it has receipts from. However, if it selects only a single node and the request to that node fails, it will not retry on another node.
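To make the failure mode concrete, here is a minimal sketch of an ordered retry over candidate ENs. It is not the flow-go implementation: tryNodes, fakeQuery, and the node names are hypothetical, and the query function stands in for the real RPC to an execution node. It only illustrates that a candidate list of one leaves nothing to fall back to.

```go
package main

import (
	"errors"
	"fmt"
)

// errExhausted mimics the ResourceExhausted response seen from EN1.
var errExhausted = errors.New("resource exhausted")

// fakeQuery is a hypothetical stand-in for the RPC an AN makes to one EN.
// In this sketch EN1 always fails and every other node succeeds.
func fakeQuery(nodeID string) (string, error) {
	if nodeID == "EN1" {
		return "", errExhausted
	}
	return "events from " + nodeID, nil
}

// tryNodes calls query on each candidate in order and returns the first
// successful result. If every candidate fails, all errors are returned joined.
func tryNodes(candidates []string, query func(string) (string, error)) (string, error) {
	if len(candidates) == 0 {
		return "", errors.New("no execution nodes to try")
	}
	var errs []error
	for _, id := range candidates {
		res, err := query(id)
		if err == nil {
			return res, nil
		}
		errs = append(errs, fmt.Errorf("%s: %w", id, err))
	}
	return "", errors.Join(errs...)
}

func main() {
	// Only EN1 was selected: the request fails with no fallback.
	_, err := tryNodes([]string{"EN1"}, fakeQuery)
	fmt.Println("single candidate:", err)

	// With a second candidate, the failure on EN1 is absorbed.
	res, err := tryNodes([]string{"EN1", "EN2"}, fakeQuery)
	fmt.Println("two candidates:", res, err)
}
```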
There are 2 flags an AN can use to control which EN to use:
- --preferred-execution-node-ids: if this is set, the AN will prefer to use a node from this list if it has a receipt from any of them. Otherwise, it will fall back to using any EN.
- --fixed-execution-node-ids: if this is set, the AN will only use nodes from this list.
Otherwise, the node will try with any execution node.
Here's the logic: https://github.com/onflow/flow-go/blob/6f0e33aa35f007f6b03447e359ea1c0c52780ff6/engine/access/rpc/backend/backend.go#L530-L563
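As a rough illustration of how the two flags interact, the selection can be modeled as below. This is a simplified sketch based only on the description in this thread, not a copy of the code at the link above: chooseNodes and intersect are hypothetical names, node IDs are plain strings, and the precedence when both flags are set (preferred first, then fixed) is an assumption.

```go
package main

import "fmt"

// intersect returns the elements of a that also appear in b, in a's order.
func intersect(a, b []string) []string {
	inB := make(map[string]bool, len(b))
	for _, id := range b {
		inB[id] = true
	}
	var out []string
	for _, id := range a {
		if inB[id] {
			out = append(out, id)
		}
	}
	return out
}

// chooseNodes models the flag semantics described in the thread.
//   withReceipt: ENs the AN holds an execution receipt from for the block
//   preferred:   value of --preferred-execution-node-ids (may be empty)
//   fixed:       value of --fixed-execution-node-ids (may be empty)
func chooseNodes(withReceipt, preferred, fixed []string) []string {
	// Preferred nodes win if at least one of them has a receipt.
	if len(preferred) > 0 {
		if chosen := intersect(preferred, withReceipt); len(chosen) > 0 {
			return chosen
		}
	}
	// With a fixed list, only nodes from that list are used.
	if len(fixed) > 0 {
		return append([]string(nil), fixed...)
	}
	// No usable preference: any EN that reported executing the block.
	return append([]string(nil), withReceipt...)
}

func main() {
	// The problematic case: two preferred nodes, but a receipt only from EN1.
	// The selection collapses to a single node.
	fmt.Println(chooseNodes([]string{"EN1"}, []string{"EN1", "EN2"}, nil))

	// No preferred node has a receipt: fall back to the ENs with receipts.
	fmt.Println(chooseNodes([]string{"EN3"}, []string{"EN1", "EN2"}, nil))
}
```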
This issue comes up when an access node only has receipts from a single EN. In this case, if that node is offline or returns an error, the AN will not retry on any other node. This can create the situation where data for some blocks effectively becomes unavailable on that node.
ANs receive receipts from ENs as they execute blocks, and from the actual block as they are received from consensus nodes. It's possible in some situations for an AN to only have a single receipt for a block in its store, so that situation should be handled.
I think we should update the behavior so that, when --preferred-execution-node-ids is set and fewer nodes are selected than the value defined at https://github.com/onflow/flow-go/blob/6f0e33aa35f007f6b03447e359ea1c0c52780ff6/engine/access/rpc/backend/node_communicator.go#L13, the list is padded up to 3 nodes using the following methods (in order):
- Use any EN with a receipt
- Use any preferred node not already selected
- Use any EN not already selected
This would ensure there are enough fallbacks to handle cases where ENs are unavailable
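The padding proposed above could look roughly like this. It is only a sketch of the idea: padSelection and targetNodes are hypothetical names, node IDs are plain strings, and the target of 3 is taken from the proposal rather than from the flow-go source.

```go
package main

import "fmt"

// targetNodes is the list size proposed in the thread ("padded up to 3 nodes").
const targetNodes = 3

// padSelection extends selected up to targetNodes entries, drawing from the
// pools in the proposed order: ENs with a receipt, then preferred nodes,
// then any other EN. Nodes already in the list are never added twice.
func padSelection(selected, withReceipt, preferred, allENs []string) []string {
	out := append([]string(nil), selected...)
	seen := make(map[string]bool, len(out))
	for _, id := range out {
		seen[id] = true
	}
	add := func(pool []string) {
		for _, id := range pool {
			if len(out) >= targetNodes {
				return
			}
			if !seen[id] {
				seen[id] = true
				out = append(out, id)
			}
		}
	}
	add(withReceipt) // 1. any EN with a receipt
	add(preferred)   // 2. any preferred node not already selected
	add(allENs)      // 3. any EN not already selected
	return out
}

func main() {
	selected := []string{"EN1"}         // the only preferred node with a receipt
	withReceipt := []string{"EN1"}      // AN holds a single receipt
	preferred := []string{"EN1", "EN2"} // operator's preferred list
	allENs := []string{"EN1", "EN2", "EN3", "EN4"}

	fmt.Println(padSelection(selected, withReceipt, preferred, allENs))
}
```

With the inputs in main, the single selected node is padded with one preferred node and one arbitrary EN, so a failure on EN1 leaves two fallbacks.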
- Use any EN with a receipt
- Use any preferred node not already selected
- Use any EN not already selected
Shouldn't the order be:
- Use any preferred node not already selected
- Use any EN with a receipt
- Use any EN not already selected
Since the operator wants the preferred nodes to be given more weight.
My thinking is that "preferred" implies that the node will try to use these if one of these nodes has executed the block; otherwise it will use another node.
If we failed over to any preferred EN, I think we're more likely to see delays responding to queries if there are other ENs that have reported executing. I'm OK with either approach.
- Use any EN with a receipt
- Use any preferred node not already selected
- Use any EN not already selected
You are right - I mistakenly assumed preferred nodes would always be in the EN receipt.
Good with the order you suggested.
One question though - is the AN capable of differentiating between EN responding with a not-found error versus an EN responding with any other error?
One question though - is the AN capable of differentiating between EN responding with a not-found error versus an EN responding with any other error?
In some cases it does, but we can certainly add it where needed. Did you have a case in mind that should be checked?