Unable to retrieve events for certain block heights

Open vishalchangrani opened this issue 9 months ago • 8 comments

🐞 Bug Report

Requests to retrieve events for certain block heights fail.

flow events get --start=68225795 --end=68225795 -n mainnet A.d0bcefdf1e67ea85.HWGaragePMV2.AirdropBurn.

:x: Command Error: client: rpc error: code = ResourceExhausted desc = failed to retrieve events from execution nodes: 1 error occurred:
	* rpc error: code = ResourceExhausted desc =

While this is related to the bug https://github.com/dapperlabs/flow-go/issues/6959, it points to a different issue. Currently, EN1 is set to return a ResourceExhausted error when queried for events. However, the fact that the GetEvents call fails consistently indicates that the public access nodes always query only EN1. This would happen if the access node received only one execution receipt for the block, and it was from EN1. Hence the core issue is that the access node is most likely missing execution receipts from the other execution nodes.

What is the severity of this bug?

important

Critical - Urgent: We can't do anything if this isn't actioned immediately (product doesn't function without this, it's blocking us or users, or it resolves a high severity security issue). Whole team should drop what they're doing and work on this.

Critical: We can't do anything if this isn't actioned immediately (product doesn't function without this, it's blocking us or users, or it resolves a high severity security issue). One person should look at this right now.

Important: We have to do this before we ship, but it can wait until the next sprint (product or feature won't function without it, but it's not blocking us or users right now). Team should do this in the next sprint.

Should have: It would be better if we do this before we ship, but it's OK if we don't (product functions without this, but it's a better user experience). Consider adding to a future sprint.

Could have: It really doesn't matter if we do this (product functions without this, impact to user is minimal).

Reproduction steps

Steps to reproduce the behaviour:

$ flow events get --start=68225796 --end=68225796 -n mainnet A.d0bcefdf1e67ea85.HWGaragePMV2.AirdropBurn

❌ Command Error: client: rpc error: code = ResourceExhausted desc = failed to retrieve events from execution nodes: 1 error occurred:
	* rpc error: code = ResourceExhausted desc = 

Expected behaviour

Events should be returned.

Workaround

Access nodes 7 and 8, run by the foundation, serve events locally and respond without an error for those block heights.

$ flow events get --start=68225795 --end=68225795 --host access-008.mainnet24.nodes.onflow.org:9000 A.d0bcefdf1e67ea85.HWGaragePMV2.AirdropBurn



vishalchangrani avatar Apr 29 '24 22:04 vishalchangrani

This PR https://github.com/onflow/flow-go/pull/5764 fixes the issue of EN1 missing events. Once that fix is rolled out, the client should not receive an error, since EN1 will have all the events. However, the root cause of this issue would still persist and needs to be fixed.

vishalchangrani avatar Apr 29 '24 22:04 vishalchangrani

The next step is to reproduce this against a single AN, then inspect the receipts for the block to see how many there are and which nodes they came from.

ANs index execution receipts in the ingestion engine here: https://github.com/onflow/flow-go/blob/512eb324ce87889b06c667568a510ef8a7a75644/engine/access/ingestion/engine.go#L435

Then choose an execution node based on receipts in storage here: https://github.com/onflow/flow-go/blob/512eb324ce87889b06c667568a510ef8a7a75644/engine/access/rpc/backend/backend.go#L461

It's possible for an AN to have received a receipt from only a single EN, or even from none, for a block. In this case, the AN should just try any EN.

I think we're running into a special case in this situation. If an AN is configured with a list of "preferred execution nodes", it will select one or more nodes from that list that it has receipts from. However, if it selects only a single node and the request to that node fails, it will not retry on another node.

peterargue avatar Apr 29 '24 22:04 peterargue

There are 2 flags an AN can use to control which EN to use:

  • --preferred-execution-node-ids: if this is set, the AN will prefer to use a node from this list if it has a receipt from any of them. Otherwise, it will fall back to using any EN.
  • --fixed-execution-node-ids: if this is set the AN will only use nodes from this list.

Otherwise, the node will try with any execution node.
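The flag behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the flow-go implementation linked below: `chooseExecutionNodes` and `intersect` here are hypothetical helpers over plain string IDs, and the precedence when both flags are set (preferred nodes with a receipt first, then the fixed list) is an assumption.

```go
package main

import "fmt"

// intersect returns the elements of a that also appear in b, keeping a's order.
func intersect(a, b []string) []string {
	inB := make(map[string]bool, len(b))
	for _, id := range b {
		inB[id] = true
	}
	var out []string
	for _, id := range a {
		if inB[id] {
			out = append(out, id)
		}
	}
	return out
}

// chooseExecutionNodes is a hypothetical sketch of the selection rules.
//   fixed, preferred: values of the two flags (may be empty)
//   withReceipt:      ENs the AN holds a receipt from for the block
//   all:              every known EN
func chooseExecutionNodes(fixed, preferred, withReceipt, all []string) []string {
	// Prefer nodes from the preferred list that have reported executing the block.
	if hits := intersect(preferred, withReceipt); len(hits) > 0 {
		return hits
	}
	// If a fixed list is configured, only nodes from it may be used.
	if len(fixed) > 0 {
		return fixed
	}
	// Otherwise, any execution node is a candidate.
	return all
}

func main() {
	all := []string{"EN1", "EN2", "EN3"}
	fmt.Println(chooseExecutionNodes(nil, []string{"EN1", "EN2"}, []string{"EN2", "EN3"}, all))
}
```

Note that when only one preferred node has a receipt, this returns a single-element list, which is the situation discussed in this thread.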

Here's the logic: https://github.com/onflow/flow-go/blob/6f0e33aa35f007f6b03447e359ea1c0c52780ff6/engine/access/rpc/backend/backend.go#L530-L563

This issue comes up when an access node only has receipts from a single EN. In this case, if that node is offline or returns an error, the AN will not retry on any other node. This can create the situation where data for some blocks effectively becomes unavailable on that node.
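To make the failure mode concrete, here is a small sketch of querying a list of candidate nodes in order. It is an illustration only: `queryWithFallback` and `flakyQuery` are hypothetical names, and `flakyQuery` simply simulates EN1 always returning an error. With a one-element candidate list there is nothing to fall back to, so the caller sees the error from that single node.

```go
package main

import (
	"errors"
	"fmt"
)

// errResourceExhausted stands in for the error EN1 returns in this issue.
var errResourceExhausted = errors.New("ResourceExhausted")

// flakyQuery simulates an events request: EN1 always fails, other nodes succeed.
func flakyQuery(node string) (string, error) {
	if node == "EN1" {
		return "", errResourceExhausted
	}
	return "events from " + node, nil
}

// queryWithFallback tries each node in order and returns the first success.
// If every node fails, the last error is returned, wrapped with the node ID.
func queryWithFallback(nodes []string, query func(string) (string, error)) (string, error) {
	var lastErr error
	for _, node := range nodes {
		resp, err := query(node)
		if err == nil {
			return resp, nil
		}
		lastErr = fmt.Errorf("node %s: %w", node, err)
	}
	if lastErr == nil {
		lastErr = errors.New("no execution nodes to query")
	}
	return "", lastErr
}

func main() {
	// Only one candidate: the request fails with no chance to retry elsewhere.
	_, err := queryWithFallback([]string{"EN1"}, flakyQuery)
	fmt.Println("single candidate:", err)

	// Two candidates: the second node answers after the first fails.
	resp, err := queryWithFallback([]string{"EN1", "EN2"}, flakyQuery)
	fmt.Println("with fallback:", resp, err)
}
```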

ANs receive receipts from ENs as they execute blocks, and from the blocks themselves as they are received from consensus nodes. It's possible in some situations for an AN to have only a single receipt for a block in its store, so that situation should be handled.

peterargue avatar May 01 '24 00:05 peterargue

I think we should update the behavior when --preferred-execution-node-ids is set and fewer nodes are selected than the value defined at https://github.com/onflow/flow-go/blob/6f0e33aa35f007f6b03447e359ea1c0c52780ff6/engine/access/rpc/backend/node_communicator.go#L13 so that the list is padded up to 3 nodes using the following methods (in order):

  1. Use any EN with a receipt
  2. Use any preferred node not already selected
  3. Use any EN not already selected

This would ensure there are enough fallbacks to handle cases where ENs are unavailable.
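A rough sketch of the proposed padding, under stated assumptions: `padSelection` is a hypothetical helper over plain string IDs, `target` stands for the 3-node limit mentioned above, and candidates are taken in slice order rather than by any selection policy the real code might apply.

```go
package main

import "fmt"

// padSelection extends selected up to target entries, drawing candidates in
// the proposed order: ENs with a receipt, then preferred nodes, then any EN.
// Nodes that are already selected are skipped.
func padSelection(selected, withReceipt, preferred, all []string, target int) []string {
	out := append([]string(nil), selected...)
	seen := make(map[string]bool, len(out))
	for _, id := range out {
		seen[id] = true
	}
	for _, pool := range [][]string{withReceipt, preferred, all} {
		for _, id := range pool {
			if len(out) >= target {
				return out
			}
			if !seen[id] {
				seen[id] = true
				out = append(out, id)
			}
		}
	}
	return out
}

func main() {
	all := []string{"EN1", "EN2", "EN3", "EN4"}
	// Only EN1 was selected; pad with a preferred node, then any other EN.
	fmt.Println(padSelection([]string{"EN1"}, []string{"EN1"}, []string{"EN1", "EN2"}, all, 3))
}
```

If fewer than `target` distinct nodes exist in total, the result is simply shorter than `target`.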

peterargue avatar May 01 '24 00:05 peterargue

  • Use any EN with a receipt
  • Use any preferred node not already selected
  • Use any EN not already selected

Shouldn't the order be:

  1. Use any preferred node not already selected
  2. Use any EN with a receipt
  3. Use any EN not already selected

The operator wants the preferred nodes to be given more weight.

vishalchangrani avatar May 03 '24 16:05 vishalchangrani

My thinking is that "preferred" implies that the node will try to use these nodes if one of them has executed the block; otherwise it will use another node.

If we failed over to any preferred EN, I think we'd be more likely to see delays responding to queries when there are other ENs that have reported executing the block. I'm OK with either approach.

peterargue avatar May 06 '24 21:05 peterargue

  • Use any EN with a receipt
  • Use any preferred node not already selected
  • Use any EN not already selected

You are right - I mistakenly assumed preferred nodes would always be in the EN receipt. Good with the order you suggested.

One question though - is the AN capable of differentiating between EN responding with a not-found error versus an EN responding with any other error?

vishalchangrani avatar May 10 '24 21:05 vishalchangrani

One question though - is the AN capable of differentiating between EN responding with a not-found error versus an EN responding with any other error?

In some cases it does, but we can certainly add it where needed. Did you have a case in mind that should be checked?
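For illustration, the kind of check being asked about could look like the sketch below. It is stdlib-only and models the status code with a hypothetical `rpcError` type; none of these names come from flow-go. In real gRPC Go code the usual equivalent is comparing `status.Code(err)` against `codes.NotFound`.

```go
package main

import (
	"errors"
	"fmt"
)

// code is a hypothetical stand-in for a gRPC status code.
type code int

const (
	codeNotFound code = iota + 1
	codeResourceExhausted
)

// rpcError is a hypothetical error type carrying a status code.
type rpcError struct {
	Code code
	Msg  string
}

func (e *rpcError) Error() string { return e.Msg }

// isNotFound reports whether err (or an error it wraps) is a not-found
// response, as opposed to any other kind of failure.
func isNotFound(err error) bool {
	var re *rpcError
	return errors.As(err, &re) && re.Code == codeNotFound
}

func main() {
	notFound := fmt.Errorf("EN2: %w", &rpcError{Code: codeNotFound, Msg: "events not found"})
	exhausted := &rpcError{Code: codeResourceExhausted, Msg: "resource exhausted"}

	fmt.Println(isNotFound(notFound), isNotFound(exhausted), isNotFound(errors.New("timeout")))
}
```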

peterargue avatar May 10 '24 22:05 peterargue