flow-go icon indicating copy to clipboard operation
flow-go copied to clipboard

[Access] Ingestion engine may fail to create block/result index mapping

Open peterargue opened this issue 1 year ago • 1 comments

Problem Definition

The Access node's ingestion engine consumes block finalized events and indexes the sealed execution results contained in each block. These index entries are later used by the GetExecutionResultByBlockID endpoint. https://github.com/onflow/flow-go/blob/8b2113a309d2aa175d75a49fcee171083d63acac/engine/access/ingestion/engine.go#L418-L424

This indexing logic is triggered by new block finalized events. If an error is encountered before or during indexing, it simply logs the message and continues without retrying https://github.com/onflow/flow-go/blob/8b2113a309d2aa175d75a49fcee171083d63acac/engine/access/ingestion/engine.go#L333-L336

Additionally, if the node crashes or is stopped between when a block is finalized and the ingestion engine indexes the result, it will be missed. This "best effort" logic can result is missed index entries causing GetExecutionResultByBlockID to return NotFound for execution results it has in storage.

Proposed Solution

Refactor the ingestion engine to make use of a jobqueue to ensure that each block height is processed. This should handle the case when the node crashes or restarts and retry on error.

Definition of Done

  • The ingestion engine is refactored to ensure execution results are not missed even in the event of a crash.
  • Unittests are added to cover failure cases and ensure results are reliably indexed.

peterargue avatar Feb 20 '24 22:02 peterargue

QuickNode ran into this same issue recently with 2 nodes, but there were missing Tx Results because the block to collection mapping used to lookup which block a transaction is in was missing. that's also written in processFinalizedBlock here: https://github.com/onflow/flow-go/blob/00e4924e0d1b2d69223e5db722b8e6f15ce92399/engine/access/ingestion/engine.go#L403

peterargue avatar Mar 15 '24 23:03 peterargue

We should do https://github.com/onflow/flow-go/issues/5185 as part of this project

peterargue avatar Mar 26 '24 19:03 peterargue

@UlyanaAndrukhiv I looked at the code a bit more. I don't think we need a job queue for the receipts. They are indexed as they are received from ENs to give the node an early indicator of what the most recently executed block is. However, if any are missed, the consensus follower will index them as they are included in blocks. So no worry about missing receipts.

The two upgrades I think we should tackle as part of this:

  1. Use a job queue to track finalized blocks to index. This should do all of the logic in processFinalizedBlock, and ensure that all of the special indexes are written for each block.
  2. Use a ConsumerProgress entry to track the last full height instead of using a direct db (https://github.com/onflow/flow-go/issues/5185). This allows us to get rid of the special logic in storage.Block that's only used by ANs for this purpose.

Someday, it would be great to move collection tracking to a jobqueue, but I don't think we gain much by spending the time on that now. Especially since we will likely revisit collections in the future after implementing optimistic syncing for execution data.

1 & 2 can be done separately, so let's keep them in different PRs for simplicity.

peterargue avatar Apr 10 '24 00:04 peterargue