feat: migration ("re-indexing"), backfilling and diagnostics tooling for the `ChainIndexer`

Open aarshkshah1992 opened this issue 5 months ago • 8 comments

ChainIndexer Migration and Diagnostics Tooling

This PR implements the "migration" (really re-indexing/backfilling) and diagnostics tooling for the ChainIndexer introduced in PR #12450, and is part of the work for #12453. The tooling takes the form of both RPC APIs on the daemon and lotus-shed CLI commands.

Re-indexing Process

The re-indexing tool enables clients to index their entire existing ChainState in the ChainIndexer. This process is necessary due to the removal of the existing MsgIndex, EthTxIndex, and EventIndex from Lotus.

Why Re-index Instead of Migrate?

We've chosen to re-index rather than migrate data from existing indices for two primary reasons:

  1. Known issues: The existing indices have multiple known problems, and migrating could perpetuate incorrect index entries.
  2. Lack of garbage collection: Existing indices contain many entries for which the corresponding tipset messages/events no longer exist in the ChainStore due to splitstore GC.

Instead, we're re-indexing the ChainStore/ChainState on the node into the ChainIndexer. This ensures that every re-indexed entry has gone through the indexing logic of the new ChainIndexer, and that after re-indexing the Index reflects the actual contents of the ChainStore/ChainState.

Diagnostics Tooling

This PR introduces diagnostic tools for detecting corrupt Index entries at specific epochs or epoch ranges.

While this PR implements functionality for optionally backfilling missing Index entries, it does not yet include the capability to "repair" corrupted Index entries. The repair functionality will be introduced in a subsequent PR. This approach allows us to first gather and analyze user reports, helping us understand the types and causes of corrupted Index entries (and whether they occur at all in the new ChainIndexer) before implementing repair mechanisms.

Core API

The fundamental building block for this tooling is the following RPC API:

type IndexValidation struct {
	TipsetKey string
	Height    uint64

	TotalMessages  uint64
	TotalEvents    uint64
	EventsReverted bool

	Backfilled bool
}

func (si *SqliteIndexer) ChainValidateIndex(ctx context.Context, epoch abi.ChainEpoch, backfill bool) (*types.IndexValidation, error)

This API has the following features:

  • Optionally backfills the Index with a tipset on the canonical chain for the given epoch if it is absent in the Index
  • Returns some aggregated stats for an indexed entry for diagnostics/inspection
  • Reports errors/corrupted indexed entries at the given epoch. Forms of Index corruption that can be diagnosed include:
    • Presence of multiple non-reverted tipsets at the given epoch
    • Complete absence of a non-reverted tipset at the given epoch that does contain reverted tipsets
    • Mismatch between the Chainstore state and the Indexed entries (tipset messages/events)
    • Incorrect Indexing of null rounds at the given epoch
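The corruption categories listed above can be illustrated with a minimal Go sketch. It is not the ChainIndexer's actual validation code: the indexedTipset type, the verdict names, and the convention that an empty chainKey stands for a null round are all assumptions made for illustration, and the mismatch check compares only tipset keys rather than messages and events.

```go
package main

import "fmt"

// indexedTipset is a hypothetical, simplified view of one Index row at an epoch.
type indexedTipset struct {
	Key      string // tipset key as stored in the Index
	Reverted bool   // true if the tipset was marked reverted
}

// Verdict names are illustrative; they mirror the categories in the list above.
const (
	verdictOK               = "ok"
	verdictMissing          = "missing (backfill candidate)"
	verdictMultipleLive     = "multiple non-reverted tipsets"
	verdictOnlyReverted     = "only reverted tipsets"
	verdictMismatch         = "index does not match chainstore"
	verdictNullRoundIndexed = "null round has non-reverted entry"
)

// classifyEpoch compares the Index rows at one epoch against the canonical
// tipset key from the ChainStore. An empty chainKey stands for a null round
// (an assumption of this sketch).
func classifyEpoch(rows []indexedTipset, chainKey string) string {
	var live []indexedTipset
	for _, r := range rows {
		if !r.Reverted {
			live = append(live, r)
		}
	}
	switch {
	case chainKey == "":
		if len(live) > 0 {
			return verdictNullRoundIndexed
		}
		return verdictOK
	case len(live) > 1:
		return verdictMultipleLive
	case len(rows) == 0:
		return verdictMissing
	case len(live) == 0:
		return verdictOnlyReverted
	case live[0].Key != chainKey:
		return verdictMismatch
	}
	return verdictOK
}

func main() {
	rows := []indexedTipset{{Key: "A", Reverted: true}, {Key: "B", Reverted: false}}
	fmt.Println(classifyEpoch(rows, "B")) // one live row matching the chain
	fmt.Println(classifyEpoch(rows, ""))  // live row at a null round
	fmt.Println(classifyEpoch(nil, "B"))  // nothing indexed at this epoch
}
```

Only the "missing" verdict is something a backfill could fix; the other non-ok verdicts correspond to the corruption cases that this PR reports but does not repair.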

lotus-shed CLI tooling

The lotus-shed CLI tooling for both re-indexing/backfilling and diagnostics invokes this RPC API over epoch ranges. The corresponding lotus-shed backfill index [from, to] and lotus-shed inspect index [from, to] commands backfill or inspect/diagnose the Index for the given epoch range.
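A rough sketch of how a range command could drive the per-epoch API is shown here. It is not the lotus-shed implementation: the epochValidator interface, rangeReport, validateRange, and fakeIndex are hypothetical names, the epoch is a plain int64 instead of abi.ChainEpoch, and IndexValidation is trimmed to two fields. The iteration order and the choice to continue after a failed epoch are also assumptions.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// IndexValidation is trimmed to the fields this sketch reads.
type IndexValidation struct {
	Height     uint64
	Backfilled bool
}

// epochValidator is a hypothetical stand-in for the daemon RPC client; only
// the method shape follows ChainValidateIndex.
type epochValidator interface {
	ChainValidateIndex(ctx context.Context, epoch int64, backfill bool) (*IndexValidation, error)
}

// rangeReport summarises one pass over an epoch range.
type rangeReport struct {
	Checked    int
	Backfilled int
	Failed     map[int64]error
}

// validateRange calls ChainValidateIndex for every epoch in [from, to] and
// keeps going after a failure so that one bad epoch does not hide the others.
func validateRange(ctx context.Context, v epochValidator, from, to int64, backfill bool) (rangeReport, error) {
	rep := rangeReport{Failed: map[int64]error{}}
	for epoch := from; epoch <= to; epoch++ {
		if err := ctx.Err(); err != nil {
			return rep, err // stop early if the caller cancelled
		}
		res, err := v.ChainValidateIndex(ctx, epoch, backfill)
		rep.Checked++
		if err != nil {
			rep.Failed[epoch] = err
			continue
		}
		if res != nil && res.Backfilled {
			rep.Backfilled++
		}
	}
	return rep, nil
}

// fakeIndex is an in-memory validator used only to exercise the loop.
type fakeIndex struct {
	indexed map[int64]bool
	corrupt map[int64]bool
}

func (f *fakeIndex) ChainValidateIndex(_ context.Context, epoch int64, backfill bool) (*IndexValidation, error) {
	if f.corrupt[epoch] {
		return nil, errors.New("corrupt index entry")
	}
	if !f.indexed[epoch] {
		if !backfill {
			return nil, errors.New("missing index entry")
		}
		f.indexed[epoch] = true
		return &IndexValidation{Height: uint64(epoch), Backfilled: true}, nil
	}
	return &IndexValidation{Height: uint64(epoch)}, nil
}

// runDemo validates epochs 10..14 against a fake Index in which epoch 12 is
// missing and epoch 13 is corrupt.
func runDemo(backfill bool) rangeReport {
	idx := &fakeIndex{
		indexed: map[int64]bool{10: true, 11: true, 13: true, 14: true},
		corrupt: map[int64]bool{13: true},
	}
	rep, _ := validateRange(context.Background(), idx, 10, 14, backfill)
	return rep
}

func main() {
	rep := runDemo(true)
	fmt.Printf("checked=%d backfilled=%d failed=%d\n", rep.Checked, rep.Backfilled, len(rep.Failed))
}
```

With backfill enabled, the missing epoch is filled and only the corrupt epoch is reported; with backfill disabled, both are reported. This matches the split described above between a backfill command and an inspect command sharing one per-epoch API.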

TODO

  • [ ] automated tests

aarshkshah1992, Sep 11 '24 15:09