SPIKE: Make indexer 2.14 with local ledger more resilient in case of restarts.

Open urtho opened this issue 3 years ago • 3 comments

Problem

We've deployed 2.14.0rc3 on testnet on 25+ nodes. After two days of testing (including random restarts), we've observed the ledger cache getting ahead of the postgres ledger, which requires manual intervention. This happened on 3 different nodes.

2.14 is our first indexer with a local ledger. We skipped 2.12 and 2.13.

{"error":"MakeProcessorWithLedgerInit() err: InitializeLedger() simple catchup err: RunMigration() err: MakeProcessor() err: the ledger cache is ahead of the required round and must be re-initialized","level":"error","msg":"blockprocessor.MakeProcessor() err MakeProcessorWithLedgerInit() err: InitializeLedger() simple catchup err: RunMigration() err: MakeProcessor() err: the ledger cache is ahead of the required round and must be re-initialized","time":"2022-08-23T08:26:37Z"}

This is probably not fixable in general with the current approach, but maybe the "one block off" situation could be addressed.

Urgency

Not very urgent but all shutdowns were "clean" ones so statistically this is going to hurt.

Acceptance Criteria

Use the MaxAccountLookback in the ledger to fetch recent StateDelta objects.
If the local ledger is ahead of postgres, use the historic StateDelta instead of computing a new one.
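
The two criteria above amount to a startup decision based on how far the local ledger is from postgres. Below is a minimal sketch of that decision, not the indexer's actual code: `Action`, `decideRecovery`, and `roundsToReplay` are hypothetical names, and the lookback of 4 used in `main` is only an illustrative value. It assumes StateDelta objects are retrievable for any round within the lookback window.

```go
package main

import "fmt"

// Action is a hypothetical startup decision; none of these names come from the indexer.
type Action int

const (
	ComputeNext    Action = iota // ledger and postgres agree: process the next block normally
	CatchUpLedger                // ledger is behind postgres: ledger must catch up first
	ReplayHistoric               // ledger is ahead, within lookback: reuse stored StateDeltas
	Reinitialize                 // ledger is too far ahead: manual re-initialization needed
)

func (a Action) String() string {
	return [...]string{"ComputeNext", "CatchUpLedger", "ReplayHistoric", "Reinitialize"}[a]
}

// decideRecovery compares the latest round in the local ledger with the latest
// round committed to postgres. maxLookback stands in for MaxAccountLookback.
func decideRecovery(ledgerRound, dbRound, maxLookback uint64) Action {
	switch {
	case ledgerRound == dbRound:
		return ComputeNext
	case ledgerRound < dbRound:
		return CatchUpLedger
	case ledgerRound-dbRound <= maxLookback:
		return ReplayHistoric
	default:
		return Reinitialize
	}
}

// roundsToReplay lists the rounds whose historic StateDelta would be written
// to postgres, oldest first. It assumes ledgerRound >= dbRound.
func roundsToReplay(ledgerRound, dbRound uint64) []uint64 {
	rounds := make([]uint64, 0, ledgerRound-dbRound)
	for r := dbRound + 1; r <= ledgerRound; r++ {
		rounds = append(rounds, r)
	}
	return rounds
}

func main() {
	const lookback = 4 // illustrative value only
	// The "one block off" case: ledger at 101, postgres at 100.
	fmt.Println(decideRecovery(101, 100, lookback), roundsToReplay(101, 100))
	// Ledger further ahead than the lookback window.
	fmt.Println(decideRecovery(110, 100, lookback))
}
```

Under these assumptions, the "one block off" case becomes a replay of a single stored delta, and only a gap larger than the lookback window still ends in the re-initialization error.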

Aug 23 '22 08:08 urtho

@urtho How did you manually fix the issue without a full reset of the indexer?

May 31 '23 13:05 fabrice102

I just do a fast catchup from a matching catchpoint from this list https://algorand-catchpoints.s3.us-east-2.amazonaws.com/consolidated/mainnet_catchpoints.txt

So downtime is only 40 minutes.

May 31 '23 15:05 urtho

Thanks! Unfortunately, in my case, this did not work out: the indexer started indexing from the start again... I'm not completely sure why.

May 31 '23 17:05 fabrice102

@urtho post the Conduit split, is this still relevant?

Jun 27 '24 16:06 gmalouf

no longer relevant

Jun 27 '24 17:06 urtho