SPIKE: Make indexer 2.14 with local ledger more resilient in case of restarts.
Problem
We've deployed 2.14.0rc3 on testnet on 25+ nodes. After two days of testing (including random restarts), we've observed the ledger cache getting ahead of the postgres ledger, which requires manual intervention. This happened on 3 different nodes.
2.14 is our first indexer with local ledger; we skipped 2.12 and 2.13.
{"error":"MakeProcessorWithLedgerInit() err: InitializeLedger() simple catchup err: RunMigration() err: MakeProcessor() err: the ledger cache is ahead of the required round and must be re-initialized","level":"error","msg":"blockprocessor.MakeProcessor() err MakeProcessorWithLedgerInit() err: InitializeLedger() simple catchup err: RunMigration() err: MakeProcessor() err: the ledger cache is ahead of the required round and must be re-initialized","time":"2022-08-23T08:26:37Z"}
This is probably not generally fixable with the current approach, but maybe a "one block off" situation could be addressed.
Urgency
Not very urgent, but all shutdowns were "clean" ones, so statistically this is going to hurt.
Acceptance Criteria
- Use the `MaxAccountLookback` in the ledger to fetch recent StateDelta objects.
- If the local ledger is ahead of postgres, use the historic StateDelta instead of computing a new one.
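A rough sketch of how the two criteria above could fit together. This is not the indexer's actual code: `Ledger`, `StateDelta`, `DeltaForRound`, and the error values below are simplified, hypothetical stand-ins, and `Lookback` only plays the role that `MaxAccountLookback` has in the real ledger.

```go
package main

import (
	"errors"
	"fmt"
)

// StateDelta is a hypothetical stand-in for the ledger's per-round delta.
type StateDelta struct {
	Round uint64
}

// Ledger is a hypothetical, simplified view of the local ledger cache.
type Ledger struct {
	Latest   uint64                // last round the local ledger has processed
	Lookback uint64                // number of recent rounds whose deltas are retained
	deltas   map[uint64]StateDelta // retained deltas keyed by round
}

var (
	// ErrTooFarAhead means the requested round fell out of the lookback window.
	ErrTooFarAhead = errors.New("ledger cache is ahead beyond the lookback window; re-initialize")
	// ErrGap means the requested round is not the next one after Latest.
	ErrGap = errors.New("requested round skips ahead of the ledger")
)

// DeltaForRound returns the delta that postgres needs for the given round.
// If the ledger has already processed that round, the retained (historic)
// delta is reused; otherwise compute is called to evaluate the new round.
func (l *Ledger) DeltaForRound(round uint64, compute func(uint64) StateDelta) (StateDelta, error) {
	if round <= l.Latest {
		// Ledger is ahead of postgres: reuse the historic delta if still retained.
		d, ok := l.deltas[round]
		if !ok {
			return StateDelta{}, ErrTooFarAhead
		}
		return d, nil
	}
	if round != l.Latest+1 {
		return StateDelta{}, ErrGap
	}
	// Normal case: ledger and postgres are in step, so compute a new delta.
	d := compute(round)
	l.deltas[round] = d
	l.Latest = round
	if round > l.Lookback {
		// Drop the delta that just left the lookback window.
		delete(l.deltas, round-l.Lookback)
	}
	return d, nil
}

func main() {
	// The ledger has processed round 10, but postgres still needs round 10
	// (the "one block off" situation after a restart).
	l := &Ledger{Latest: 10, Lookback: 4, deltas: map[uint64]StateDelta{
		7: {Round: 7}, 8: {Round: 8}, 9: {Round: 9}, 10: {Round: 10},
	}}
	compute := func(r uint64) StateDelta { return StateDelta{Round: r} }

	d, err := l.DeltaForRound(10, compute)
	fmt.Println(d.Round, err)

	_, err = l.DeltaForRound(3, compute)
	fmt.Println(err)
}
```

With this shape, the "one block off" case after a restart is served from the retained deltas, and only a ledger that is ahead by more than the lookback window would still need to be re-initialized.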
@urtho How did you manually fix the issue without a full reset of the indexer?
I just do a fast catchup from a matching catchpoint from this list: https://algorand-catchpoints.s3.us-east-2.amazonaws.com/consolidated/mainnet_catchpoints.txt
So downtime is only 40 minutes.
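For anyone scripting this, here is a small sketch of picking a catchpoint from such a list. It assumes that "matching" means the latest catchpoint at or before the round postgres is at, and that each line is a label of the form `round#hash`. `pickCatchpoint` is a hypothetical helper, and the labels in `main` are made-up placeholders, not real catchpoints.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// pickCatchpoint scans a newline-separated list of "round#hash" labels and
// returns the label with the highest round that does not exceed maxRound.
// The second result is false when no label qualifies.
func pickCatchpoint(list string, maxRound uint64) (string, bool) {
	best := ""
	var bestRound uint64
	sc := bufio.NewScanner(strings.NewReader(list))
	for sc.Scan() {
		label := strings.TrimSpace(sc.Text())
		roundStr, _, ok := strings.Cut(label, "#")
		if !ok {
			continue // not a catchpoint label
		}
		r, err := strconv.ParseUint(roundStr, 10, 64)
		if err != nil || r > maxRound {
			continue // unparsable, or past the round postgres is at
		}
		if best == "" || r > bestRound {
			best, bestRound = label, r
		}
	}
	return best, best != ""
}

func main() {
	// Placeholder labels for illustration only.
	list := "100#EXAMPLEHASHA\n200#EXAMPLEHASHB\n300#EXAMPLEHASHC\n"
	label, ok := pickCatchpoint(list, 250)
	fmt.Println(label, ok)
}
```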
Thanks! Unfortunately, in my case, this did not work out: the indexer started indexing from the start again... I'm not completely sure why.
@urtho post the Conduit split, is this still relevant?
no longer relevant