indexer Running the pre-built indexer generates 'next round to account' error after a while without apparent reason

Description

I'm running the pre-built indexer (v2.15.3) on an AWS EC2 instance together with an archival node, and in the last 48h I've had the same issue twice: the indexer stops pushing data to the Postgres database (Timescale Cloud, so fully managed) with the following error:

{"error":"Process() handler err: AddBlock() err: TxWithRetry() err: attemptTx() err: AddBlock() adding block round 27670983 but next round to account is 27670982","level":"error","msg":"block 27670983 import failed","time":"2023-03-16T08:02:45Z"}
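For monitoring purposes, the round mismatch can be pulled out of a log line like the one above. The following is only a minimal sketch: it assumes the indexer writes one JSON object per line with an "error" field worded exactly as quoted here, and the function name parse_round_mismatch is hypothetical, not part of the indexer.

```python
import json
import re

# Matches the round-mismatch wording quoted in the log line above.
# Assumption: the message format is stable across log lines.
ROUND_MISMATCH = re.compile(
    r"adding block round (\d+) but next round to account is (\d+)"
)


def parse_round_mismatch(log_line):
    """Return (block_round, expected_round) if the line reports a
    round mismatch, otherwise None."""
    try:
        record = json.loads(log_line)
    except json.JSONDecodeError:
        return None  # not a JSON log line
    if not isinstance(record, dict):
        return None
    match = ROUND_MISMATCH.search(str(record.get("error", "")))
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# The log line from the report, copied verbatim.
line = (
    '{"error":"Process() handler err: AddBlock() err: TxWithRetry() err: '
    'attemptTx() err: AddBlock() adding block round 27670983 but next round '
    'to account is 27670982","level":"error","msg":"block 27670983 import '
    'failed","time":"2023-03-16T08:02:45Z"}'
)

result = parse_round_mismatch(line)
if result is not None:
    block_round, expected_round = result
    # A positive gap means the block being added is ahead of the round
    # the database expects next.
    print("block", block_round, "expected", expected_round,
          "gap", block_round - expected_round)
```

In this case the block being added is one round ahead of the round the database expects next, so a watchdog built on this could alert as soon as the first such line appears.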

The command I use to run the indexer is:

algorand-indexer daemon --data-dir /home/ubuntu/indexerdata -d /var/lib/algorand --postgres "$TIMESCALE_PROD"

The node itself is still running properly at that point; the output of goal node status -w 1000 is:

Last committed block: 27681419
Time since last block: 0.6s
Sync Time: 0.0s
Last consensus protocol: https://github.com/algorandfoundation/specs/tree/44fa607d6051730f5264526bf3c108d51f0eadb6
Next consensus protocol: https://github.com/algorandfoundation/specs/tree/44fa607d6051730f5264526bf3c108d51f0eadb6
Round for next consensus protocol: 27681420
Next consensus protocol supported: true
Last Catchpoint: 27680000#NA63SDQJD63NR3QPNC2NXYV6FPUJWHNJY6DDAMGURQ76CT2MYUUQ
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=

When I restart the indexer, all I get is the prompt to re-initialise the ledger:

{"error":"MakeProcessorWithLedgerInit() err: InitializeLedger() simple catchup err: RunMigration() err: MakeProcessor() err: the ledger cache is ahead of the required round and must be re-initialized","level":"error","msg":"blockprocessor.MakeProcessor() err MakeProcessorWithLedgerInit() err: InitializeLedger() simple catchup err: RunMigration() err: MakeProcessor() err: the ledger cache is ahead of the required round and must be re-initialized","time":"2023-03-16T08:06:41Z"}

Currently the only way I know to get it up and running again is by clearing out the indexer's data directory and starting sync again from the nearest catchpoint:

algorand-indexer daemon --data-dir /home/ubuntu/indexerdata -d /var/lib/algorand --postgres "$TIMESCALE_PROD" --catchpoint "27670000#74HTMMCL63E74B43FLS3LHHQRMDO54HTF6FKC2JZK3K3PXNY6ZYQ"

Is this a known issue? Can I somehow make the indexer more robust to catch these kinds of issues?

As this issue seems to be fully indexer related (unless I'm missing something here), I thought it might be good to discuss this here. We specifically use the provided indexer so we don't have to write our own code and can rely on the stability provided out of the box, so looking forward to solving this!

Our environment

Software version: 3.14.2.stable
Node status: see above
Indexer version: 2.15.3
Server: AWS EC2 c5.large running Ubuntu 20.04
Postgres: Timescale Cloud (v2.10.0) running Postgres v14.7

Steps to reproduce

Unknown, but it seems to have been happening a lot over the past week.

Mar 16 '23 09:03 MatteusDeloge

I believe this sort of thing may happen if you have multiple Indexer writers running at the same time. Resetting the data directory is the right way to recover.
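One rough way to check for a second writer on the same machine is to count running indexer daemon processes. This is a sketch under stated assumptions: it only sees processes on the local host (a writer on another machine pointed at the same database would not show up), it assumes the binary is named algorand-indexer as in the command above, and the helper names are hypothetical.

```python
import os
import subprocess


def count_indexer_daemons(cmdlines):
    """Count command lines that look like 'algorand-indexer daemon ...'."""
    count = 0
    for cmdline in cmdlines:
        argv = cmdline.split()
        if not argv:
            continue
        # Compare the basename so an absolute path to the binary also matches.
        is_indexer = os.path.basename(argv[0]) == "algorand-indexer"
        if is_indexer and "daemon" in argv[1:]:
            count += 1
    return count


def local_cmdlines():
    """Return the command line of every process on this host (Linux ps)."""
    out = subprocess.run(
        ["ps", "-eo", "args="],  # 'args=' prints the command without a header
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return out.splitlines()
```

Calling count_indexer_daemons(local_cmdlines()) and getting a value above 1 would suggest more than one daemon on this host; a value of 1 does not rule out a writer running elsewhere.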

Mar 23 '23 15:03 winder

Normally I just have a single writer in a single process, so I'm not really sure that is the case here. If it is, then it seems the problem exists within this version of the indexer.

Mar 28 '23 11:03 MatteusDeloge

I'm unable to reproduce the error with the v2.15.3 indexer. I recommend switching to Conduit, and you won't get this error.

May 16 '23 20:05 shiqizng

Indexer 2.x was retired in 2023.

May 23 '24 15:05 gmalouf