aligned_layer icon indicating copy to clipboard operation
aligned_layer copied to clipboard

fix(aggregator): (WIP) fetch task on task response if not cached

Open JulianVentura opened this issue 1 year ago • 2 comments

Aggregator fetch missed batches

Description

This PR adds a fix so the aggregator is able to recover and process missed batches when an unknown response from an operator is received.

Type of change

  • [x] Bug fix

Checklist

  • [ ] “Hotfix” to testnet, everything else to staging
  • [x] Linked to Github Issue
  • [ ] This change depends on code or research by an external entity
    • [ ] Acknowledgements were updated to give credit
  • [ ] Unit tests added
  • [ ] This change requires new documentation.
    • [ ] Documentation has been added/updated.
  • [ ] This change is an Optimization
    • [ ] Benchmarks added/run
  • [x] Has a known issue
    • #1350
  • [ ] If your PR changes the Operator compatibility (Ex: Upgrade prover versions)
    • [ ] This PR adds compatibility for operator for both versions and do not change batcher/docs/examples
    • [ ] This PR updates batcher and docs/examples to the newer version. This requires the operator are already updated to be compatible

Basic flow

On recover, if the aggregator has missed some new batches, the operators will process them and start sending signed responses to the aggregator. The main idea of this PR is that once this happens, the aggregator should first check if the received task is known to him (internal map) and if it isn't, try to fetch if from Ethereum logs. This should be done in a retry fashion, since network may be congested and events may take longer to arrive to certain RPC nodes.

Retry logic

After a response is received, the aggregator will check for the corresponding task in its internal map with 3 retries, waiting 1 sec, 2 sec, and 4 sec, respectively. If the task is not found in internal maps, it will try to fetch logs. While doing so, some calls will be made to Ethereum, each of them having its own retry logic:

  • Get the current block number: Retry times (3 retries): 1 sec, 2 sec, 4 sec.
  • Filter batch: Retry times (3 retries): 1 sec, 2 sec, 4 sec.
  • Get batch state: Retry times (3 retries): 12 sec (1 Blocks), 24 sec (2 Blocks), 48 sec (4 Blocks)

How to test

  1. Start all services, including a local explorer and 3 operators.
  2. Shut down the aggregator.
  3. Start sending proofs with make batcher_send_infinite_sp1.
  4. Wait a few minutes, see how the batches start to accumulate on the explorer as Pending, the more you wait, the better.
  5. Start the aggregator.
  6. You should see how some of the missed batches get verified, as well as the new ones.

You should bump the operator MaxRetries under rpc_client.go so the operators keep retrying sending responses to the aggregator.

You may also modify the pending_batch_fetch_block_range config value under config-files/config-aggregator.yaml to test for different scenarios. On the same file, you may also bump garbage_collector_period or garbage_collector_tasks_age so batches are not removed before they get verified.

We should stress test this PR so we are sure that no concurrency bug is possible, and for that you should try with the following:

  • Use as many operators as possible
  • When aggregator is down, wait as much time as possible, so many batches accumulate ~50 or more.
  • Modify the operator's retries variables so they retry indefinitely and very rapidly (try with wait times below 1 second).
  • Set the pending_batch_fetch_block_range variable to a large number, so all batches are fetched.

Closes #1350

JulianVentura avatar Nov 04 '24 16:11 JulianVentura

Looks good, need to test later.

Oppen avatar Nov 05 '24 22:11 Oppen

Need one more test and a review

MauroToscano avatar Nov 13 '24 17:11 MauroToscano