fix(aggregator): (WIP) fetch task on task response if not cached
Aggregator fetches missed batches
Description
This PR adds a fix so the aggregator is able to recover and process missed batches when it receives an operator response for a task it does not have cached.
Type of change
- [x] Bug fix
Checklist
- [ ] “Hotfix” to testnet, everything else to staging
- [x] Linked to GitHub Issue
- [ ] This change depends on code or research by an external entity
- [ ] Acknowledgements were updated to give credit
- [ ] Unit tests added
- [ ] This change requires new documentation.
- [ ] Documentation has been added/updated.
- [ ] This change is an Optimization
- [ ] Benchmarks added/run
- [x] Has a known issue
  - #1350
- [ ] If your PR changes the Operator compatibility (Ex: Upgrade prover versions)
  - [ ] This PR adds compatibility for the operator for both versions and does not change batcher/docs/examples
  - [ ] This PR updates the batcher and docs/examples to the newer version. This requires that operators are already updated to be compatible
Basic flow
On recovery, if the aggregator has missed some new batches, the operators will process them and start sending signed responses to the aggregator. The main idea of this PR is that once this happens, the aggregator should first check whether the received task is known to it (in its internal map) and, if it isn't, try to fetch it from Ethereum logs. This should be done with retries, since the network may be congested and events may take longer to arrive at certain RPC nodes.
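A minimal sketch of this flow in Go, assuming hypothetical names (`Aggregator`, `ProcessTaskResponse`, `fetchTaskFromLogs`); the actual types and method names in the aggregator code differ:

```go
package aggregator

import (
	"fmt"
	"sync"
)

// Hypothetical types for illustration only; the real aggregator defines its own.
type BatchIdentifier [32]byte

type Task struct {
	MerkleRoot [32]byte
}

type SignedTaskResponse struct {
	BatchIdentifier BatchIdentifier
	// BLS signature fields omitted.
}

type Aggregator struct {
	tasksMutex sync.Mutex
	tasks      map[BatchIdentifier]Task

	// fetchTaskFromLogs recovers a missed batch from Ethereum logs, with retries.
	fetchTaskFromLogs func(BatchIdentifier) (Task, error)
	// aggregateSignature adds the operator signature to the task's quorum.
	aggregateSignature func(Task, SignedTaskResponse) error
}

// ProcessTaskResponse sketches the recovery path: look the task up in the
// internal map first and, if it is unknown, fall back to Ethereum logs.
func (a *Aggregator) ProcessTaskResponse(resp SignedTaskResponse) error {
	a.tasksMutex.Lock()
	task, known := a.tasks[resp.BatchIdentifier]
	a.tasksMutex.Unlock()

	if !known {
		// The batch was created while the aggregator was down:
		// try to recover it from on-chain logs before giving up.
		recovered, err := a.fetchTaskFromLogs(resp.BatchIdentifier)
		if err != nil {
			return fmt.Errorf("task not found in internal map nor in logs: %w", err)
		}
		task = recovered

		a.tasksMutex.Lock()
		a.tasks[resp.BatchIdentifier] = task
		a.tasksMutex.Unlock()
	}

	return a.aggregateSignature(task, resp)
}
```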
Retry logic
After a response is received, the aggregator will check for the corresponding task in its internal map with 3 retries, waiting 1 sec, 2 sec, and 4 sec, respectively. If the task is not found in the internal map, it will try to fetch it from the logs. While doing so, several calls are made to Ethereum, each with its own retry logic (see the backoff sketch after this list):
- Get the current block number: 3 retries, waiting 1 sec, 2 sec, 4 sec.
- Filter batch: 3 retries, waiting 1 sec, 2 sec, 4 sec.
- Get batch state: 3 retries, waiting 12 sec (1 block), 24 sec (2 blocks), 48 sec (4 blocks).
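A minimal sketch of such an exponential-backoff helper in Go, under the assumption that each call is wrapped in a generic retry function; the actual retry implementation in the repo may differ:

```go
package retry

import "time"

// WithExponentialBackoff calls fn once and retries it up to maxRetries times,
// doubling the wait between attempts. With maxRetries = 3 and an initial wait
// of 1 second this gives the 1 sec, 2 sec, 4 sec schedule above; an initial
// wait of 12 seconds gives the block-paced 12 sec, 24 sec, 48 sec schedule.
func WithExponentialBackoff[T any](fn func() (T, error), maxRetries int, initialWait time.Duration) (T, error) {
	var result T
	var err error

	wait := initialWait
	for attempt := 0; attempt <= maxRetries; attempt++ {
		result, err = fn()
		if err == nil {
			return result, nil
		}
		if attempt < maxRetries {
			time.Sleep(wait)
			wait *= 2
		}
	}
	return result, err
}
```

As a hypothetical call site (not the repo's actual code), the current block number lookup could be wrapped as `WithExponentialBackoff(func() (uint64, error) { return client.BlockNumber(ctx) }, 3, time.Second)`.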
How to test
- Start all services, including a local explorer and 3 operators.
- Shut down the aggregator.
- Start sending proofs with `make batcher_send_infinite_sp1`.
- Wait a few minutes and watch the batches start to accumulate on the explorer as `Pending`; the more you wait, the better.
- Start the aggregator.
- You should see how some of the missed batches get verified, as well as the new ones.
You should bump the operator `MaxRetries` in `rpc_client.go` so the operators keep retrying to send responses to the aggregator.
You may also modify the `pending_batch_fetch_block_range` config value in `config-files/config-aggregator.yaml` to test different scenarios. In the same file, you may also bump `garbage_collector_period` or `garbage_collector_tasks_age` so batches are not removed before they get verified.
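For reference, a hypothetical excerpt of `config-files/config-aggregator.yaml` with those values bumped for testing; the exact nesting, units, and values are assumptions, not the real file:

```yaml
# Hypothetical excerpt; the real config-aggregator.yaml layout may differ.
aggregator:
  # How many blocks back to scan when fetching missed (pending) batches.
  pending_batch_fetch_block_range: 5000
  # Run the garbage collector less often and keep tasks around longer,
  # so missed batches are not removed before they get verified.
  garbage_collector_period: 2h
  garbage_collector_tasks_age: 100000
```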
We should stress test this PR to be sure that no concurrency bugs are possible. For that, try the following:
- Use as many operators as possible
- While the aggregator is down, wait as long as possible so that many batches accumulate (~50 or more).
- Modify the operators' retry variables so they retry indefinitely and very rapidly (try wait times below 1 second).
- Set the `pending_batch_fetch_block_range` variable to a large number, so all batches are fetched.
Closes #1350
Looks good, need to test later.
Need one more test and a review