
As the Connext protocol, we need a cronjob or polling loop for relayer tasks


Problem

In the Interim AMB Messaging flow, the Relayer agent is relied upon to execute the flow's on-chain calls.

Two components to this problem:

  1. Relayers need to be told what to do and when to do it.
  2. Relayers need to be funded.

Ideas to solve this

For problem (1) we should write a backend task to occasionally ping the relayer to do certain scheduled activities. We can schedule the activities up front using a Gelato Task in the on-chain contracts, or we could just use the relayer API - either way is fine. This should be part of Lighthouse - it aligns well with Lighthouse's responsibilities and doubles up on our current infra (so we don't have to bootstrap yet another agent).

For (2) we'll need to have a Gelato Vault regularly funded. This is fairly straightforward in and of itself, but there's a bit of complexity here since we're obviously going to be doing things on multiple chains. We need to have some sort of alert system set up to let us know when we need to refill the Vault.

Acceptance Criteria

5 Subtasks:

  • [ ] A. Root Caching

    • First and most basic task!
    • Lighthouse should have a polling loop that checks the Connector contracts on every chain (including Eth Mainnet / Hub) to retrieve the current outboundRoot and aggregateRoot.
    • Compare these roots to what is currently stored in Carto DB. We're going to use this comparison to determine whether action is required:
      • If the outboundRoot has been updated, go to Subtask B.
      • If the aggregateRoot has been updated on a <Domain>Connector contract on the Hub chain / Eth Mainnet go to Subtask C.
      • If the aggregateRoot has been updated on a <Domain>Connector contract on that Domain / Spoke chain / L2 go to Subtask D.
    • Finally, update Carto DB with the new outboundRoot and aggregateRoot - these values should be indexed by domain ID in the database! (A minimal polling-loop sketch for this subtask is included after this list.)
  • [ ] B. sendRoot (see Step 2 in the Diagram)

    • On Chain X: the user calls Connext.xcall => Connext.xcall calls Connector.dispatch => a new outboundRoot is added in the Connector
    • Lighthouse, in the polling loop from Subtask A, detects that the outboundRoot has been updated on Chain X.
    • Lighthouse queries Carto DB: is there a current ACTIVE OutboundBatch (or DispatchBatch, whatever you want to call it) for Chain X?
      • YES: Do nothing!
      • NO: Lighthouse should create a new OutboundBatch row in the DB. It should have a timestamp for when the batch started, along with any other info... but do NOT store the outboundRoot! In fact, not a whole lot of info needs to be stored here... afaik just the timestamp is necessary.
    • LATER...
      • Lighthouse polling loop queries Carto DB to get the active OutboundBatches (for each chain, if any). It checks to see if enough time has elapsed to send each Batch.
      • If enough time has elapsed (e.g. 2 hours), Lighthouse formats calldata for Connector.sendRoot (fairly simple, no arguments involved), then sends the calldata in an API request to Gelato relayers. (A sketch of this batch-dispatch step, shared with Subtask C, follows the list.)
      • Gelato relayer calls Connector.sendRoot, root is sent in AMB message.
  • [ ] C. propagate (see Step 3 in the Diagram)

    • Context: Ethereum Mainnet will serve as our "Hub chain" for prod. Here, inbound roots from all chains will be aggregated and then published to all other chains.
    • Lighthouse, in the polling loop from Subtask A, detects an updated aggregateRoot on one of the Connectors on the hub chain after it comes through the AMB.
    • Lighthouse queries Carto DB: is there a current ACTIVE PropagateBatch in the DB?
      • YES: Do nothing!
      • NO: Create a new PropagateBatch in DB, w/ marked timestamp (NOT indexed by chain, because it should know it's only for the hub).
    • LATER...
      • After an extended period of time (like 8 hours.... ~see NOTE 1 below for why this should be longer than Subtask B's window!), the Lighthouse polling loop detects the active PropagateBatch is ready to send (sufficient time has elapsed), and formats RootManager.propagate calldata (again, no method arguments, really basic call), then sends it to the Gelato API. (Same pattern as the batch-dispatch sketch after this list.)
      • Gelato relayer calls RootManager.propagate, root is aggregated from all chains and sent out.
  • [ ] D. proveAndProcess (see Step 4 in the Diagram)

    • This is the equivalent of the Nomad handle method call (sort of... more like the equivalent of that step, at least...)
    • In Subtask A's polling loop Lighthouse detects that, in a Connector contract on Spoke Chain/L2 Y, the aggregateRoot has been updated.
    • When there's a root update, we need to detect which messages have been "delivered" to this chain. We'll need an index of all inflight messages in the DB to manage this! We'll want to grab all messages that are inflight / haven't been delivered yet and whose destination is Chain Y!
    • Batch those messages together for delivery (assuming there are any... see NOTE 2 below), along with their proofs (leaves in the root). Format calldata for proveAndProcess.
    • Estimate gas for the call: if the gas amount is > 10k, split up those messages into multiple batches!!! (Make sure to set the from address in the tx request for the eth_estimateGas call to the relayer address, as the call should be permissioned in prod.) See the batching sketch after this list.
    • Then IMMEDIATELY send the call(s) off to Gelato via API (why wait?).
    • Gelato relayer calls Connector.proveAndProcess for the message batch. All the messages are handled.
  • [ ] E: Gelato Vault Management

    • This task should be fairly straightforward. Set up a Gelato vault for paying Relayers on all the chains we support, however that works. Ideally, have a script to bootstrap that, maybe?
    • Configure the vault in Lighthouse, assuming it needs to be referenced in the API call.
    • Actually fund the vault with ETH, MATIC, whatever. (A balance-alert sketch follows this list.)
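
For reference, here's a rough TypeScript sketch of the Subtask A polling loop (ethers v5). The Connector getter names, the RootStore interface over Carto DB, and the per-subtask handler stubs are all assumptions for illustration - the real contracts and schema may differ:

```typescript
import { Contract, providers } from "ethers";

// Assumed read-only ABI fragment; the real Connector interface may differ.
const CONNECTOR_ABI = [
  "function outboundRoot() view returns (bytes32)",
  "function aggregateRoot() view returns (bytes32)",
];

type ChainConfig = { domain: string; rpcUrl: string; connector: string; isHub: boolean };

// Hypothetical Carto DB accessors, keyed by domain ID.
interface RootStore {
  getRoots(domain: string): Promise<{ outboundRoot?: string; aggregateRoot?: string }>;
  saveRoots(domain: string, roots: { outboundRoot: string; aggregateRoot: string }): Promise<void>;
}

export async function pollRoots(
  chains: ChainConfig[],
  db: RootStore,
  handlers: {
    onOutboundRootUpdate: (domain: string) => Promise<void>; // Subtask B
    onHubAggregateRootUpdate: (domain: string) => Promise<void>; // Subtask C
    onSpokeAggregateRootUpdate: (domain: string) => Promise<void>; // Subtask D
  },
): Promise<void> {
  for (const chain of chains) {
    const provider = new providers.JsonRpcProvider(chain.rpcUrl);
    const connector = new Contract(chain.connector, CONNECTOR_ABI, provider);

    // Read the current roots on-chain and the last-seen roots from the DB.
    const [outboundRoot, aggregateRoot] = await Promise.all([
      connector.outboundRoot(),
      connector.aggregateRoot(),
    ]);
    const cached = await db.getRoots(chain.domain);

    if (cached.outboundRoot !== outboundRoot) {
      await handlers.onOutboundRootUpdate(chain.domain);
    }
    if (cached.aggregateRoot !== aggregateRoot) {
      if (chain.isHub) await handlers.onHubAggregateRootUpdate(chain.domain);
      else await handlers.onSpokeAggregateRootUpdate(chain.domain);
    }

    // Finally, persist the latest roots, indexed by domain ID.
    await db.saveRoots(chain.domain, { outboundRoot, aggregateRoot });
  }
}
```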
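
And a sketch of the shared "LATER..." half of Subtasks B and C: age out active batches, encode the zero-argument call, and hand it to the relayer. The batch shape and sendToGelato are hypothetical stand-ins for whatever relayer API we end up calling, not the actual Gelato SDK:

```typescript
import { utils } from "ethers";

type Batch = { id: string; domain: string; startedAt: number }; // startedAt in ms

// Hypothetical helper wrapping whatever relayer API we end up using.
declare function sendToGelato(req: { chainId: number; target: string; data: string }): Promise<void>;

// sendRoot() and propagate() both take no arguments, so encoding is trivial.
const connectorIface = new utils.Interface(["function sendRoot()"]);
const rootManagerIface = new utils.Interface(["function propagate()"]);

export async function dispatchExpiredBatches(
  batches: Batch[],
  windowMs: number, // e.g. ~2h for OutboundBatches, ~8h for the PropagateBatch
  target: { chainId: number; address: string },
  kind: "sendRoot" | "propagate",
): Promise<void> {
  const now = Date.now();
  for (const batch of batches) {
    if (now - batch.startedAt < windowMs) continue; // not old enough to send yet

    const data =
      kind === "sendRoot"
        ? connectorIface.encodeFunctionData("sendRoot")
        : rootManagerIface.encodeFunctionData("propagate");

    await sendToGelato({ chainId: target.chainId, target: target.address, data });
    // The batch would then be marked inactive in Carto DB (omitted here).
  }
}
```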
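
For Subtask D, a sketch of the gas-estimate-driven batching. The placeholder proveAndProcess(bytes[]) signature is illustration only (the real call will also carry proofs), and the 10k threshold is just the number quoted in this issue:

```typescript
import { BigNumber, providers, utils } from "ethers";

// Placeholder interface; the real proveAndProcess also takes the message proofs.
const connectorIface = new utils.Interface(["function proveAndProcess(bytes[] messages)"]);

const GAS_LIMIT_PER_BATCH = BigNumber.from(10_000); // threshold named in this issue

export async function buildProveAndProcessBatches(
  provider: providers.JsonRpcProvider,
  connector: string,
  relayer: string, // `from` must be the relayer, since the call should be permissioned in prod
  messages: string[], // pending messages destined for this domain (hex-encoded)
): Promise<string[]> {
  if (messages.length === 0) return []; // nothing delivered to this chain (see NOTE 2)

  const data = connectorIface.encodeFunctionData("proveAndProcess", [messages]);
  const gas = await provider.estimateGas({ to: connector, from: relayer, data });

  if (gas.lte(GAS_LIMIT_PER_BATCH) || messages.length === 1) {
    return [data]; // fits in one batch (or can't be split further)
  }

  // Too expensive: split the message set in half and recurse.
  const mid = Math.ceil(messages.length / 2);
  const left = await buildProveAndProcessBatches(provider, connector, relayer, messages.slice(0, mid));
  const right = await buildProveAndProcessBatches(provider, connector, relayer, messages.slice(mid));
  return [...left, ...right];
}
```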
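
And for Subtask E, a sketch of the low-balance alerting: check the native balance of whatever address funds relayer payments on each chain and fire an alert when it dips below a threshold. The vault-address config and sendAlert hook are assumptions; how Gelato's payment vault actually works per chain still needs to be confirmed:

```typescript
import { BigNumber, providers, utils } from "ethers";

type VaultConfig = {
  chainId: number;
  rpcUrl: string;
  vault: string; // address that pays relayers on this chain (assumed)
  minBalance: string; // refill threshold in native units, e.g. "0.5"
};

// Hypothetical alert hook (Discord webhook, PagerDuty, etc.).
declare function sendAlert(message: string): Promise<void>;

export async function checkVaultBalances(vaults: VaultConfig[]): Promise<void> {
  for (const v of vaults) {
    const provider = new providers.JsonRpcProvider(v.rpcUrl);
    const balance: BigNumber = await provider.getBalance(v.vault);
    const threshold = utils.parseEther(v.minBalance);

    if (balance.lt(threshold)) {
      await sendAlert(
        `Relayer vault ${v.vault} on chain ${v.chainId} is low: ` +
          `${utils.formatEther(balance)} native remaining (threshold ${v.minBalance}).`,
      );
    }
  }
}
```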

IMPORTANT NOTES FOR YOUR CONSIDERATION

NOTE 1: There is a bit of a race condition / optimization problem I suspect we'll run into with Subtasks B and C above.

  • Let's say User A is bridging from Opti => Polygon, and User B is bridging Mainnet => Polygon.
  • User A sends the xcall and the outboundRoot gets updated on Opti. After an hour, the outboundRoot gets sent out.
  • The outboundRoot from Opti currently takes 7 days to arrive through the AMB.
  • After 6.9 days elapse, User B by coincidence sends their xcall, which updates the outboundRoot for mainnet.
  • The outboundRoot for mainnet gets sent out an hour later - right before the inbound root from Opti was going to arrive!
  • Had we waited just a little longer, we could have aggregated the root on mainnet to include User A's call, and sent both to Polygon.

Annoying. Maybe this can be patched a bit by just making the window for Subtask C (the hub's propagate) longer than the one for Subtask B (the spokes' sendRoot). I.e. roots on mainnet (the hub) take 8 hours before being sent out, whereas outboundRoots on other networks take only 1 hour. This would also make sense economically, since ETH is expensive!

Ideally we would have these window lengths weighted by native token price, gas price, and batch size (a constraint problem), e.g. wait just a little longer than an hour if there are only 1 or 2 txs in a batch... but it's probably sufficient to stick to estimates for the time being. A rough heuristic is sketched below.
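
To make that concrete, one possible (purely illustrative) heuristic - the constants and formula below are made up, not anything specified here:

```typescript
// Illustrative heuristic only: stretch the dispatch window when the batch is small
// or the chain is expensive, capped at some maximum. All constants are made up.
export function batchWindowMs(params: {
  baseWindowMs: number; // e.g. ~1h on spokes, ~8h on the hub
  maxWindowMs: number; // hard cap so messages are never stuck too long
  batchSize: number; // number of txs waiting in the batch
  gasPriceGwei: number; // current gas price on the target chain
  nativeTokenUsd: number; // price of the chain's native token
}): number {
  const { baseWindowMs, maxWindowMs, batchSize, gasPriceGwei, nativeTokenUsd } = params;

  // Rough per-dispatch cost signal: expensive chains => wait longer to amortize.
  const costFactor = Math.log10(1 + gasPriceGwei * nativeTokenUsd) / 2;

  // Small batches => wait a bit longer; large batches => send closer to the base window.
  const sizeFactor = 1 / Math.sqrt(Math.max(batchSize, 1));

  const window = baseWindowMs * (1 + costFactor * sizeFactor);
  return Math.min(window, maxWindowMs);
}
```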


NOTE 2: It is possible for no messages to be delivered to a chain. In the current design, we actually just spam aggregate root updates out to all chains: even if there are no messages intended for Chain X, it will still receive the updated aggregate root. Kind of annoying. Maybe we can fix this?


NOTE 3: Do all AMBs automatically handle calling Connector.receiveRoot for us? That is the xchain calldata we are sending through the AMB, but the question here is whether the AMB just makes that "verified call" available or has a relayer system for automatically calling it for us..... does this vary by chain?? Will we need to get relayers to do this in some cases??

jakekidd · Aug 12 '22