
L1 DTL sync can fail when pointing at nodes that don't cache events

Open Inphi opened this issue 3 years ago • 8 comments

Describe the bug

When DATA_TRANSPORT_LAYER__SYNC_FROM_L1=true, the DTL can stall and fail to sync new transactions. We see the following error in the logs:

Error: bad response (status=502, headers={"content-length":"0","connection":"close","date":"Sun, 29 May 2022 17:34:38 GMT","via":"kong/2.8.1.0-enterprise-edition"}, body=null, requestBody="{\"method\":\"eth_getLogs\",\"params\":[{\"fromBlock\":\"0xcf7732\",\"toBlock\":\"0xde7af2\",\"address\":\"0xde1fcfb0851916ca5101820a69b13a4e276bd81f\",\"topics\":[\"0x9416a153a346f93d95f94b064ae3f148b6460473c6e82b3f9fc2521b873fcd6c\",\"0x02b616af23339f1e031e76333e2d5b1c3067beb78578c961911872cc2127ef8b\"]}],\"id\":650,\"jsonrpc\":\"2.0\"}", requestMethod="POST", url="https://venasaur.com/", code=SERVER_ERROR, version=web/5.5.1)

The RPC request seems to originate from this line. The DTL queries eth_getLogs to retrieve events emitted by the AddressManager contract. However, the DTL builds this query with a block range whose starting block is always the configured DATA_TRANSPORT_LAYER__L1_START_HEIGHT. Because the DTL syncs close to the tip, the JSON-RPC response for such a request can be enormous; in the above log, the block range spans roughly 1 million blocks. This most likely explains the HTTP 5xx error returned by the backend. The problem is compounded because the DTL retries the same large-range query during each loop of the l1-ingestion service.

Desired Outcome

The DTL shouldn't resync logs starting from DATA_TRANSPORT_LAYER__L1_START_HEIGHT every time. It should remember how far previous queries have already scanned and advance the fromBlock parameter of eth_getLogs on successive syncs.
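The desired behavior could be sketched roughly as follows. This is illustrative only: `SyncState`, `nextLogRange`, and the parameter names are hypothetical and do not reflect the actual DTL API, but they capture the idea of persisting a high-water mark instead of always starting from the configured height.

```typescript
// Hypothetical sketch: persist the highest block already scanned so that
// each sync loop only queries the delta, instead of re-reading events from
// DATA_TRANSPORT_LAYER__L1_START_HEIGHT on every iteration.
interface SyncState {
  highestScannedBlock: number // would live in the DTL's database
}

const nextLogRange = (
  state: SyncState,
  configuredStartHeight: number,
  currentTip: number
): { fromBlock: number; toBlock: number } => {
  // Resume from the block after the last one scanned; fall back to the
  // configured start height on a fresh database.
  const fromBlock = Math.max(state.highestScannedBlock + 1, configuredStartHeight)
  return { fromBlock, toBlock: currentTip }
}
```

After each successful eth_getLogs call, `highestScannedBlock` would be updated to the `toBlock` that was just queried, keeping subsequent ranges small.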

Inphi avatar Jun 02 '22 01:06 Inphi

cc: @smartcontracts

Inphi avatar Jun 02 '22 01:06 Inphi

Can confirm having the same issue, resulting in stale sync:

{"level":50,"time":1655037518661,"extra":{"message":"Error: missing response (requestBody=\"{\\\"method\\\":\\\"eth_getLogs\\\",\\\"params\\\":[{\\\"fromBlock\\\":\\\"0xcf7732\\\",\\\"toBlock\\\":\\\"0xe41f04\\\",\\\"address\\\":\\\"0xde1fcfb0851916ca5101820a69b13a4e276bd81f\\\",\\\"topics\\\":[\\\"0x9416a153a346f93d95f94b064ae3f148b6460473c6e82b3f9fc2521b873fcd6c\\\",\\\"0x02b616af23339f1e031e76333e2d5b1c3067beb78578c961911872cc2127ef8b\\\"]}],\\\"id\\\":50,\\\"jsonrpc\\\":\\\"2.0\\\"}\", requestMethod=\"POST\", serverError={\"code\":\"ECONNRESET\"}, url=\"http://localhost:8545\", code=SERVER_ERROR, version=web/5.6.1)","stack":"Error: missing response (requestBody=\"{\\\"method\\\":\\\"eth_getLogs\\\",\\\"params\\\":[{\\\"fromBlock\\\":\\\"0xcf7732\\\",\\\"toBlock\\\":\\\"0xe41f04\\\",\\\"address\\\":\\\"0xde1fcfb0851916ca5101820a69b13a4e276bd81f\\\",\\\"topics\\\":[\\\"0x9416a153a346f93d95f94b064ae3f148b6460473c6e82b3f9fc2521b873fcd6c\\\",\\\"0x02b616af23339f1e031e76333e2d5b1c3067beb78578c961911872cc2127ef8b\\\"]}],\\\"id\\\":50,\\\"jsonrpc\\\":\\\"2.0\\\"}\", requestMethod=\"POST\", serverError={\"code\":\"ECONNRESET\"}, url=\"http://localhost:8545\", code=SERVER_ERROR, version=web/5.6.1)\n    at Logger.makeError (/home/optimism/optimism/node_modules/@ethersproject/providers/node_modules/@ethersproject/logger/src.ts/index.ts:261:28)\n    at Logger.throwError (/home/optimism/optimism/node_modules/@ethersproject/providers/node_modules/@ethersproject/logger/src.ts/index.ts:273:20)\n    at /home/optimism/optimism/node_modules/@ethersproject/providers/node_modules/@ethersproject/web/src.ts/index.ts:280:28\n    at step (/home/optimism/optimism/node_modules/@ethersproject/providers/node_modules/@ethersproject/web/lib/index.js:33:23)\n    at Object.throw (/home/optimism/optimism/node_modules/@ethersproject/providers/node_modules/@ethersproject/web/lib/index.js:14:53)\n    at rejected 
(/home/optimism/optimism/node_modules/@ethersproject/providers/node_modules/@ethersproject/web/lib/index.js:6:65)\n    at runMicrotasks (<anonymous>)\n    at processTicksAndRejections (node:internal/process/task_queues:96:5)","code":"SERVER_ERROR"},"msg":"Caught an unhandled error"}

platschi avatar Jun 12 '22 12:06 platschi

@platschi can be temporarily resolved by syncing from Alchemy/Infura/whatever, any node provider that caches events. I'm looking at potential fixes but have a few other things in the pipeline and likely won't have a fix out until end of next week at the earliest.

smartcontracts avatar Jun 16 '22 16:06 smartcontracts

https://github.com/ethereum-optimism/optimism/blob/7baf49f1862fcad037a44e69b82b39e670307140/packages/data-transport-layer/src/services/l1-ingestion/service.ts#L353-L363

The block range is too large to fetch the events in a single query. I think it can be fixed.

ericlee42 avatar Jul 15 '22 03:07 ericlee42
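A common mitigation for oversized eth_getLogs ranges (a sketch of the general technique, not the DTL's actual fix) is to split the range into bounded chunks so each request stays within what the backend will serve. The `chunkRange` helper below is hypothetical:

```typescript
// Split [fromBlock, toBlock] into sub-ranges of at most maxSpan blocks,
// so each eth_getLogs call covers a bounded window.
const chunkRange = (
  fromBlock: number,
  toBlock: number,
  maxSpan: number
): Array<{ fromBlock: number; toBlock: number }> => {
  const chunks: Array<{ fromBlock: number; toBlock: number }> = []
  for (let start = fromBlock; start <= toBlock; start += maxSpan) {
    chunks.push({
      fromBlock: start,
      toBlock: Math.min(start + maxSpan - 1, toBlock),
    })
  }
  return chunks
}
```

Each chunk would then be passed to the provider's getLogs call in turn, trading one huge response for several small ones.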

I ran into this problem too and found this issue, but it seems the PR is paused now... any progress?

BabySid avatar Aug 01 '22 06:08 BabySid

@smartcontracts

Sorry to interrupt, but is there a plan for fixing this issue? :)

BabySid avatar Aug 01 '22 08:08 BabySid

@BabySid (and perhaps others looking at this issue), I spent some more time thinking about this and I want to give some context as to where we're at with fixing this:

  • Generally speaking, this issue occurs when syncing from L1 Geth nodes. It doesn't happen when syncing from Erigon or when syncing from node providers like Alchemy or Infura. Mainly, this is because Geth's log querying performance isn't very good.
  • Solving this problem "the right way" requires modifying the data-transport-layer to either introduce a new stateful entry in the dtl's database or introducing some relatively complex caching.
  • Since other syncing options exist (namely syncing via Erigon if you want to run your own node), there's less of a need to solve this issue immediately.
  • We are all hands on deck attempting to ship Bedrock, the new version of Optimism that will make issues like this obsolete.

So, therefore:

  • We will likely not fix this issue "the right way" until after we ship Bedrock.
  • If you want to sync without trusting node providers, I would recommend syncing off of Erigon.
  • If you really, really need to sync off of a Geth node, I can potentially put together a hacky (very unofficial) data-transport-layer image for you that would resolve the issue BUT could break under certain conditions (specifically, if we use a certain upgrade path on L1; we no longer use that path and have no plans to use it in the future).

smartcontracts avatar Sep 06 '22 16:09 smartcontracts

Thank you very much for the reply. I'm now using Erigon to synchronize, and I'm very much looking forward to Bedrock.

BabySid avatar Sep 07 '22 06:09 BabySid

#3445 fixed this issue for mainnet

smartcontracts avatar Sep 24 '22 02:09 smartcontracts