L1 DTL sync can fail when pointing at nodes that don't cache events
Describe the bug
When DATA_TRANSPORT_LAYER__SYNC_FROM_L1=true, the DTL can stall and fail to sync new transactions.
We see the following error in logs:
Error: bad response (status=502, headers={"content-length":"0","connection":"close","date":"Sun, 29 May 2022 17:34:38 GMT","via":"kong/2.8.1.0-enterprise-edition"}, body=null, requestBody="{\"method\":\"eth_getLogs\",\"params\":[{\"fromBlock\":\"0xcf7732\",\"toBlock\":\"0xde7af2\",\"address\":\"0xde1fcfb0851916ca5101820a69b13a4e276bd81f\",\"topics\":[\"0x9416a153a346f93d95f94b064ae3f148b6460473c6e82b3f9fc2521b873fcd6c\",\"0x02b616af23339f1e031e76333e2d5b1c3067beb78578c961911872cc2127ef8b\"]}],\"id\":650,\"jsonrpc\":\"2.0\"}", requestMethod="POST", url="https://venasaur.com/", code=SERVER_ERROR, version=web/5.5.1)
The RPC request seems to originate from this line. The DTL queries eth_getLogs to retrieve events emitted by the AddressManager contract. However, the DTL builds this query with a block range whose starting block is always the configured DATA_TRANSPORT_LAYER__L1_START_HEIGHT. This causes problems because the JSON-RPC response to such a request grows enormous once the DTL is syncing close to the tip; in the log above, the block range spans roughly 1 million blocks. That most likely explains the HTTP 5xx error returned by the backend. The problem is compounded because the DTL retries the same large-range log query on every loop of the l1-ingestion service.
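For illustration, here is a minimal sketch of the problematic query pattern, not the actual DTL code (the constants, names, and loop structure are hypothetical stand-ins; the address is taken from the log above), assuming ethers v5:

```typescript
import { ethers } from 'ethers'

// Hypothetical stand-ins for the DTL's configuration; the real values come
// from DATA_TRANSPORT_LAYER__* environment variables.
const L1_START_HEIGHT = 0xcf7732
const ADDRESS_MANAGER = '0xde1fcfb0851916ca5101820a69b13a4e276bd81f'

const provider = new ethers.providers.JsonRpcProvider('http://localhost:8545')

// Each pass of the ingestion loop re-queries logs from the *configured*
// start height up to the current tip, so the requested range keeps growing
// as the chain advances.
async function ingestionLoopIteration(): Promise<void> {
  const tip = await provider.getBlockNumber()
  const logs = await provider.getLogs({
    address: ADDRESS_MANAGER,
    fromBlock: L1_START_HEIGHT, // always the same starting block
    toBlock: tip,               // ~1M blocks past the start height by now
  })
  console.log(`scanned ${tip - L1_START_HEIGHT} blocks, got ${logs.length} logs`)
}
```

A backend that doesn't cache events has to scan the entire range on every call, which is consistent with the 502/ECONNRESET failures appearing once the range gets wide enough.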
Desired Outcome
The DTL shouldn't resync logs starting from DATA_TRANSPORT_LAYER__L1_START_HEIGHT on every pass. It should remember the events returned by prior queries and advance the fromBlock parameter of eth_getLogs on successive syncs.
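A minimal sketch of that desired behavior, not the actual fix (the `SimpleStore` interface and the `highestScannedL1Block` key are hypothetical), assuming ethers v5:

```typescript
import { ethers } from 'ethers'

// Hypothetical key-value store standing in for the DTL's database.
interface SimpleStore {
  get(key: string): Promise<number | undefined>
  put(key: string, value: number): Promise<void>
}

async function syncLogs(
  provider: ethers.providers.JsonRpcProvider,
  db: SimpleStore,
  address: string,
  configuredStartHeight: number
): Promise<void> {
  // Resume from the checkpoint if one exists; fall back to the configured
  // start height only on the very first sync.
  const checkpoint = await db.get('highestScannedL1Block')
  const fromBlock =
    checkpoint !== undefined ? checkpoint + 1 : configuredStartHeight

  const toBlock = await provider.getBlockNumber()
  if (fromBlock > toBlock) {
    return // already caught up
  }

  const logs = await provider.getLogs({ address, fromBlock, toBlock })
  // ...handle the logs (e.g. update cached contract addresses)...

  // Persist the checkpoint so the next loop only scans new blocks.
  await db.put('highestScannedL1Block', toBlock)
}
```

Persisting the checkpoint means each loop only scans blocks it hasn't seen, so the request size stays bounded no matter how far the tip has advanced.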
cc: @smartcontracts
Can confirm having the same issue, resulting in stale sync:
{"level":50,"time":1655037518661,"extra":{"message":"Error: missing response (requestBody=\"{\\\"method\\\":\\\"eth_getLogs\\\",\\\"params\\\":[{\\\"fromBlock\\\":\\\"0xcf7732\\\",\\\"toBlock\\\":\\\"0xe41f04\\\",\\\"address\\\":\\\"0xde1fcfb0851916ca5101820a69b13a4e276bd81f\\\",\\\"topics\\\":[\\\"0x9416a153a346f93d95f94b064ae3f148b6460473c6e82b3f9fc2521b873fcd6c\\\",\\\"0x02b616af23339f1e031e76333e2d5b1c3067beb78578c961911872cc2127ef8b\\\"]}],\\\"id\\\":50,\\\"jsonrpc\\\":\\\"2.0\\\"}\", requestMethod=\"POST\", serverError={\"code\":\"ECONNRESET\"}, url=\"http://localhost:8545\", code=SERVER_ERROR, version=web/5.6.1)","stack":"Error: missing response (requestBody=\"{\\\"method\\\":\\\"eth_getLogs\\\",\\\"params\\\":[{\\\"fromBlock\\\":\\\"0xcf7732\\\",\\\"toBlock\\\":\\\"0xe41f04\\\",\\\"address\\\":\\\"0xde1fcfb0851916ca5101820a69b13a4e276bd81f\\\",\\\"topics\\\":[\\\"0x9416a153a346f93d95f94b064ae3f148b6460473c6e82b3f9fc2521b873fcd6c\\\",\\\"0x02b616af23339f1e031e76333e2d5b1c3067beb78578c961911872cc2127ef8b\\\"]}],\\\"id\\\":50,\\\"jsonrpc\\\":\\\"2.0\\\"}\", requestMethod=\"POST\", serverError={\"code\":\"ECONNRESET\"}, url=\"http://localhost:8545\", code=SERVER_ERROR, version=web/5.6.1)\n at Logger.makeError (/home/optimism/optimism/node_modules/@ethersproject/providers/node_modules/@ethersproject/logger/src.ts/index.ts:261:28)\n at Logger.throwError (/home/optimism/optimism/node_modules/@ethersproject/providers/node_modules/@ethersproject/logger/src.ts/index.ts:273:20)\n at /home/optimism/optimism/node_modules/@ethersproject/providers/node_modules/@ethersproject/web/src.ts/index.ts:280:28\n at step (/home/optimism/optimism/node_modules/@ethersproject/providers/node_modules/@ethersproject/web/lib/index.js:33:23)\n at Object.throw (/home/optimism/optimism/node_modules/@ethersproject/providers/node_modules/@ethersproject/web/lib/index.js:14:53)\n at rejected (/home/optimism/optimism/node_modules/@ethersproject/providers/node_modules/@ethersproject/web/lib/index.js:6:65)\n at runMicrotasks (<anonymous>)\n at processTicksAndRejections (node:internal/process/task_queues:96:5)","code":"SERVER_ERROR"},"msg":"Caught an unhandled error"}
@platschi This can be temporarily resolved by syncing from Alchemy/Infura/whatever, i.e. any node provider that caches events. I'm looking at potential fixes but have a few other things in the pipeline and likely won't have a fix out until the end of next week at the earliest.
https://github.com/ethereum-optimism/optimism/blob/7baf49f1862fcad037a44e69b82b39e670307140/packages/data-transport-layer/src/services/l1-ingestion/service.ts#L353-L363
The block range is too large to fetch the events in a single query; I think it can be fixed.
I'm hitting this problem too and found this issue, but it seems the PR is stalled now... any progress?
@smartcontracts
Sorry to interrupt, but is there any update on the plan to fix this issue? :)
@BabySid (and perhaps others looking at this issue), I spent some more time thinking about this and I want to give some context as to where we're at with fixing this:
- Generally speaking, this issue occurs when syncing from L1 Geth nodes. It doesn't happen when syncing from Erigon or when syncing from node providers like Alchemy or Infura. Mainly, this is because Geth's log querying performance isn't very good (see the chunked-query sketch after this list for one common mitigation).
- Solving this problem "the right way" requires modifying the `data-transport-layer` to either introduce a new stateful entry in the DTL's database or introduce some relatively complex caching.
- Since other syncing options exist (namely syncing via Erigon if you want to run your own node), there's less of a need to solve this issue immediately.
- We are all hands on deck attempting to ship Bedrock, the new version of Optimism that will make issues like this obsolete.
Therefore:
- We will likely not fix this issue "the right way" until after we ship Bedrock.
- If you want to sync without trusting node providers, I would recommend syncing off of Erigon.
- If you really, really need to sync off of a Geth node, I can potentially put together a hacky (very unofficial) `data-transport-layer` image for you that would resolve the issue BUT could break under certain conditions (the hack can break if we use a certain upgrade path on L1, but we no longer use this upgrade path and have no plans to use it in the future).
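For reference, the chunked-query mitigation mentioned above might look like the following. This is a hedged sketch rather than anything shipped in the repo (`getLogsChunked` and `CHUNK_SIZE` are hypothetical; the range a given Geth node tolerates depends on its configuration), again assuming ethers v5:

```typescript
import { ethers } from 'ethers'

// A common mitigation for backends that choke on wide eth_getLogs ranges:
// split the scan into fixed-size chunks. 2000 is an arbitrary choice here.
const CHUNK_SIZE = 2000

async function getLogsChunked(
  provider: ethers.providers.JsonRpcProvider,
  filter: { address: string; topics?: string[] },
  fromBlock: number,
  toBlock: number
): Promise<ethers.providers.Log[]> {
  const logs: ethers.providers.Log[] = []
  for (let start = fromBlock; start <= toBlock; start += CHUNK_SIZE) {
    const end = Math.min(start + CHUNK_SIZE - 1, toBlock)
    // Each request now covers at most CHUNK_SIZE blocks.
    logs.push(
      ...(await provider.getLogs({ ...filter, fromBlock: start, toBlock: end }))
    )
  }
  return logs
}
```

Note that chunking only bounds the size of each individual request; without a persisted checkpoint like the one sketched earlier, the DTL would still rescan the full range on every loop.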
Thank you very much for the replies. I'm now syncing from Erigon, and I'm very much looking forward to Bedrock.
#3445 fixed this issue for mainnet