op-node: improve temporary error-recovery on block-sealing

Open • protolambda opened this issue 1 year ago • 1 comment

Currently, in both the sequencer and the derivation pipeline, we restart block-building from the attributes input when sealing fails, rather than trying to recover the existing payload job.

In the case of a context timeout, the block-building job has likely expired, and if the job identifier is unknown to the engine, the payload is no longer retrievable either.

But in some cases the engine_getPayload call can be re-attempted and return successfully, without redoing the block-building work. The sequencing code can be improved to handle this.

See review comments by Adrian in https://github.com/ethereum-optimism/optimism/pull/10991 for additional context.

protolambda • Jul 11 '24 07:07

Once we have a parallel derivation pipeline, the simplest answer here may be to just have the engine API client retry the getPayload call if it times out or can't connect (up to some limit). Blocking while we retry is not a good idea for a synchronous pipeline. The assumption here is that getPayload only fails to give any response if the node is already badly overloaded or offline, so attempting to start building a new block won't help.

Once you get an actual error response from the node, it probably isn't recoverable, so you should start building again.

ajsutton • Jul 11 '24 23:07