lighthouse
lighthouse copied to clipboard
Handle unknown head block during attestation publishing
Description
Presently if an attestation is POSTed to the beacon node for a head block that hasn't finished being imported, Lighthouse will log a nasty ERRO
:
ERRO Failure verifying attestation for gossip, attestation_slot: XX, committee_index: XX, request_index: X, error: UnknownHeadBlock { beacon_block_root: 0xa98ce8635c5493907d6536905b46eb09db54f8465a82d95aae90650509884f4a }
It will also return a 400 BAD REQUEST
to the VC, which creates a knock-on effect. There's an old issue about this here #1971, which was closed because we started queueing attestations from gossip in #635. We still don't have any queueing for attestations from the HTTP API, which is why this error occurs.
In practice, this will often be encountered by users running Vouch – which broadcasts attestations to all nodes, regardless of whether they already know the head block.
Version
Lighthouse v4.4.1 and older
Present Behaviour
- Lighthouse rejects any attestation received on the HTTP API that doesn't immediately pass gossip validation. This includes attestations for blocks that are in the process of being imported.
Expected Behaviour
- Lighthouse queues attestations received on the HTTP API so that it can:
- Return a 200 OK to the caller, and
- Not log an error message
Steps to resolve
In terms of implementation I think we could use the reprocessing queue. The hard part is blocking the HTTP handler waiting for the attestation to be dequeued and processed. I had a go at implementing something similar for blocks in #4643, which wasn't merged because it ended up being poorly suited to the problem it was trying to solve. We could probably adapt that code, particularly the use of a channel to send back the completion message, and use it for attestations. We could even extend the existing attestation reprocessing logic with an optional channel that gets initialized (and waited on) only for HTTP attestations.
I'd like to take a stab at this one
@michaelsproul will this issue be fixed in the 4.6.0
release, by a future PR or other unrelated one?
@ricardolyn We don't have an implementation yet, so probably not in 4.6
@michaelsproul i didn't end up getting a chance to work on this, feel free to take it. thanks!
Thanks. I'll start on this now. Will aim to get it in for 4.7.0/5.0.0 ( :eyes: )
Implemented, but I haven't had a chance to test yet and probably won't get to until after the holidays. We are all hands on deck trying to ship v4.6.0.
https://github.com/sigp/lighthouse/pull/5010
@michaelsproul Is it scheduled to be included in v4.6.0? Obol users face this issue.
No it won't make 4.6. Needs more testing
Obol users should still be able to attest successfully via the BN that created the attestation, or am I mistaken?
@michaelsproul Obol is a middleware between the CL and the VC. So basically Obol is the "BN that created the attestation", but they are not distributing. So if Lighthouse drops it, the attestation is missed. Which is what we have seen.
@koraykoska Where does Obol get the attestation from though? My point is that at least one of the BNs in the cluster must have produced it, and that BN should succeed in publishing it. Obol should not be attesting to blocks that aren't validated by any of the BNs in its cluster.
To be clear, we have something like this, don't we?:
VC1 -> Obol1 -> BN1
VC2 -> Obol2 -> BN2
VC3 -> Obol3 -> BN3
If BN2 produces the attestation, but BN1 and BN3 haven't processed the block yet, then when it gets published can't BN2 publish the full attestation with signatures from VC1+VC2+VC3? It shouldn't matter if BN1 and BN3 can't publish
CC @xenowits
The cluster can submit the attestation but the node with this bug misses its duty. In a 3/4 cluster this means if 2 are in this state at the same time, the cluster misses its attestation duty. I don't know the internals, but that's our observation.
On Wed, 17 Jan 2024 at 13:28, Michael Sproul @.***> wrote:
@koraykoska https://github.com/koraykoska Where does Obol get the attestation from though? My point is that at least one of the BNs in the cluster must have produced it, and that BN should succeed in publishing it. Obol should not be attesting to blocks that aren't validated by any of the BNs in its cluster.
To be clear, we have something like this, don't we?:
VC1 -> Obol1 -> BN1 VC2 -> Obol2 -> BN2 VC3 -> Obol3 -> BN3
If BN2 produces the attestation, but BN1 and BN3 haven't processed the block yet, then when it gets published can't BN2 publish the full attestation with signatures from VC1+VC2+VC3?
CC @xenowits https://github.com/xenowits
— Reply to this email directly, view it on GitHub https://github.com/sigp/lighthouse/issues/4699#issuecomment-1895709643, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACWUTTKBMTBNDLIRXJV4DR3YO67VJAVCNFSM6AAAAAA4K7N2P2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOJVG4YDSNRUGM . You are receiving this because you were mentioned.Message ID: @.***>
Thanks @koraykoska. That seems strange to me, so I'll ask the Obol guys to clarify. My understanding was that once the cluster had produced a "full" attestation with aggregated signatures from all participants, it would be able to publish it using any single beacon node.
Confirmed with Obol that the complete attestation is published via all BNs, so the attestation should succeed as long as at least 1 node has finished processing the block and is capable of validating the attestation.
@koraykoska If you check your validator's performance for the slots where your BN fails, do you notice attestation penalties?
Either way, we will fix this shortly after 4.6 with #5010
Closed by https://github.com/sigp/lighthouse/pull/5010 which is in the upcoming v5.0.0 release :tada: