lighthouse icon indicating copy to clipboard operation
lighthouse copied to clipboard

Handle unknown head block during attestation publishing

Open michaelsproul opened this issue 1 year ago • 13 comments

Description

Presently if an attestation is POSTed to the beacon node for a head block that hasn't finished being imported, Lighthouse will log a nasty ERRO:

ERRO Failure verifying attestation for gossip, attestation_slot: XX, committee_index: XX, request_index: X, error: UnknownHeadBlock { beacon_block_root: 0xa98ce8635c5493907d6536905b46eb09db54f8465a82d95aae90650509884f4a }

It will also return a 400 BAD REQUEST to the VC, which creates a knock-on effect. There's an old issue about this here #1971, which was closed because we started queueing attestations from gossip in #635. We still don't have any queueing for attestations from the HTTP API, which is why this error occurs.

In practice, this will often be encountered by users running Vouch – which broadcasts attestations to all nodes, regardless of whether they already know the head block.

Version

Lighthouse v4.4.1 and older

Present Behaviour

  • Lighthouse rejects any attestation received on the HTTP API that doesn't immediately pass gossip validation. This includes attestations for blocks that are in the process of being imported.

Expected Behaviour

  • Lighthouse queues attestations received on the HTTP API so that it can:
    • Return a 200 OK to the caller, and
    • Not log an error message

Steps to resolve

In terms of implementation I think we could use the reprocessing queue. The hard part is blocking the HTTP handler waiting for the attestation to be dequeued and processed. I had a go at implementing something similar for blocks in #4643, which wasn't merged because it ended up being poorly suited to the problem it was trying to solve. We could probably adapt that code, particularly the use of a channel to send back the completion message, and use it for attestations. We could even extend the existing attestation reprocessing logic with an optional channel that gets initialized (and waited on) only for HTTP attestations.

michaelsproul avatar Sep 05 '23 00:09 michaelsproul

I'd like to take a stab at this one

eserilev avatar Oct 05 '23 20:10 eserilev

@michaelsproul will this issue be fixed in the 4.6.0 release, by a future PR or other unrelated one?

ricardolyn avatar Dec 05 '23 19:12 ricardolyn

@ricardolyn We don't have an implementation yet, so probably not in 4.6

michaelsproul avatar Dec 05 '23 20:12 michaelsproul

@michaelsproul i didn't end up getting a chance to work on this, feel free to take it. thanks!

eserilev avatar Dec 12 '23 18:12 eserilev

Thanks. I'll start on this now. Will aim to get it in for 4.7.0/5.0.0 ( :eyes: )

michaelsproul avatar Dec 13 '23 01:12 michaelsproul

Implemented, but I haven't had a chance to test yet and probably won't get to until after the holidays. We are all hands on deck trying to ship v4.6.0.

https://github.com/sigp/lighthouse/pull/5010

michaelsproul avatar Dec 14 '23 00:12 michaelsproul

@michaelsproul Is it scheduled to be included in v4.6.0? Obol users face this issue.

koraykoska avatar Jan 17 '24 10:01 koraykoska

No it won't make 4.6. Needs more testing

Obol users should still be able to attest successfully via the BN that created the attestation, or am I mistaken?

michaelsproul avatar Jan 17 '24 11:01 michaelsproul

@michaelsproul Obol is a middleware between the CL and the VC. So basically Obol is the "BN that created the attestation", but they are not distributing. So if Lighthouse drops it, the attestation is missed. Which is what we have seen.

koraykoska avatar Jan 17 '24 11:01 koraykoska

@koraykoska Where does Obol get the attestation from though? My point is that at least one of the BNs in the cluster must have produced it, and that BN should succeed in publishing it. Obol should not be attesting to blocks that aren't validated by any of the BNs in its cluster.

To be clear, we have something like this, don't we?:

VC1 -> Obol1 -> BN1
VC2 -> Obol2 -> BN2
VC3 -> Obol3 -> BN3

If BN2 produces the attestation, but BN1 and BN3 haven't processed the block yet, then when it gets published can't BN2 publish the full attestation with signatures from VC1+VC2+VC3? It shouldn't matter if BN1 and BN3 can't publish

CC @xenowits

michaelsproul avatar Jan 17 '24 12:01 michaelsproul

The cluster can submit the attestation but the node with this bug misses its duty. In a 3/4 cluster this means if 2 are in this state at the same time, the cluster misses its attestation duty. I don't know the internals, but that's our observation.

On Wed, 17 Jan 2024 at 13:28, Michael Sproul @.***> wrote:

@koraykoska https://github.com/koraykoska Where does Obol get the attestation from though? My point is that at least one of the BNs in the cluster must have produced it, and that BN should succeed in publishing it. Obol should not be attesting to blocks that aren't validated by any of the BNs in its cluster.

To be clear, we have something like this, don't we?:

VC1 -> Obol1 -> BN1 VC2 -> Obol2 -> BN2 VC3 -> Obol3 -> BN3

If BN2 produces the attestation, but BN1 and BN3 haven't processed the block yet, then when it gets published can't BN2 publish the full attestation with signatures from VC1+VC2+VC3?

CC @xenowits https://github.com/xenowits

— Reply to this email directly, view it on GitHub https://github.com/sigp/lighthouse/issues/4699#issuecomment-1895709643, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACWUTTKBMTBNDLIRXJV4DR3YO67VJAVCNFSM6AAAAAA4K7N2P2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOJVG4YDSNRUGM . You are receiving this because you were mentioned.Message ID: @.***>

koraykoska avatar Jan 17 '24 13:01 koraykoska

Thanks @koraykoska. That seems strange to me, so I'll ask the Obol guys to clarify. My understanding was that once the cluster had produced a "full" attestation with aggregated signatures from all participants, it would be able to publish it using any single beacon node.

michaelsproul avatar Jan 17 '24 22:01 michaelsproul

Confirmed with Obol that the complete attestation is published via all BNs, so the attestation should succeed as long as at least 1 node has finished processing the block and is capable of validating the attestation.

@koraykoska If you check your validator's performance for the slots where your BN fails, do you notice attestation penalties?

Either way, we will fix this shortly after 4.6 with #5010

michaelsproul avatar Jan 18 '24 06:01 michaelsproul

Closed by https://github.com/sigp/lighthouse/pull/5010 which is in the upcoming v5.0.0 release :tada:

michaelsproul avatar Feb 20 '24 22:02 michaelsproul