beacon-APIs Add liveness endpoint for doppleganger protection

What

Following from #64, this PR adds a /eth/v1/validator/liveness endpoint which indicates if a validator has been observed to be "live" in some epoch. The idea is that doppleganger protection should only start if some validator has is_live: false for some number of epochs.

This PR differs from @dankrad's proposal in that it doesn't ask "give me the latest attestation", instead it asks the broader question "has the validator been observed to be live?". This has the following advantages:

It's an easier question to answer. Clients already need to store caches for observed attesters/aggregators/proposers that answer the boolean question "has this validator already been observed". Keeping a cache of old attestations for validators is more of a burden.
It's a less specific question so clients are welcome to include more "liveness" sources if they see fit. It's also more future-proof since it might also evolve to encompass future message types.

Why

This provides doppleganger protection as a first-class citizen. Given that clients already need to answer the "has this validator been observed" question for gossip message verification, this seems cheap to implement. If we add this endpoint then things like Infura can also support doppleganger protection.

I understand that Nimbus has already implemented doppleganger protection (cc @arnetheduck) and Lighthouse is in the later stages of implementing it (see #2230).

Example

Request

POST /eth/v1/validator/liveness

{
  "epoch": "1",
  "indices": [
    "1"
  ]
}

Response

{
  "data": [
    {
      "index": "1",
      "epoch": "1",
      "is_live": true
    }
  ]
}

Apr 18 '21 00:04 paulhauner

I'm not sure why CI is failing, I had a look but it's not clear to me. Perhaps someone with more CI skills could nudge me in the right direction? :pray:

Apr 18 '21 01:04 paulhauner

Quick question on how is this intended to be used? It only queries for a single epoch, so I assume a validator client would have to poll a few (2-5) epochs to be certain?

Maybe it's worth adding a short sentence about this intended use.

Apr 20 '21 14:04 dankrad

@paulhauner Shouldn't this be part of the eth/v1/beacon. Ultimately you are asking the question "Who are the validators being observed to be live at this (end of) epoch?" no? Which the beacon-chain will answer similar to other requests like ListValidators /eth/v1/beacon/states/{state_id}/validators or attestations etc..

May 12 '21 06:05 ahadda5

@paulhauner Shouldn't this be part of the eth/v1/beacon. Ultimately you are asking the question "Who are the validators being observed to be live at this (end of) epoch?" no? Which the beacon-chain will answer similar to other requests like ListValidators /eth/v1/beacon/states/{state_id}/validators or attestations etc..

I'm trying to determine if the validator has been seen to sign any message during the given epoch to try and make a guess about whether or not there's some other VC running the same pubkey. This is different to asking if it exists in the state, has an attestation included in a block or if it is included in the attestation pool.

May 12 '21 06:05 paulhauner

I'm trying to determine if the validator has been seen to sign any message during the given epoch to try and make a guess about whether or not there's some other VC running the same pubkey. This is different to asking if it exists in the state, has an attestation included in a block or if it is included in the attestation pool.

Got it. I looked at the other requests and saw their folder structure. I was confused as to where this get belongs. Thanks

May 12 '21 08:05 ahadda5

There's no information in here on how far back the "epoch" parameter can/should go. Is it expected that this could be called for any epoch, or for only the most recent epochs? And if the latter, what would a reasonable number of epochs be? Perhaps back to the last finalized epoch?

Feb 21 '22 09:02 mcdee

@paulhauner could you check "Allow edits from maintainers" or update yourself? :)

Feb 21 '22 15:02 mpetrunic

Thanks for raising this PR - we're in the process of adding doppelganger support to our VC, so getting this settled would be nice.

There's no information in here on how far back the "epoch" parameter can/should go.

Indeed -an easy solution here is to simply remove it - in Nimbus, we only keep track of a single "last-seen" data point, which is sufficient for doppelganger, or "liveness" in general.

is_live

This flag seems redundant - I think we can capture the essence of this API by simply returning the last epoch (or indeed slot) and let consumers decide on their liveness policy.

epoch

Possibly, we could increase granularity here to slot.

Jun 30 '22 08:06 arnetheduck

I'll throw my support behind this one. Post-merge, Rocket Pool has seen several people shuttling nodes around to migrate to different clients with better performance, and in some situations exporting & importing the slashing DB isn't easy to do. Doppelganger detection is a common way to miss attesting on purpose in cases like this, but we had one instance where someone tried to attach a Lighthouse VC to a Nimbus BN and it failed because the routes aren't standardized.

Having a standardized liveness route so any VC with doppelganger could work with any BN would be very well received by our node operators.

Sep 19 '22 04:09 jclapis

Alrighty, I have updated this PR. My apologies for letting it linger. Thanks for @rolfyone for prodding me (several times) :pray:

There's no information in here on how far back the "epoch" parameter can/should go.

@mcdee, good point. I've added that the BN MUST support the current and previous epoch and MAY support previous epochs. Based on our implementation in Lighthouse, this seems to be the minimally useful set of information.

Indeed -an easy solution here is to simply remove it

@arnetheduck, removing epoch is a reasonable request. What it would remove is the ability to know that a validator was live in the current epoch but not the previous one. It's been a while since we implemented DG, so I'm not sure if this is an issue or not. Removing the epoch parameter makes the API slightly simpler, keeping it makes it slightly more powerful. I'm ambivalent and therefore tempted to leave it as is. I'm open to arguments.

Possibly, we could increase granularity here to slot.

The reason that I have specified epoch is since P2P validation conditions already require nodes to track which epochs they've seen attestations, however it doesn't require tracking specific slots. One of the goals of this PR was to be minimally intrusive. I'm open to switching to slots, but I think the fact that clients already have the groundwork for epochs is an appealing reason to leave it at that.

Oct 19 '22 22:10 paulhauner

I'm open to arguments.

I think without epoch, the query poses a different question: "when was the validator last active" vs "was the validator active in epoch X" - answering the latter requires book-keeping of historical "activeness" which complicates the implementation

epoch in both request and response

as specified, it's somewhat ambiguous what the epoch parameter in the answer should be - if a validator was seen to sign something in epoch 5 and a query is done for epoch 6, should the validator be considered active or not (is_live)? if not, why is the epoch both in request and response? under what circumstances would epoch in the response differ from that of the request?

current and previous epoch

under what circumstances do we need to know two epochs for doppelganger, and why two specifically? it seems sufficient to know when the validator last published something - this allows configuring different paranoia policies (number of inactive epochs) for liveness without having to do multiple (potentially racy) queries. limiting it to "previous" and "current" constrains the implementation to that number of "inactive" epochs (one-and-a-few-slots)

Oct 20 '22 21:10 arnetheduck

@arnetheduck, I can see two points in your previous post:

Why should we track liveness per epoch rather than "highest live epoch".
Why is the epoch returned in the response.

1. Why track liveness per epoch

Firstly, we already need to track per-epoch liveness to follow the P2P spec. For instance, if you've seen an attestation in epoch n but not n - 1 you still need to propagate the epoch n - 1 attestation. Tracking just the highest seen epoch would not permit the client to follow the P2P spec.

Secondly, we should note that the P2P spec only requires the node to track the current + previous epochs (potentially the next epoch too, if you consider clock disparity).

Given those two points, let's assume that all nodes have at least this abstract data structure in order to meet the P2P propagation conditions:

struct AttestationSeen {
  previous_epoch: Set<ValidatorIndex>,
  current_epoch: Set<ValidatorIndex>,
}

If we accept that the above struct is the minimum required struct to meet the P2P propagation conditions, then I think it's clear that the best way to query it is by epoch. I argue that querying this struct for "highest seen epoch" is strictly worse for the following reasons:

It's impossible to know the liveness status of the previous epoch if the validator was live in the current epoch.
There is no way for the server to communicate that the requested epoch is out of range. If the epoch is provided in the request then the server can say "that epoch is out of range", this reduces edge-cases when the client and server have clock sync issues and also allows the server to communicate the lowest epoch it supports. If the client asks for the highest epoch then it gets no information back from the server about their respective clock sync or which epochs the server actually tracks.

So, I argue that "highest live epoch" is:

A complication when we consider what clients are already required to track.
Provides strictly less information.

The upside of "highest live epoch" is that it can smoosh together information about the current and previous epoch to reduce the number of calls to the BN. You made a point about "racy" queries, I reject that claim because the liveness of epoch n - 1 and n are independent values. I also argue that doppelganger queries are light enough and infrequent enough that the gains of optimising functionality away from them is premature.

2. Why is the `epoch` returned in the response.

The epoch in the request is always the same as the response. I don't recall why this was added (we implemented this over a year ago). Perhaps we can remove it, I don't really mind.

Summary

Unless there's a glaring hole in my logic here, I have a rather small appetite for arguing this further. It matters little to me about what we do here, I'm sure we can just shim your approach into ours if need be. If my reasoning is sound, but you still want to optimise this API call then feel free to make a new PR for it. If that gets merged rather than this then Lighthouse will implement it.

Oct 20 '22 23:10 paulhauner

I'm in line with your logic @paulhauner , thanks for the detailed description and your hackmd - I find it all really logical...

To me, there's been no discussion for a long time, so I think that we were happy previously with the logic also. Will flag this to make sure ppl read it on discord, but I'm inclined to approve and merge unless theres strong objections raised...

Oct 21 '22 06:10 rolfyone

SGTM, thanks for the rationale
Tendentially it seems better to avoid redundant data as it opens up for ambiguity and implementation bugs/complexity (ie implementations must now check that the epoch is indeed correct) for no gain - annoying, but not a big deal

Oct 23 '22 07:10 arnetheduck

Tendentially it seems better to avoid redundant data as it opens up for ambiguity and implementation bugs/complexity (ie implementations must now check that the epoch is indeed correct) for no gain - annoying, but not a big deal

I had a look around the other endpoints and it seems that not including the epoch in the response would be most consistent with the rest of the repo. I also think your argument makes sense. I've removed the epoch in 4bbefdb.

Oct 23 '22 21:10 paulhauner

Nitpick:

Why track liveness per endpoint

It should say epoch, not endpoint.

Oct 24 '22 09:10 rkapka

Add liveness endpoint for doppleganger protection

What

Why

Example

Request

Response

1. Why track liveness per epoch

2. Why is the epoch returned in the response.

Summary

2. Why is the `epoch` returned in the response.