
Checkpoint Sync API

Open mkalinin opened this issue 3 years ago • 33 comments

Specification

GET /eth/v1/checkpoint/finalized_state

  • Returns the BeaconState object for a finalized checkpoint state within the weak subjectivity (WS) period

GET /eth/v1/checkpoint/finalized_blocks/{slot}/root

  • Returns 404 if a block at a slot is either unavailable or not yet finalized
  • Otherwise, returns beacon block root

Motivation

Facilitates checkpoint sync adoption by simplifying the following scheme:

  • [Few] State provider(s) supply a state at a finalized checkpoint within WS period
  • [Many] Trust providers expose /eth/v1/checkpoint/finalized_blocks/{slot}/root endpoint to allow for checking that a block in the pulled state is finalized
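A minimal sketch of how a syncing client might combine the two roles above: it pulls one state from a state provider, then asks several trust providers whether the block root for that state's slot is finalized. The `TrustProvider` type, the function names, and the `min_confirmations` threshold are all assumptions made for illustration; the proposal itself does not define how many confirmations a client should require.

```python
from typing import Callable, Iterable, Optional

# Hypothetical model: a trust provider maps a slot to the finalized block
# root it reports, or None when it answers 404 or cannot be reached.
TrustProvider = Callable[[int], Optional[str]]


def count_confirmations(slot: int, expected_root: str,
                        providers: Iterable[TrustProvider]) -> int:
    """Count providers whose finalized root at `slot` matches `expected_root`."""
    confirmations = 0
    for provider in providers:
        reported = provider(slot)
        # Hex roots are compared case-insensitively.
        if reported is not None and reported.lower() == expected_root.lower():
            confirmations += 1
    return confirmations


def state_is_trusted(slot: int, expected_root: str,
                     providers: Iterable[TrustProvider],
                     min_confirmations: int) -> bool:
    """Accept the pulled state only if enough providers confirm its block root."""
    return count_confirmations(slot, expected_root, providers) >= min_confirmations
```

A stricter client could also reject the state outright if any provider reports a different root for the same slot; that policy choice is left open here.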

The finalized_blocks/{slot}/root endpoint is a shortcut to the following actions:

  1. GET /eth/v1/beacon/blocks/{slot}/root
  2. GET /eth/v1/beacon/headers/finalized
  3. Check that slot <= finalized_header.slot
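The three steps above can be sketched as a single function. This is an illustration only: `get_json` is a hypothetical helper that performs a GET against a beacon node and returns the decoded JSON body, or `None` on a 404, and the response field paths reflect my reading of the existing Beacon API responses rather than anything defined in this proposal.

```python
from typing import Callable, Optional

# Hypothetical transport: takes an API path, returns decoded JSON or None on 404.
GetJson = Callable[[str], Optional[dict]]


def finalized_block_root(slot: int, get_json: GetJson) -> Optional[str]:
    """Return the block root at `slot` if it is finalized, else None (a 404)."""
    # Step 1: look up the block root at the requested slot.
    root_resp = get_json(f"/eth/v1/beacon/blocks/{slot}/root")
    if root_resp is None:
        return None  # no block available at this slot

    # Step 2: fetch the finalized header.
    header_resp = get_json("/eth/v1/beacon/headers/finalized")
    if header_resp is None:
        return None

    # Step 3: the block is finalized only if slot <= finalized_header.slot.
    finalized_slot = int(header_resp["data"]["header"]["message"]["slot"])
    if slot > finalized_slot:
        return None
    return root_resp["data"]["root"]
```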

The finalized_state endpoint is an alias to GET /eth/v2/debug/beacon/states/finalized

cc @ajsutton @djrtwo

mkalinin avatar Jul 29 '22 11:07 mkalinin

From call: These probably shouldn't be part of the existing APIs because they are meant for third parties to connect to, rather than the operator to connect to (like existing APIs). This means exposing them on a different port, so that they are easy to expose to the internet without exposing everything else. They may also need separate rate limiting and firewalling, which is easier on a separate port.

The trust endpoint in particular we want as widely available as possible, so we should make it very easy for people to expose that to the internet publicly with minimal effort/risk.

MicahZoltu avatar Aug 04 '22 14:08 MicahZoltu

Another point mentioned in the call was the additional functionality that client teams may need to implement in order to checkpoint sync with only a finalized state & a block root.

At the moment checkpoint sync requires differing sets of data depending on the client implementation. For example the complete set of routes to provide checkpoint sync for all clients is something like this:

  • GET /eth/v1/beacon/states/head/finality_checkpoints

  • GET /eth/v1/beacon/blocks/genesis/root

  • GET /eth/v2/beacon/blocks/{finalized/block_root}

  • GET /eth/v2/debug/beacon/states/{genesis/finalized/state_root}

samcm avatar Aug 04 '22 14:08 samcm

These probably shouldn't be part of the existing APIs because they are meant for third parties to connect to, rather than the operator to connect to

Notably, downloading a state is very expensive (100+mb in json) - nobody running a beacon node with validators attached would be advised to be supplying beacon states (for their own good) - conversely, for beacon nodes that don't support active validators, there is little harm in exposing all the REST API - the beacon state call is by far one of the most expensive ones you can expose anyway, making most of the rest benign.

arnetheduck avatar Aug 08 '22 20:08 arnetheduck

the complete

GET /eth/v2/beacon/blocks/{block_root} - in the libp2p protocol, clients are not required to keep a root -> block index - they are required however to keep a slot -> block index for the relevant WS period (to support by-range requests) - by-root requests are only supported for the non-finalized period, and checkpoint syncing should likely follow the same constraints - notably, fetching blocks for the entire WS period is the only way to not violate the libp2p spec (while backfilling, clients are in violation of the libp2p spec by not providing historical data and may be disconnected) so the block request is really "required" for a full checkpoint sync.

genesis_root is interesting in that it's usually part of the "metadata" of the chain (in addition to a state, a chain config is also needed), but not necessarily - getting the genesis state or root via API is one way of avoiding this having to be given at startup, but commonly, the genesis state is available from chain metadata, thus fetching it is not really needed.

GET /eth/v2/debug/beacon/states/{genesis/finalized/state_root}

The state_root lookup is problematic in that there is no requirement elsewhere in the protocol as to which states should be indexed by state root (most clients store only states at certain intervals, and some don't support by-state-root lookups at all because they're expensive) - thus a requirement to index by state roots would require specifying exactly which states should be "fetchable" - restricting this to the finalized state is one way to do this.

arnetheduck avatar Aug 08 '22 20:08 arnetheduck

Notably, downloading a state is very expensive (100+mb in json) - nobody running a beacon node with validators attached would be advised to be supplying beacon states (for their own good)

We should specify that these endpoints only return SSZ at least for the state. There's no reason to serialize to a much bigger json representation in this case. Even so I don't think we're targeting nodes that are running validators here - for security reasons alone I wouldn't recommend exposing any APIs from a node running validators.

conversely, for beacon nodes that don't support active validators, there is little harm in exposing all the REST API - the beacon state call is by far one of the most expensive ones you can expose anyway, making most of the rest benign.

This is definitely incorrect. The /eth/v1/beacon/states/:stateId/validators endpoint is actually the most expensive and problematic API currently. And the fact that it can request any state makes it dramatically more expensive than these proposed APIs even if both supported JSON since it means caching is ineffective. We've got a lot of experience exposing the REST API publicly and there are a lot of minefields there.

ajsutton avatar Aug 08 '22 22:08 ajsutton

Another point mentioned in the call was the additional functionality that client teams may need to implement in order to checkpoint sync with only a finalized state & a block root.

At the moment checkpoint sync requires differing sets of data depending on the client implementation.

I would say the point of these new APIs is to define a small set of APIs that clients will need to work with. It will mean clients needing to do additional work so they can start from just a BeaconState, but otherwise it's significantly harder to be a state provider, which makes it hard to convince people to provide them (and basically centralizes on Infura). While we don't need a lot of state providers, given that trust providers are separate, we still want more than one or two and we want to make it easy for them to be reliable.

ajsutton avatar Aug 08 '22 23:08 ajsutton

The /eth/v1/beacon/states/:stateId/validators endpoint is actually the most expensive and problematic API currently.

I guess this is client-dependent as far as implementation / expensiveness goes (ie how costly it is to generate a "filtered" response) - but given that the full state is a superset of the validators, the responses are certainly larger -> more expensive (from a bandwidth perspective).

+1 agree on the point that limiting to finalized is a good idea for cacheability - that said, this is equivalent to simply not responding to non-finalized requests in the current {:state-id} based API (ie just because by-state-root requests are specified does not mean that they need to be enabled) - therefore it might be more flexible to allow a full state id, but specify that clients wanting to do a checkpoint sync "SHOULD" use finalized.

As such, one would expose /eth/v1/beacon/states/{:state-id} (to make this a non-debug API) making this an entirely regular and consistent request in line with the existing calls - clients that wish to constrain this can respond 501 or 503 when state-id is not finalized and we add guidance documentation for "checkpoint-sync-consumption" instead, so that clients converge on a single way to get the state.

+1 that not exposing JSON for SSZ might be a good idea, but even the SSZ is ~60mb and growing - this is not a light request by any means - again, we can solve this with guidance docs for client implementers (which would only use SSZ or use quality preferences to prefer ssz etc).
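As an illustration of the "quality preferences" idea, a consumer could build its state request as below. This is a sketch under assumptions: the base URL is hypothetical, the q-values are arbitrary, and the path is the existing debug endpoint mentioned earlier in the thread rather than a finalized design. The request is only constructed here, not sent.

```python
import urllib.request


def build_state_request(base_url: str) -> urllib.request.Request:
    """Build a GET for the finalized state that prefers SSZ over JSON."""
    url = base_url.rstrip("/") + "/eth/v2/debug/beacon/states/finalized"
    headers = {
        # SSZ (application/octet-stream) gets the higher quality value, so a
        # server supporting both encodings should answer with SSZ.
        "Accept": "application/octet-stream;q=1.0,application/json;q=0.9",
    }
    return urllib.request.Request(url, headers=headers, method="GET")
```

Sending it would be a matter of passing the request to `urllib.request.urlopen` and reading the body in chunks, since the response is tens of megabytes.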

[Many] Trust providers expose

What is a "trust provider" in this context, and where do you get a list of them (since there are many)? If it's a well-known list, they are open to MITM, DoS, blocking etc, so we need to tread carefully - if instead they are random nodes on the internet used for a majority decision, it might be simpler to expose this via libp2p and get discovery for free.

arnetheduck avatar Aug 09 '22 07:08 arnetheduck

What is a "trust provider" in this context, and where do you get a list of them (since there are many)? If it's a well-known list, they are open to MITM, DoS, blocking etc, so we need to tread carefully - if instead they are random nodes on the internet used for a majority decision, it might be simpler to expose this via libp2p and get discovery for free.

I am envisioning people on the internet who run Beacon nodes just telling their social network where it is. This may be private between a group of friends, or it may be a tweet, or maybe you post it on the sidebar of your blog. Some institutional providers may make it available more widely as a way of building a positive reputation as well.

I'm definitely not envisioning some well known list of "these are the trusted trust providers". It should be up to each individual to define their trust network (or delegate that to someone they trust).

MicahZoltu avatar Aug 10 '22 10:08 MicahZoltu

I am envisioning people on the internet who run Beacon nodes just telling their social network where it is.

If this is the use case, they might as well provide a state - it's a one-off, and it's not that bloody (unless you run a public service, in which case you rate limit it and then you're done) - ie it's important to keep in mind that there are two kinds of beacon nodes: those with validators and those without - for the former, you will not want to expose any API and for the latter category, there is really very little / no harm in exposing the entire API (minus keymanager of course).

arnetheduck avatar Aug 10 '22 19:08 arnetheduck

I would be comfortable sharing my checkpoint endpoint on Twitter and linking in the sidebar of my blog or something, where the reach is not huge, but also unbounded. I would not be comfortable sharing my state endpoint with that broad of an audience though. As an anecdote: I previously exposed my execution JSON-RPC API but didn't advertise it (just a DNS record and referenced it internally in an app I built) and I find it being used by random dapps and people throughout the ecosystem to this day, despite me never publicly linking to it anywhere.

The problem with the state endpoint is that I have to do something to protect myself, but with the checkpoint endpoint I likely don't have to do anything to protect myself (besides normal stuff that I already have for any publicly facing service). Even just a little bit of discouragement is enough to make people not expose a thing, and we want as many checkpoint endpoints available as possible.

MicahZoltu avatar Aug 11 '22 06:08 MicahZoltu

Alternatively, we may have GET /eth/v1/checkpoint/finalized_block_state returning bundled block and state at the finalized checkpoint, SSZ encoded:

class BeaconBlockAndState(Container):
  block: BeaconBlock
  state: BeaconState
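A consumer of such a bundle would probably want to check that the block and state actually belong together. The sketch below uses stand-in types (`BlockSummary`, `StateSummary`) holding only the fields involved, and takes `hash_tree_root` as an injected function, since a real implementation would come from an SSZ library. It assumes the bundled state is the post-state of the bundled block and has not been advanced through later empty slots.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BlockSummary:
    """Stand-in for the BeaconBlock fields used in this check."""
    slot: int
    state_root: bytes


@dataclass
class StateSummary:
    """Stand-in for the BeaconState fields used in this check."""
    slot: int


def bundle_is_consistent(block: BlockSummary, state: StateSummary,
                         hash_tree_root: Callable[[StateSummary], bytes]) -> bool:
    """Check that `state` is the post-state committed to by `block`."""
    # A block's state_root commits to the state after applying that block.
    return block.slot == state.slot and block.state_root == hash_tree_root(state)
```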

mkalinin avatar Sep 08 '22 14:09 mkalinin

Alternatively, we may have GET /eth/v1/checkpoint/finalized_block_state returning bundled block and state at the finalized checkpoint, SSZ encoded

I would be in favour of this for Lighthouse as the assumption that blocks and states come in pairs is quite deeply embedded. I'm not saying it would be impossible to remove but it would likely represent a substantial amount of work, particularly testing that no assumptions about block existence are violated. We often use the existence of a block in our database to determine if it is known/canonical, and currently block processing expects to be able to verify each block against its parent block + state.

Another related issue for us is that we currently require the checkpoint block to be from a non-skipped slot, although I think this requirement would be easier to remove (see https://github.com/sigp/lighthouse/issues/3210).

michaelsproul avatar Sep 09 '22 00:09 michaelsproul

Alternatively, we may have GET /eth/v1/checkpoint/finalized_block_state returning bundled block and state at the finalized checkpoint, SSZ encoded:

This is the Engine API, or a different API?

MicahZoltu avatar Sep 09 '22 07:09 MicahZoltu

This is the Engine API, or a different API?

Nah, the beacon API served by consensus clients (i.e. this repo). Spec is rendered online here: https://ethereum.github.io/beacon-APIs/

michaelsproul avatar Sep 09 '22 07:09 michaelsproul

Ah. As mentioned above, I think checkpoint stuff should be exposed on different ports because the expectation is that a different set of people will have access to these things.

  • Finalized Slot Root: Everyone
  • Checkpoint State: Friends & Family
  • Beacon API: Just Me

MicahZoltu avatar Sep 09 '22 08:09 MicahZoltu

I agree on the port different to what we have for beacon APIs. And I also think that CL clients should have at least the following set of flags:

  • --checkpoint-api-enabled[=<BOOLEAN>], default: false
  • --checkpoint-api-port=<INTEGER>, default: some_port
  • --checkpoint-api-state-enabled[=<BOOLEAN>], default: false

The state endpoint shouldn't be enabled by default as it's heavy. I guess setting up an HTTP server with caching in front of the checkpoint sync state endpoint would be an optimal strategy for state providers. They could set, say, a 1h TTL for this cache.
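The caching idea can be sketched in a few lines. In practice this would more likely live in a reverse proxy than in application code; the class name, the `load_state` callback, and the injectable `clock` are assumptions made so the example is self-contained.

```python
import time
from typing import Callable, Optional


class TTLCachedState:
    """Serve one cached copy of the state, reloading it after `ttl_seconds`."""

    def __init__(self, load_state: Callable[[], bytes],
                 ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load_state = load_state  # hypothetical: fetches SSZ bytes from the node
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[bytes] = None
        self._loaded_at = 0.0

    def get(self) -> bytes:
        now = self._clock()
        # Reload on first use or once the cached copy is older than the TTL.
        if self._value is None or now - self._loaded_at >= self._ttl:
            self._value = self._load_state()
            self._loaded_at = now
        return self._value
```

With a 1h TTL the beacon node behind the cache serializes the state at most once per hour, however many clients request it.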

mkalinin avatar Sep 09 '22 12:09 mkalinin

I think that this conversation has veered out of scope for an endpoint definition. In my opinion, this endpoint should be added to the beacon APIs same as the others, and made available on the same basis as the rest of the endpoints.

If there is a desire for this endpoint to be exposed to external/untrusted parties it might require any number of features such as authentication, DDoS protection, rate limiting, internal caching etc. that don't fall under the remit of an API specification. These can all be accomplished better by a piece of middleware that provides these features than attempting to add them to multiple beacon client implementations.

So please: no dedicated port, no specific options in the CL for this. Let's build an endpoint definition here and handle presentation of it elsewhere.

mcdee avatar Sep 09 '22 12:09 mcdee

So please: no dedicated port, no specific options in the CL for this. Let's build an endpoint definition here and handle presentation of it elsewhere.

I think the chance of a user shooting themselves in the foot is higher, and the chance of users providing these APIs is significantly lower if we just throw them all onto the Beacon API and tell users "if you want to provide a public good, you can do a bunch of additional work to make that happen". At the least, the checkpoint API is incredibly lightweight and probably safe for the vast majority of people to just expose freely.

  • --checkpoint-api-enabled[=<BOOLEAN>], default: false
  • --checkpoint-api-port=<INTEGER>, default: some_port
  • --checkpoint-api-state-enabled[=<BOOLEAN>], default: false

I think there should be another for --checkpoint-api-state-port=<INTEGER>, default: some_other_port.

MicahZoltu avatar Sep 09 '22 13:09 MicahZoltu

At the least, the checkpoint API is incredibly lightweight and probably safe for the vast majority of people to just expose freely.

It's a DoS vector, especially given the size of the state. But again, this is down to the implementation and has no impact on the inputs, processing or outputs of the endpoint which is what should be being discussed here.

mcdee avatar Sep 09 '22 14:09 mcdee

GET /eth/v1/checkpoint/finalized_blocks/{slot}/root

Since this is the only unique API endpoint in this proposal (the others already exist and are easy to expose one-by-one anyway), an alternative is that we add, to the "ordinary" block requests (like getBlocksV2), finality information similar to execution_optimistic.

Anything else, we relegate to a separate spec or document that outlines the endpoints that checkpoint-syncing clients use (ie "when performing checkpoint sync, clients must use only requests XYZ in SSZ encoding") - the rest (separate ports, endpoint options in clients etc) will naturally fall into place after that.

The goal would be to keep the API orthogonal and not keep multiple endpoints exposing the exact same data - one can imagine that having is_finalized on getBlocksV2 and its more detailed friends is something any consumer would want to know.

arnetheduck avatar Sep 09 '22 14:09 arnetheduck

I agree with @mcdee on the following:

this conversation has veered out of scope for an endpoint definition

Having these endpoints as part of Beacon API is an option, but in this case CL clients must have CLI options to enable the checkpoint API on a different port for UX reasons (we don't have to specify the latter in Beacon APIs). State and trust provider UX is the main goal behind this proposal.

Ideally, we should have built-in DoS protection for the state endpoint as well (it can be done by caching the state in memory). That would make the life of state providers easier. But state providers are likely to be experienced node operators capable of providing their own DoS protection.

The block root endpoint must have built-in DoS protection (if the protection is required), making exposing it as easy as --cp-api --cp-api-host 0.0.0.0, which the average node operator can do. If we expect trust providers to set up any DoS protection on their own then I don't see any point in having a separate checkpoint sync API at all and we may shut down this conversation.

Since this is the only unique API endpoint in this proposal

This is not 100% true.

The state endpoint is currently a part of /debug namespace and I think it or its shortcut should be moved out of this scope if we want it to be exposed publicly.

I still think we should come to a place where a single endpoint provides all the data required to bootstrap a node via checkpoint sync. So, we either change CL clients behaviour or have block and state bundles served by the new endpoint which was proposed a bit earlier in this thread.

mkalinin avatar Sep 12 '22 10:09 mkalinin

Regarding providing a block, one reason we've been holding back the refactoring necessary for dealing with block-less states is the difficulty that happens when the zeroth block in the slot is empty - this brings a few conundrums:

  • should we supply the "latest block" for empty slots? there may be multiple slots of gap between the state and the block - spec-wise, these gaps can even extend past the 8192 blocks which we keep track of in the state and clients have to "correctly" handle this case (or at least be deliberate in providing a
    • if we only allow checkpoint states that have a block in the same slot, the problem is simplified by constraining the spec
    • if we allow skip slots, the state root is not present anywhere else in the protocol, except the state_roots table of "future" states - ie the checkpoint will be a state whose state root never appears in a block

arnetheduck avatar Sep 12 '22 12:09 arnetheduck

The state endpoint is currently a part of /debug namespace

we haven't really clarified what that means though - ie it's a standardised call like any other at this point and debug is just a name/label.

if we want it to be exposed publicly.

this is slightly orthogonal - ie we can expose a debug endpoint - or even a debug endpoint constrained to SSZ alone - on a separate port even with the current "structure" if we want to - there's already a unique URL that can be used for filtering - but I'm still curious about the balancing act of doing so: the state request is (by far) the heaviest and most intensive request - once that is exposed to the public, that node is virtually useless as a staking node so there is very little harm in exposing some of the other API as well.

arnetheduck avatar Sep 12 '22 12:09 arnetheduck

the state request is (by far) the heaviest and most intensive request

The non-debug validators endpoint is actually the one that requires the most work to generate and causes the most problems when exposing these APIs to the public. You need both the validators array and balances combined together, potentially apply a filter, and it generates a significantly larger JSON response than getting the state.

More importantly the API proposed here to return a state allows the node to select which state to return whereas the existing API allows any state to be requested. Sending a heap of requests for random states across the history of the chain is dramatically more expensive to serve than just returning a state that the node chooses. For example the node could just cache the latest finalized state in memory or on disk and stream the static response without doing any processing at all.

should we supply the "latest block" for empty slots? there may be multiple slots of gap between the state and the block - spec-wise, these gaps can even extend past the 8192 blocks which we keep track of in the state and clients have to "correctly" handle this case (or at least be deliberate in providing a

I'd say the state provided needs to be from a slot that had a block in it, otherwise the handling gets significantly more complex. And to work with the STATUS exchange in networking that block needs to be the one that would be used for an epoch checkpoint. So it's the block from the first slot of the epoch or if that's an empty slot the most recent block before that. To provide the most up to date state possible, you'd just get the block root from the current finalized checkpoint and get the state for that block.
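The block selection rule described here (first slot of the epoch, or the most recent block before it if that slot is empty) can be sketched as follows. `SLOTS_PER_EPOCH = 32` is the mainnet value; the function name and the representation of the chain as a set of slots that contain blocks are assumptions for illustration.

```python
from typing import Iterable, Optional

SLOTS_PER_EPOCH = 32  # mainnet value


def checkpoint_block_slot(epoch: int, block_slots: Iterable[int]) -> Optional[int]:
    """Slot of the block used as the checkpoint block for `epoch`.

    `block_slots` lists the slots of the canonical chain that contain a block.
    """
    epoch_start = epoch * SLOTS_PER_EPOCH
    # Take the latest block at or before the first slot of the epoch.
    candidates = [slot for slot in block_slots if slot <= epoch_start]
    return max(candidates) if candidates else None
```

For epoch 2 (starting at slot 64), a chain with blocks at slots 60, 63 and 65 yields slot 63, because slot 64 is empty.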

The API already requires finding a block by root so there shouldn't be any issues with validating that a block identified by root is from a specific slot regardless of the number of empty slots.

ajsutton avatar Sep 12 '22 23:09 ajsutton

So it's the block from the first slot of the epoch or if that's an empty slot the most recent block before that.

I think this insight is what we need to solve the alignment problem in Lighthouse's current implementation (https://github.com/sigp/lighthouse/issues/3210). Lighthouse needs the checkpoint state to lie on an epoch boundary because our database schema only stores states on epoch boundaries, so I think what we'll do is download the finalized block's state (which may be from the previous epoch) and advance it into the next epoch. This begs the question of how far to advance it though, as I think it's possible for a block from epoch $n$ to be finalized at epoch $n + 2$ if all the blocks in epoch $n + 1$ are skipped:

[Figure: finalize_with_skip]

To set the finalized state to the state from the start of epoch $n + 1$ would be incorrect, and might allow an attacker to feed us blocks from $n + 1$ that conflict with finalization without us knowing. Also, there's nothing particular to this example that requires $B$ to be unaligned, we'd have a similar problem if $B$ were in the first slot of $n + 1$ (so I think this issue applies to all checkpoint sync implementations today, already).

If we wanted to address this it might make sense to bundle the finalized epoch in the BeaconBlockAndState container proposed by @mkalinin, with the semantics that the finalized state is the result of advancing the provided state to that epoch.

michaelsproul avatar Sep 13 '22 00:09 michaelsproul

To set the finalized state to the state from the start of epoch n+1 would be incorrect, and might allow an attacker to feed us blocks from n+1 that conflict with finalization without us knowing.

Why is that a problem? As long as the state is from within the weak subjectivity period they won't be able to make you finalise something incorrect and so you'll wind up finding the right chain and following it. You are basically in the same situation as you would have been had you been in sync and following the chain from epoch n onwards.

You would get a problem if n+2 was within the weak subjectivity period but the state was from before it; however, the requirement is that the state itself be within the weak subjectivity period.

So to answer the specific question, you'd process slots to move the state forward to the start of the next epoch. You know there weren't any blocks in that period (or the state should have been from one of those blocks) but you can't tell if there were no blocks for the next epoch as well so can't take it any further forward.
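The target slot for that advance is simple arithmetic. The sketch below computes it, assuming the mainnet `SLOTS_PER_EPOCH` of 32; treating a state that already sits on an epoch boundary as needing no advance is my assumption, not something stated in the thread. The advance itself would be done with the client's equivalent of the spec's `process_slots`.

```python
SLOTS_PER_EPOCH = 32  # mainnet value


def epoch_boundary_target(state_slot: int) -> int:
    """First epoch-boundary slot at or after `state_slot`."""
    remainder = state_slot % SLOTS_PER_EPOCH
    if remainder == 0:
        return state_slot  # already aligned (assumed to need no advance)
    # Otherwise move forward to the start of the next epoch, and no further.
    return state_slot + (SLOTS_PER_EPOCH - remainder)
```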

ajsutton avatar Sep 13 '22 00:09 ajsutton

@ajsutton I think my example is actually too conservative, if there's a gap with no blocks that spans the weak subjectivity period then the epoch following block $B$ could be old enough for an attacker to equivocate without penalty. It would be very bad to have the canonical chain lacking blocks for this long, but it could happen as a result of a major internet outage or an inactivity leak + block proposal bugs. This kind of relates to what @arnetheduck was saying about the finalized block being >8192 slots prior to the finalized checkpoint.

I hadn't made this connection before in relation to checkpoint sync, so figured it was worth noting. I agree in practice it's unlikely to be relevant, but if we're adding a new endpoint that bundles the state and a block, we may as well throw the finalized epoch in there too.

michaelsproul avatar Sep 13 '22 04:09 michaelsproul

@ajsutton I think my example is actually too conservative, if there's a gap with no blocks that spans the weak subjectivity period then the epoch following block B could be old enough for an attacker to equivocate without penalty. It would be very bad to have the canonical chain lacking blocks for this long, but it could happen as a result of a major internet outage or an inactivity leak + block proposal bugs. This kind of relates to what @arnetheduck was saying about the finalized block being >8192 slots prior to the finalized checkpoint.

For this to be a problem we'd need to have a block at say slot 10,000, followed by a long gap of multiple weeks, longer than the weak subjectivity period, to say slot 110,000, in which there are no blocks at all. Then we'd need to include some blocks that contain attestations from that offline period such that we finalise the empty slots but we don't finalize any of the new blocks, and you'd have to be wanting to do a checkpoint sync before finalisation updates to include any new blocks. The attacker meanwhile has to create the alternative chain, with validators that exited and became withdrawable during the period of empty blocks, and perform a sybil attack to get you to follow their chain instead of the correct chain from the state you sync'd from. That would be quite impressive and the fix would be to wait until the chain finalizes a new block.

I'm not convinced this is a viable threat let alone that it's a bigger risk than managing to introduce a bug in the extra code required to handle the checkpoint epoch being provided (because you just know you're going to wind up getting a checkpoint epoch that's actually from before the state due to some race condition).

ajsutton avatar Sep 13 '22 04:09 ajsutton

More importantly the API proposed here to return a state allows the node to select which state to return whereas the existing API allows any state to be requested

We would white-list only the finalized state id in the checkpoint consumer documentation and servers would expose only that - there is no additional burden or risk for the server compared to the proposal in this PR: as long as consumers agree to use only finalized, and use only SSZ, that is sufficient for the "servers" to standardise on exposing state/finalized (and maybe block with a general "finalized" flag that is useful for other consumers too)
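A server-side version of that white-list could look like the sketch below. The status codes are assumptions: 501 for non-finalized state ids follows the suggestion made earlier in the thread, and 406 for requests that do not accept SSZ is standard HTTP usage rather than anything agreed here.

```python
def state_request_status(state_id: str, accept_header: str) -> int:
    """HTTP status a checkpoint-only server would return for a state request."""
    # Only the `finalized` state id is white-listed.
    if state_id != "finalized":
        return 501

    # Strip q-values and whitespace to get the bare media types.
    media_types = [part.split(";")[0].strip().lower()
                   for part in accept_header.split(",")]
    # Only SSZ (application/octet-stream) responses are served.
    if "application/octet-stream" not in media_types:
        return 406
    return 200
```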

arnetheduck avatar Sep 13 '22 05:09 arnetheduck

More importantly the API proposed here to return a state allows the node to select which state to return whereas the existing API allows any state to be requested

We would white-list only the finalized state id in the checkpoint consumer documentation and servers would expose only that - there is no additional burden or risk for the server compared to the proposal in this PR: as long as consumers agree to use only finalized, and use only SSZ, that is sufficient for the "servers" to standardise on exposing state/finalized (and maybe block with a general "finalized" flag that is useful for other consumers too)

Altering the path to .../checkpoint/finalized_state does seem to be a minor change compared to the separate port capability and the SSZ-only and finalized-only requirements that the server will have to check (isn't a separate controller already needed to make this work?)

mkalinin avatar Sep 14 '22 11:09 mkalinin