Batch attestation slashability checking
Description
With large numbers of validators and heavy disk congestion, we've observed the slashing protection database timing out with errors like this:
```
Nov 16 00:50:31.056 INFO Successfully published attestation type: unaggregated, slot: 39850, committee_index: 2, head_block: 0x3bd75ebc65de718b36911eaab7dad3a9ef7ca44c7ba03d32c7092195c43bcf43, service: attestation
Nov 16 00:50:32.838 CRIT Not signing slashable block error: SQLPoolError("Error(None)")
Nov 16 00:50:32.838 CRIT Error whilst producing block message: Unable to sign block, service: block
```
The default timeout is 5 seconds.
Presently, we sign attestations one at a time and broadcast them, which means that each attester requires a new SQLite database transaction.
https://github.com/sigp/lighthouse/blob/master/validator_client/src/attestation_service.rs#L332-L404
To alleviate the congestion slightly, we could switch to an algorithm like:
- Begin an SQLite transaction `txn`
- Check and sign all attestations as part of `txn`
- Commit `txn`
- Broadcast the attestations
That way we preserve the property that an attestation is only broadcast if its signature has been persisted to disk (which is crash-safe). Broadcasting after each check but before the transaction commits would violate this property.
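As a rough sketch of what that could look like (the table name, columns, and helper functions here are placeholders, not Lighthouse's actual slashing protection code):

```rust
use rusqlite::{params, Connection, Result, Transaction};

// Hypothetical attestation summary; the real schema differs.
#[derive(Clone)]
struct AttestationData {
    validator_id: i64,
    source_epoch: i64,
    target_epoch: i64,
}

struct SignedAttestation(AttestationData);

fn sign_and_broadcast_batch(conn: &mut Connection, batch: &[AttestationData]) -> Result<()> {
    // 1. Begin a single transaction `txn` for the whole batch.
    let txn = conn.transaction()?;

    let mut signed = Vec::new();
    for att in batch {
        // 2. Check and sign each attestation as part of `txn`.
        if is_slashable(&txn, att)? {
            continue; // refuse to sign this one, carry on with the rest
        }
        txn.execute(
            "INSERT INTO signed_attestations (validator_id, source_epoch, target_epoch)
             VALUES (?1, ?2, ?3)",
            params![att.validator_id, att.source_epoch, att.target_epoch],
        )?;
        signed.push(sign(att));
    }

    // 3. Commit `txn`: every signature we produced is now persisted to disk...
    txn.commit()?;

    // 4. ...and only then is anything broadcast.
    broadcast(&signed);
    Ok(())
}

// Stand-in for the real slashability check (double vote / surround vote).
fn is_slashable(txn: &Transaction<'_>, att: &AttestationData) -> Result<bool> {
    let count: i64 = txn.query_row(
        "SELECT COUNT(*) FROM signed_attestations
         WHERE validator_id = ?1 AND target_epoch = ?2",
        params![att.validator_id, att.target_epoch],
        |row| row.get(0),
    )?;
    Ok(count > 0)
}

fn sign(att: &AttestationData) -> SignedAttestation {
    SignedAttestation(att.clone()) // BLS signing elided
}

fn broadcast(signed: &[SignedAttestation]) {
    let _ = signed; // network publish elided
}
```

The key point is just the ordering: nothing is handed to `broadcast` until `commit` has returned.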
Version
Lighthouse v0.3.4
Another potential advantage to batching like this: we might be able to harden SQLite against crashes by performing a disk sync after each batch, although we would still be vulnerable to disks that don't respect fsync (see section 3.1 of https://www.sqlite.org/howtocorrupt.html).
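For concreteness, one way to get "a sync per batch" would be SQLite's `synchronous` pragma combined with the one-transaction-per-batch flow above: at `FULL` or `EXTRA`, SQLite fsyncs as part of each commit, so committing once per batch means one sync per batch. A minimal sketch (the function name is made up):

```rust
use rusqlite::{Connection, Result};

/// Open the database with the most paranoid sync setting. EXTRA behaves like
/// FULL (fsync as part of every commit) and additionally syncs the directory
/// containing the rollback journal when a transaction commits.
fn open_slashing_db(path: &str) -> Result<Connection> {
    let conn = Connection::open(path)?;
    conn.execute_batch("PRAGMA synchronous = EXTRA;")?;
    Ok(conn)
}
```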
I tried to implement this atop the Web3Signer PR (#2522) just now. I quickly ran into borrowck hell, as the SQLite library doesn't support sharing a `Transaction` across threads (it is not `Send`); see https://github.com/rusqlite/rusqlite/issues/697.
This doesn't currently cause problems at compile time because each call to `sign_attestation` begins its own local transaction, which commits before the async call to the background thread/remote signer is made. However, the database only supports one active transaction at a time (for safety), so this is a bottleneck and will hamper the parallel requests to the remote signer: all the tasks will need to individually obtain the transaction lock and commit before they hit the remote :(
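To make the bottleneck concrete, here's a rough sketch of the shape that compiles today (not the real Lighthouse code; the `Arc<Mutex<Connection>>` wrapper and helper names are invented): the transaction lives entirely before the `.await`, so the borrow checker is satisfied, but every task has to serially take the lock and commit before it can even talk to the remote signer.

```rust
use std::sync::{Arc, Mutex};

use rusqlite::Connection;

async fn remote_sign(_validator_id: u64) {
    // HTTP request to the remote signer, elided.
}

async fn sign_attestation(db: Arc<Mutex<Connection>>, validator_id: u64) -> rusqlite::Result<()> {
    {
        // Per-call local transaction: check slashability, record the
        // attestation, commit. Only one task can be inside this block at a
        // time, so N validators mean N serialised lock/commit round-trips.
        let mut conn = db.lock().unwrap();
        let txn = conn.transaction()?;
        // ... slashability check + insert for `validator_id` against `txn` ...
        txn.commit()?;
    } // the guard (and with it the DB lock) is released here, before the await

    // Only now do we contact the remote signer. The `Transaction` never
    // crosses an `.await`, which is what keeps this compiling today.
    remote_sign(validator_id).await;
    Ok(())
}
```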
So far the best solution I can think of is quite spicy :hot_pepper:, and changes one of the fundamental invariants of the validator client. Rather than enforcing:
(1) Only sign an attestation if it is safe to do so
We could enforce:
(2) Only broadcast an attestation if it is safe to do so
This would let us make the slashability checks in one batch after all of the signatures had been created in parallel. No need to hold a lock on a Transaction across await calls, and no lock contention because we just lock and write once. For this to really work though we would need to refactor the attestation service. Currently it spawns one task per committee and index, with a broadcast happening in each task. I need to think more about whether refactoring it to sign all attestations, store all attestations, broadcast all attestations is feasible, and whether it would introduce any other issues.
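Very roughly, the flow under invariant (2) would look something like this (all names are hypothetical and the slashability check itself is elided):

```rust
use rusqlite::Connection;

#[derive(Clone)]
struct Attestation {
    validator_id: i64,
    source_epoch: i64,
    target_epoch: i64,
}

struct SignedAttestation(Attestation);

async fn sign(att: Attestation) -> SignedAttestation {
    // Parallel/remote signing, elided. No database access happens here.
    SignedAttestation(att)
}

/// Batch slashability check: record the safe attestations in one transaction
/// and return only those, dropping anything slashable.
fn check_batch(
    conn: &mut Connection,
    signed: Vec<SignedAttestation>,
) -> rusqlite::Result<Vec<SignedAttestation>> {
    let txn = conn.transaction()?;
    let mut safe = Vec::new();
    for s in signed {
        // ... slashability check + insert for `s` against `txn` (elided);
        // skip the push and drop `s` if it is slashable ...
        safe.push(s);
    }
    txn.commit()?;
    Ok(safe)
}

async fn attest(conn: &mut Connection, duties: Vec<Attestation>) -> rusqlite::Result<()> {
    // 1. Sign everything in parallel; no database lock is held here.
    let signed = futures::future::join_all(duties.into_iter().map(sign)).await;
    // 2. One batched slashability check: lock, write, commit, done.
    let safe = check_batch(conn, signed)?;
    // 3. Only broadcast what was checked and persisted.
    broadcast(&safe).await;
    Ok(())
}

async fn broadcast(_atts: &[SignedAttestation]) {
    // Publish to the beacon node, elided.
}
```

The database lock is only touched in step 2, once per batch.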
Alternatively, a more involved solution would be to modify the slashing protection database's locking paradigm so that it can lock on a per-validator basis. We should never be signing more than 1 attestation for the same validator in a single slot, so this would be ideal, but it seems that we wouldn't be able to (easily) achieve this with SQLite: https://softwareengineering.stackexchange.com/questions/340550/why-are-concurrent-writes-not-allowed-on-an-sqlite-database
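Purely to illustrate what "lock on a per-validator basis" would mean (this sketch lives in process memory and ignores how the underlying SQLite writes would actually be serialised):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Illustration only: a per-validator lock table. Two different validators
/// can run their check-and-write concurrently, while two signing attempts for
/// the *same* validator still serialise.
#[derive(Default, Clone)]
struct PerValidatorLocks {
    locks: Arc<Mutex<HashMap<u64, Arc<Mutex<()>>>>>,
}

impl PerValidatorLocks {
    fn lock_for(&self, validator_id: u64) -> Arc<Mutex<()>> {
        self.locks
            .lock()
            .unwrap()
            .entry(validator_id)
            .or_insert_with(|| Arc::new(Mutex::new(())))
            .clone()
    }
}

fn main() {
    let locks = PerValidatorLocks::default();
    let handle = locks.lock_for(42);
    let _guard = handle.lock().unwrap();
    // ... slashability check + write for validator 42 goes here; other
    // validators are free to proceed in parallel ...
}
```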
It seems to me that using another DB should be considered if SQLite doesn't support desirable features that Lighthouse could/should support. For example, if it allowed an easier implementation of #2557, which is important to me, I'd go in that direction.
We've had a report in Discord about the web3signer endpoint taking more than 20 seconds to respond to a request. I'm thinking it has to do with lock contention on the slashing protection DB. The same user also reported that attesting validators missed attestations while new validators were being created via the web3signer endpoint.
Gonna start on this for Holesky's sake.
I never got around to implementing this, but would be happy to pair on it with someone
I'm interested in working on this
@eserilev that would be awesome! Let me know if you have any Qs
Made some decent progress here!
I have a POC branch here: https://github.com/eserilev/lighthouse/tree/attestation-slashability-poc
Here's a quick summary of the changes made so far:
- First we spawn threads on a per-validator basis to sign attestations. This ends up returning a list of signed attestations. Note that this is the only point at which we spawn multiple threads; everything else below is single-threaded.
- Using the list of signed attestations, we check and insert (if safe to do so) attestations into the slashing protection DB.
  - As a small optimization, we could insert in bulk (see the sketch after this list).
  - Another small optimization is to share the txn across all the DB calls.
  - This should be kept single-threaded IMO.
  - If a signed attestation isn't safe, we get rid of it.
- We then broadcast the safe attestations. There doesn't seem to be a need to spawn multiple threads for this step, as the `post_beacon_pool_attestations` beacon API endpoint allows us to post a list of attestations.
- As a final step we publish the attestation aggregates.
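For reference, a condensed sketch of the bulk-insert / shared-txn optimization mentioned above (the table, columns, and types are made up, not the real slashing protection schema): one transaction and one prepared statement cover the whole batch instead of a transaction per validator.

```rust
use rusqlite::{params, Connection, Result};

// Hypothetical row type for an attestation that has already passed the check.
struct SafeAttestation {
    validator_id: i64,
    source_epoch: i64,
    target_epoch: i64,
}

fn insert_batch(conn: &mut Connection, batch: &[SafeAttestation]) -> Result<()> {
    let txn = conn.transaction()?;
    {
        // Prepare once, execute for every row in the batch.
        let mut stmt = txn.prepare(
            "INSERT INTO signed_attestations (validator_id, source_epoch, target_epoch)
             VALUES (?1, ?2, ?3)",
        )?;
        for att in batch {
            stmt.execute(params![att.validator_id, att.source_epoch, att.target_epoch])?;
        }
    } // drop the statement so `txn` is no longer borrowed
    txn.commit()?;
    Ok(())
}
```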
The code itself is fairly messy; I was really just trying to get something working on a Kurtosis testnet. The local testnet I'm running has 5000 validators running against a single BN. It has 100% attestation performance and has reached finality. Looking at the VC logs, I'm not seeing any timeout issues or other related errors.
Next steps for me are to clean this up, run some more local tests, and prepare it for a potential deployment to a Holesky validator.