chore: add rust PeerdasKZG crypto library for peerdas functionality and rollback c-kzg dependency to 4844 version
Issue Addressed
This relies on #5938
Below was copied from Dapplion:
- Description: Replace c-kzg peerdas crypto with the new lib peerdas-kzg
- Rationale: Do not touch mainnet paths, and use this lib only for das codepaths
- Justification: Eventually this lib should become canonical, and we can add a switch later when c-kzg peerdas merges to main
Proposed Changes
Please list or describe the changes introduced by this PR.
Additional Info
Please provide any additional information. For example, future considerations or information useful for reviewers.
Thanks @kevaundray! Are there any benchmarks for the peerdas-kzg crate?
Yep, I added some a few hours ago. You can cd eip7594 and run cargo bench. For a single thread it should have around the same performance as c-kzg; for computing cells and proofs I get ~2x when I set NUM_THREADS to 8. I expect this to improve further as more components are parallelized.
Nice! Thanks a lot for this @kevaundray
There are a few things we need to do before merging this, as das is the active branch currently being tested on peerdas-devnet-1:
- Investigate and fix CI failures
- I'd love to see some local devnet testing before we merge this in (I'm using the config from #5968)
- It'd be nice to make this either runtime configurable or behind a feature gate, so we can easily switch between libs without blocking devnet testing in case we run into bugs / perf issues.
If you're busy, I'd be happy to keep this moving in the next week, and I'm also happy to do some testing.
Investigate and fix CI failures
Seems my local change renaming the library in the workspace didn't get pushed, sorry about that. Fixed the other linter complaints.
It'd be nice to make this either runtime configurable or behind a feature gate, so we can easily switch between libs without blocking devnet testing in case we run into bugs / perf issues.
Note that this means we would need to point c-kzg back to the das branch, whereas this PR reverts it to its 4844 version so that no 4844 code is affected. Is there a way to check if there are performance issues before merging?
Re bugs: the cryptography libraries have their own set of test vectors, so I'd be surprised if a bug is encountered while the test vectors still pass, unless the test vectors are lacking (which is also possible).
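For illustration, a compile-time switch along those lines could look roughly like the sketch below; the feature name peerdas-kzg and the module layout are made up for the example, not code in this PR.

// Sketch only: the feature name "peerdas-kzg" and this module layout are
// illustrative, not code in this PR or in Lighthouse.

#[cfg(feature = "peerdas-kzg")]
mod das_kzg_backend {
    // das code paths would call into the new Rust library here.
    pub const NAME: &str = "peerdas-kzg";
}

#[cfg(not(feature = "peerdas-kzg"))]
mod das_kzg_backend {
    // ...and otherwise fall back to the c-kzg das branch.
    pub const NAME: &str = "c-kzg (das branch)";
}

fn main() {
    println!("active das KZG backend: {}", das_kzg_backend::NAME);
}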
I'd love to see some local devnet testing before we merge this in (I'm using the config from https://github.com/sigp/lighthouse/pull/5968)
I ran the local devnet script in that PR you linked. I looked at the logs and saw no errors except for some sporadic handshake failures (I'm not familiar with the p2p layer, so I'm not sure if this is normal, though I don't think it's related to this PR).
Also see: [el-1-geth-lighthouse] ERROR[...] Nil finalized block cannot evict old blobs
Were you looking for something in particular wrt testing it locally?
Although we don't use the proofs that get passed to recover_cells_and_kzg_proofs, we need to pass them to the library as a parameter, and it performs redundant checks on them (this should eventually be removed; the consensus-specs PR for it is https://github.com/ethereum/consensus-specs/pull/3819)
It's unclear to me why this is passing on the DAS branch, since we also don't pass proofs in there and it's using the old interface where you recover_cells, then convert the cells to a blob, then compute proofs.
In order to make it work, we would need reconstruct to also pass along the proof for each cell to recover_cells_and_proofs.
EDIT: It passes on the DAS branch because it is not using the most updated version of c-kzg, which has all of the extra checks on the input proofs.
I've pulled out the API change for recover_cells_and_proofs to #5998
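To make the interface shape concrete, here is a rough sketch using placeholder types and a hypothetical trait; these are not the real c-kzg, peerdas-kzg, or Lighthouse APIs.

// Placeholder types and a hypothetical trait, purely to illustrate the
// interface shape discussed above; not the real c-kzg, peerdas-kzg, or
// Lighthouse APIs.
pub struct Cell;
pub struct KzgProof;

pub trait CellRecovery {
    // Newer c-kzg-style shape: proofs for the cells we already hold are
    // passed in, and are only used for redundant checks on the input.
    fn recover_cells_and_kzg_proofs(
        &self,
        cell_indices: &[u64],
        cells: &[Cell],
        proofs: &[KzgProof],
    ) -> Result<(Vec<Cell>, Vec<KzgProof>), String>;
}

// A reconstruct helper therefore has to keep each cell's proof around and
// thread it through, even though the recovered proofs are recomputed anyway.
pub fn reconstruct<K: CellRecovery>(
    kzg: &K,
    cell_indices: &[u64],
    cells: &[Cell],
    proofs: &[KzgProof],
) -> Result<(Vec<Cell>, Vec<KzgProof>), String> {
    kzg.recover_cells_and_kzg_proofs(cell_indices, cells, proofs)
}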
@jimmygchen I've run the ef-tests locally and make lint; I'll leave local devnet testing and the other blockers in your hands.
Sure, I'll do the testing, thanks a lot for this :relaxed:
Sorry about the delay in responding! Just to answer your questions:
Is there a way to check if there are performance issues before merging?
Yep, we do have some kzg metrics exposed via the {host}:{metrics_port}/metrics endpoint, and it may be useful to compare them:
> curl -s $(kurtosis port print local-testnet cl-2-lighthouse-geth metrics)/metrics | grep -E "HELP.+column"
# HELP beacon_blobs_column_sidecar_gossip_verification_seconds Full runtime of data column sidecars gossip verification
# HELP beacon_blobs_column_sidecar_processing_requests_total Count of all data column sidecars submitted for processing
# HELP beacon_blobs_column_sidecar_processing_successes_total Number of data column sidecars verified for gossip
# HELP beacon_data_column_gossip_propagation_verification_delay_time Duration between when the data column sidecar is received over gossip and when it is verified for propagation.
# HELP beacon_data_column_gossip_slot_start_delay_time Duration between when the data column sidecar is received over gossip and the start of the slot it belongs to.
# HELP beacon_data_column_last_delay Keeps track of the last data column sidecar's delay from the start of the slot
# HELP beacon_processor_gossip_data_column_queue_total Count of data column sidecars from gossip waiting to be verified.
# HELP beacon_processor_gossip_data_column_verified_total Total number of gossip data column sidecar verified for propagation.
# HELP beacon_processor_rpc_custody_column_queue_total Count of custody columns from the rpc waiting to be imported.
# HELP beacon_processor_rpc_verify_data_column_queue_total Count of data columns from the rpc waiting to be verified.
# HELP data_availability_reconstructed_columns_total Total count of reconstructed columns
# HELP data_availability_reconstruction_time_seconds Time taken to reconstruct columns
# HELP data_column_sidecar_computation_seconds Time taken to compute data column sidecar, including cells, proofs and inclusion proof
# HELP data_column_sidecar_inclusion_proof_verification_seconds Time taken to verify data_column sidecar inclusion proof
# HELP kzg_verification_data_column_single_seconds Runtime of single data column kzg verification
Were you looking for something in particular wrt testing it locally?
I think the main thing to look for when testing peerdas is making sure the network continues to produce blocks and nodes stay in sync with each other; this means columns are being produced and propagated, and things likely work end-to-end. And if we want to make sure it interops with c-kzg, it might be worth running the network-params-das-interop.yaml one (with Prysm and Teku).
No worries! I can see you have your hands full :)
I'm working on integrations for a few other repos, so I'll probably have to circle back to collecting information on the local devnet next week. This information is useful in general, so if you get to it before me, feel free to post it here!
Hi @kevaundray,
Apologies on the delay getting to this. I was out the last couple of days.
I did a quick round of testing locally and it seems to be fine with a Lighthouse only network, but when I ran it with Prysm and Teku, the network forked pretty quickly.
I'll look into it again tomorrow, maybe I'm missing something in my network config.
No worries :)
I'm guessing the DAS branch without this change is okay?
I'm not entirely sure tbh, I didn't get much time on this today. I didn't see any KZG errors, so I suspect it may be related to mismatched das network defaults across clients, and we may have to override them in the Kurtosis config. I'll try again tomorrow.
I've tested again and it's looking pretty good after I added network config overrides, so I guess clients currently have different default values:
participants:
  - cl_type: prysm
    cl_image: ethpandaops/prysm-beacon-chain:peerDAS
  - cl_type: lighthouse
    cl_extra_params: [
      --subscribe-all-data-column-subnets,
    ]
    cl_image: lighthouse:local
  - cl_type: lighthouse
    cl_image: lighthouse:local
  - cl_type: teku
    cl_image: ethpandaops/teku:nashatyrev-das
  # - cl_type: nimbus
  #   cl_image: ethpandaops/nimbus-eth2:kzgpeerdas
  #
  # - cl_type: grandine
  #   cl_image: ethpandaops/grandine:das
  #
  # - cl_type: lodestar
  #   cl_image: ethpandaops/lodestar:peerDAS
network_params:
  eip7594_fork_epoch: 0
  eip7594_fork_version: "0x50000038"
  data_column_sidecar_subnet_count: 64
  samples_per_slot: 16
  custody_requirement: 4
  target_number_of_peers: 70
snooper_enabled: false
global_log_level: debug
additional_services:
  - dora
  - goomy_blob
I'll post a metrics comparison soon.
Metrics from running this branch look pretty good and comparable to the das branch.
This branch (over 1 h):
das branch (over 5 m):
@dapplion I've tested the branch and it works pretty well!
I've attached some metrics in the above comment. I've run out of time to review the new changes, but feel free to merge if you're happy with it; otherwise I'll re-review when I have a chance.
Thanks a lot for this @kevaundray ! 🙏