chore: add rust PeerdasKZG crypto library for peerdas functionality and rollback c-kzg dependency to 4844 version

Open kevaundray opened this issue 1 year ago • 11 comments

Issue Addressed

This relies on #5938

Below was copied from Dapplion:

Description: Replace c-kzg peerdas crypto with the new lib peerdas-kzg
Rationale: Do not touch mainnet paths, and use this lib only for das codepaths
Justification: Eventually this lib should become canonical, and we can add a switch later when ckzg peerdas merges to main

Proposed Changes

Please list or describe the changes introduced by this PR.

Additional Info

Please provide any additional information. For example, future considerations or information useful for reviewers.

kevaundray avatar Jun 17 '24 18:06 kevaundray

Thanks @kevaundray! Are there any benchmarks for the peerdas-kzg crate?

jimmygchen avatar Jun 19 '24 23:06 jimmygchen

Thanks @kevaundray! Are there any benchmarks for the peerdas-kzg crate?

Yep, I added some a few hours ago: you can cd into eip7594 and run cargo bench. For a single thread, it should have around the same performance as c-kzg. For computing cells and proofs, I get ~2x when I set NUM_THREADS to 8. I expect this number to be driven down further as more components are parallelized.

kevaundray avatar Jun 20 '24 22:06 kevaundray

Nice! Thanks a lot for this @kevaundray

There are a few things we need to do before merging this, as das is the active branch currently being tested on peerdas-devnet-1:

  • Investigate and fix CI failures
  • I'd love to see some local devnet testing before we merge this in (I'm using the config from #5968)
  • It'd be nice to make this either runtime configurable or behind a feature gate, so we can easily switch between libs without blocking devnet testing in case we run into bugs / perf issues.
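For the feature-gate option, the Cargo wiring could look something like this sketch; the crate names, versions, and feature name are illustrative placeholders, not the actual setup in this PR:

```toml
# Hypothetical feature wiring for switching KZG backends at build time.
[features]
default = []
peerdas-kzg = ["dep:peerdas-kzg"]

[dependencies]
c-kzg = "1"
peerdas-kzg = { version = "0.1", optional = true }
```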

If you're busy, I'd be happy to keep this moving in the next week; I'm also happy to do some testing.

jimmygchen avatar Jun 26 '24 01:06 jimmygchen

Investigate and fix CI failures

It seems my local change renaming the library in the workspace didn't get pushed, sorry about that. I've fixed the other linter complaints.

It'd be nice to make this either runtime configurable or behind a feature gate, so we can easily switch between libs without blocking devnet testing in case we run into bugs / perf issues.

Note that this means that we would need to point c-kzg back to the das branch, whereas this PR reverts it to its 4844 status so that no 4844 code is affected. Is there a way to check if there are performance issues before merging?
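For reference, pointing c-kzg back at a git branch would be a one-line Cargo.toml change along these lines (repo URL and branch name are assumptions, not verified against this PR):

```toml
# Hypothetical: pinning c-kzg to the das development branch instead of a
# released 4844 version.
[dependencies]
c-kzg = { git = "https://github.com/ethereum/c-kzg-4844", branch = "das" }
```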

Re bugs: the cryptography libraries have their own set of test vectors, so I'd be surprised if a bug is encountered while the test vectors still pass, unless the test vectors are lacking (which is also possible).

kevaundray avatar Jun 26 '24 10:06 kevaundray

I'd love to see some local devnet testing before we merge this in (I'm using the config from https://github.com/sigp/lighthouse/pull/5968)

I ran the local devnet script in the PR you linked. I looked at the logs and saw no errors, except for some sporadic handshake failures (I'm not familiar with the p2p layer, so I'm not sure if this is normal, though I don't think it's related to this PR).

Also see: [el-1-geth-lighthouse] ERROR[...] Nil finalized block cannot evict old blobs

Were you looking for something in particular wrt testing it locally?

kevaundray avatar Jun 26 '24 10:06 kevaundray

Although we don't use the proofs that get passed to recover_cells_and_kzg_proofs, we need to pass them to the library as a parameter, and it performs redundant checks on them (this should eventually be removed; the consensus-specs PR for it is https://github.com/ethereum/consensus-specs/pull/3819)

kevaundray avatar Jun 26 '24 11:06 kevaundray

It's unclear to me why this is passing on the DAS branch, since we also don't pass proofs in there, and it's using the old interface where you recover_cells, then convert cells to a blob, then compute proofs.

In order to make it work, we would need reconstruct to also pass along the proof for each cell to recover_cells_and_proofs.

EDIT: It passes on the DAS branch because it is not using the most up-to-date version of c-kzg, which has all of the extra checks on the input proofs.

kevaundray avatar Jun 26 '24 12:06 kevaundray

I've pulled out the API change for recover_cells_and_proofs to #5998

kevaundray avatar Jun 26 '24 13:06 kevaundray

@jimmygchen I've run the ef-tests and make lint locally -- I'll leave local devnet testing and the other blockers in your hands

kevaundray avatar Jun 27 '24 10:06 kevaundray

Sure, I'll do the testing, thanks a lot for this :relaxed:

Sorry about the delay in responding! And to answer your questions:

Is there a way to check if there are performance issues before merging?

Yep, we do have some kzg metrics exposed via the {host}:{metrics_port}/metrics endpoint, and it may be useful to compare them:

> curl -s $(kurtosis port print local-testnet cl-2-lighthouse-geth metrics)/metrics | grep -E "HELP.+column"
# HELP beacon_blobs_column_sidecar_gossip_verification_seconds Full runtime of data column sidecars gossip verification
# HELP beacon_blobs_column_sidecar_processing_requests_total Count of all data column sidecars submitted for processing
# HELP beacon_blobs_column_sidecar_processing_successes_total Number of data column sidecars verified for gossip
# HELP beacon_data_column_gossip_propagation_verification_delay_time Duration between when the data column sidecar is received over gossip and when it is verified for propagation.
# HELP beacon_data_column_gossip_slot_start_delay_time Duration between when the data column sidecar is received over gossip and the start of the slot it belongs to.
# HELP beacon_data_column_last_delay Keeps track of the last data column sidecar's delay from the start of the slot
# HELP beacon_processor_gossip_data_column_queue_total Count of data column sidecars from gossip waiting to be verified.
# HELP beacon_processor_gossip_data_column_verified_total Total number of gossip data column sidecar verified for propagation.
# HELP beacon_processor_rpc_custody_column_queue_total Count of custody columns from the rpc waiting to be imported.
# HELP beacon_processor_rpc_verify_data_column_queue_total Count of data columns from the rpc waiting to be verified.
# HELP data_availability_reconstructed_columns_total Total count of reconstructed columns
# HELP data_availability_reconstruction_time_seconds Time taken to reconstruct columns
# HELP data_column_sidecar_computation_seconds Time taken to compute data column sidecar, including cells, proofs and inclusion proof
# HELP data_column_sidecar_inclusion_proof_verification_seconds Time taken to verify data_column sidecar inclusion proof
# HELP kzg_verification_data_column_single_seconds Runtime of single data column kzg verification

Were you looking for something in particular wrt testing it locally?

I think the main thing to look for when testing peerdas is making sure the network continues to produce blocks and nodes stay in sync with each other - this means columns are being produced and propagated, and it likely works end-to-end. And if we want to make sure it interops with c-kzg, it might be worth running the network-params-das-interop.yaml one (with Prysm and Teku).

jimmygchen avatar Jun 28 '24 12:06 jimmygchen

No worries! I can see you have your hands full :)

I'm working on integrations for a few other repos, so I'll probably have to circle back to collecting information next week on local devnet. This information is useful in general so if you get to it before me, feel free to post it here!

kevaundray avatar Jun 28 '24 13:06 kevaundray

Hi @kevaundray,

Apologies on the delay getting to this. I was out the last couple of days.

I did a quick round of testing locally and it seems to be fine with a Lighthouse only network, but when I ran it with Prysm and Teku, the network forked pretty quickly.

I'll look into it again tomorrow, maybe I'm missing something in my network config.

jimmygchen avatar Jul 08 '24 08:07 jimmygchen

No worries :)

I'm guessing the DAS branch without this change is okay?

kevaundray avatar Jul 08 '24 08:07 kevaundray

I’m not entirely sure tbh, I didn’t get much time on this today. I didn’t see any KZG errors, so I suspect it may be related to mismatched das network defaults for different clients, and we may have to override them in the Kurtosis config. I’ll try again tomorrow.

jimmygchen avatar Jul 08 '24 09:07 jimmygchen

I've tested again, looking pretty good after I added network config overrides, so I guess clients currently have different default values:

participants:
  - cl_type: prysm
    cl_image: ethpandaops/prysm-beacon-chain:peerDAS

  - cl_type: lighthouse
    cl_extra_params: [
      --subscribe-all-data-column-subnets,
    ]
    cl_image: lighthouse:local

  - cl_type: lighthouse
    cl_image: lighthouse:local

  - cl_type: teku
    cl_image: ethpandaops/teku:nashatyrev-das

#  - cl_type: nimbus
#    cl_image: ethpandaops/nimbus-eth2:kzgpeerdas
#
#  - cl_type: grandine
#    cl_image: ethpandaops/grandine:das
#
#   - cl_type: lodestar
#     cl_image: ethpandaops/lodestar:peerDAS
network_params:
  eip7594_fork_epoch: 0
  eip7594_fork_version: "0x50000038"
  data_column_sidecar_subnet_count: 64
  samples_per_slot: 16
  custody_requirement: 4
  target_number_of_peers: 70
snooper_enabled: false
global_log_level: debug
additional_services:
  - dora
  - goomy_blob

I'll post a metrics comparison soon

jimmygchen avatar Jul 09 '24 06:07 jimmygchen

Metrics from running this branch look pretty good and are comparable to the das branch.

This branch (over 1 h): [image]

das branch (over 5 m): [image]

jimmygchen avatar Jul 09 '24 08:07 jimmygchen

@dapplion I've tested the branch and it works pretty well!

I've attached some metrics in the above comment. I've run out of time to review the new changes, but feel free to merge if you're happy with it; otherwise I'll re-review when I have a chance.

Thanks a lot for this @kevaundray ! 🙏

jimmygchen avatar Jul 09 '24 08:07 jimmygchen