lighthouse Deterministic subnet peer discovery

Issue Addressed

https://github.com/sigp/lighthouse/issues/3648

Proposed Changes

Added a CLI param --prefix-search-for-subnet to enable prefix search.
- Running Lighthouse without the parameter behaves as before, defaulting to searching on a random id.
Added PrefixMapping that stores mappings of SubnetId to NodeIds.
- Discovery calls PrefixMapping::get_target_nodes to get target NodeIds when run prefix searching.

Additional Info

This PR is based on #4959, which is Diva's branch, so I think we need to merge #4959 first.

I've measured the improvement in (attestation subnet) peer discovery brought about by this PR, based on the sent/received bytes and the number of nodes found during the discovery process.

# Random search
$ cargo run --bin lighthouse -- -l bn --enr-address {public ip address}
# Prefix search
$ cargo run --bin lighthouse -- -l bn --enr-address {public ip address} --prefix-search-for-subnet

I've run lighthouse five times each for Random search and Prefix search, observing the sent/received bytes for discovery and the number of nodes found. Based on the results below, the discovery process with Prefix search has improved by approximately 3x compared to Random search.

Note for self: Here are the changes used for the measurement.

Jan 19 '24 05:01 ackintosh

Note: some of the conflicts come from the base branch https://github.com/sigp/lighthouse/pull/4959.

Jan 19 '24 05:01 ackintosh

Hey @ackintosh - Nice work. We've merged the base branch now, but I have some questions about your results.

I don't fully understand the measurements you are making. Could you explain a bit further how the tests are done.

Are you doing discovery queries? Is that what the # number represents? Each query should only return at most 16 peers, but it looks like in the prefix search its returning more?

Are runs between the prefix and non-prefix actually separate. When lighthouse shuts down, it retains the peers it found in its db. So if you start a fresh lighthouse node and run a bunch of searchers, then restart it and run the same bunch, the second run will be faster because it has already seeded it's local routing table.

I'm asking this because I expect that the prefix-search will actually have no benefit at the moment. We should only see improvements as nodes start to use the new deterministic subnet scheme. Because pretty much all nodes on the network haven't got this yet, the prefix search should be equivalent to a normal search.

Jan 31 '24 09:01 AgeManning

Are runs between the prefix and non-prefix actually separate. When lighthouse shuts down, it retains the peers it found in its db. So if you start a fresh lighthouse node and run a bunch of searchers, then restart it and run the same bunch, the second run will be faster because it has already seeded it's local routing table.

Oh, I didn't know that the peers are retained in the db upon shutdown. The first measurement I conducted might be incorrect, as I didn't clear the db before running the tests.

I've updated the measurement procedure and conducted a second measurement.

Preparation

Create a branch for the measurement based on this PR. The changes in this branch include:

Run queries only for Subnet::Attestation (skipping queries for Subnet::SyncCommittee and QueryType::FindPeers). This aims to obtain accurate metrics for sent/recv bytes specifically for the Subnet::SyncCommittee query.
Logging sent/recv bytes and the number of found nodes.
- Here, "found nodes" refers to nodes subscribing to the specific subnet we queried.

Measurement procedure

I followed the procedure below, performing it 5 times each for both normal search and prefix search:

Clear datadir to start a fresh lighthouse node on each run.
Start lighthouse with (or without) the --prefix-search-for-subnet option.
Stop lighthouse after 60 seconds.
Check the metrics logs added in the Preparation section.

Result

The result shows that subnet peer discovery has improved approximately 7x with Prefix search (see efficiency in the result table).

Note:

#: The test was run 5 times each for both normal search and prefix search, so # ranges from 1 to 5.
bytes_sent / bytes_recv: The bytes sent and received by queries during the 60-second lighthouse run.
total found nodes: The number of nodes subscribing to the specific subnet and found by the queries during the lighthouse run.
efficiency: Calculated as total bytes / found nodes. This is used to compare the two search methods. A lower number indicates better efficiency, meaning lower bandwith consumption to search for nodes.

Why the result indicates better efficiency at this time

I'm asking this because I expect that the prefix-search will actually have no benefit at the moment. We should only see improvements as nodes start to use the new deterministic subnet scheme. Because pretty much all nodes on the network haven't got this yet, the prefix search should be equivalent to a normal search.

The subnet-id computed by the new deterministic subnet scheme is relatively close to one computed by the current subnet scheme. We implemented the new subnet scheme in #4959, where expected_subnets in the unit test was partially updated. I believe this is why we see improvements with the prefix search at this time.

If anything is unclear, please let me know. 🙏

Apr 09 '24 14:04 ackintosh

Hey @ackintosh, nice work.

I still can't understand why we are seeing any improvements at all. If anything I think this indicates a bug in our new code or something wrong with the testing methodology.

Isn't the only difference that we are no longer searching for a random node-id, but we prefix-it with some bits. In either case the node's are uniformly distributed and searching via a random node-id or prefixing a few bits should yield the exact same results.

It doesn't make sense to me why there is such a difference between the two?

Apr 15 '24 00:04 AgeManning

@AgeManning As you mentioned, nodes are using the current subnet scheme (not the new scheme) as defined in the spec. I did another test to see how the computed subnet_ids from the current and the new scheme match, with 10,000 random node_id and epoch pairs.

Approximately 67% of the subnet_ids matched. (Details of the test are below.)

If the computed subnet_ids were completely different, we would likely see no improvements. However, about 67% of computed subnet_ids were the same. I think this is a reason why we are seeing the improvements by prefix search with the new subnet scheme.

If there is any misunderstanding on my part, I would appreciate it if you could point it out. 🙇

Here is the test code used for the test. Most of the code is copied from the consensus-specs repository.

# ################
# p2p-interface.md
# ################

# Current scheme, copied from current `p2p-interface.md`.
def compute_subscribed_subnet_by_current_schema(node_id: NodeID, epoch: Epoch, index: int) -> SubnetID:
    node_id_prefix = node_id >> (NODE_ID_BITS - ATTESTATION_SUBNET_PREFIX_BITS)
    node_offset = node_id % EPOCHS_PER_SUBNET_SUBSCRIPTION
    permutation_seed = hash(uint_to_bytes(uint64((epoch + node_offset) // EPOCHS_PER_SUBNET_SUBSCRIPTION)))
    permutated_prefix = compute_shuffled_index(
        node_id_prefix,
        1 << ATTESTATION_SUBNET_PREFIX_BITS,
        permutation_seed,
    )
    return SubnetID((permutated_prefix + index) % ATTESTATION_SUBNET_COUNT)

# New scheme, copied from https://github.com/ethereum/consensus-specs/pull/3545
def compute_subscribed_subnet_by_new_schema(node_id: NodeID, epoch: Epoch, index: int) -> SubnetID:
    subnet_prefix = node_id >> (NODE_ID_BITS - ATTESTATION_SUBNET_PREFIX_BITS)
    shuffling_bit_size = (
        NODE_ID_BITS
        - ATTESTATION_SUBNET_PREFIX_BITS
        - ATTESTATION_SUBNET_SHUFFLING_PREFIX_BITS
    )
    shuffling_bits = (node_id >> shuffling_bit_size) % (1 << ATTESTATION_SUBNET_SHUFFLING_PREFIX_BITS)
    shuffling_multiplier = EPOCHS_PER_SUBNET_SUBSCRIPTION >> ATTESTATION_SUBNET_SHUFFLING_PREFIX_BITS
    epoch_transition = (
        (subnet_prefix + (shuffling_bits * shuffling_multiplier)) % EPOCHS_PER_SUBNET_SUBSCRIPTION
    )
    permutation_seed = hash(uint_to_bytes(uint64((epoch + epoch_transition) // EPOCHS_PER_SUBNET_SUBSCRIPTION)))
    permutated_prefix = compute_shuffled_index(
        subnet_prefix,
        1 << ATTESTATION_SUBNET_PREFIX_BITS,
        permutation_seed,
    )
    return SubnetID((permutated_prefix + index) % ATTESTATION_SUBNET_COUNT)

# ################
# test_validator_unittest.py
# ################

@with_phases([PHASE0])
@spec_test
@single_phase
def test_compute_subscribed_subnets_similarity(spec):
    for i in range(10000):
        rng = random.Random(i)
        res = check_if_computed_subnets_are_equal_between_current_and_new(spec, rng)
        print('res', res)

def check_if_computed_subnets_are_equal_between_current_and_new(spec, rng):
    node_id = rng.randint(0, 2**256 - 1)
    epoch = rng.randint(0, 2**64 - 1)

    subnets_current = spec.compute_subscribed_subnets_by_current_schema(node_id, epoch)
    subnets_new = spec.compute_subscribed_subnets_by_new_schema(node_id, epoch)

    return subnets_current == subnets_new

I ran the test in the consensus-specs repo as shown below:

cd consensus-specs/tests/core/pyspec

python \
  -s \
  -m pytest \
  --preset=mainnet \
  -k test_compute_subscribed_subnets_similarity \
  eth2spec/test/phase0/unittests/validator/test_validator_unittest.py

Apr 18 '24 13:04 ackintosh