Add cache for selection proof signatures
Issue Addressed
#3216
Proposed Changes
Add a cache to each VC for selection proofs that have recently been computed. This ensures that duplicate selection proofs are processed quickly.
Additional Info
The maximum cache size is 64 signatures; once the cache is full, the oldest signature (or rather, the signature referencing the oldest message) is removed to make room for the next one.
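For illustration, here is a minimal sketch of roughly the shape of such a cache, assuming hypothetical SigningRoot and Signature types and a simple evict-the-oldest-slot policy (a sketch only, not the PR's actual code):

// Sketch only, not the PR's actual implementation: a bounded cache of
// recently computed selection proofs. `SigningRoot` and `Signature` are
// hypothetical stand-in types; eviction removes the entry whose message
// has the lowest slot, i.e. the oldest message.
use std::collections::HashMap;

const MAX_CACHE_SIZE: usize = 64;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct SigningRoot([u8; 32]);

#[derive(Clone)]
struct Signature([u8; 96]);

struct SelectionProofCache {
    // Signing root -> (slot of the signed message, signature).
    proofs: HashMap<SigningRoot, (u64, Signature)>,
}

impl SelectionProofCache {
    fn new() -> Self {
        Self { proofs: HashMap::new() }
    }

    // Return a cached signature if this selection proof was recently computed.
    fn get(&self, root: &SigningRoot) -> Option<Signature> {
        self.proofs.get(root).map(|(_, sig)| sig.clone())
    }

    // Insert a freshly computed signature, evicting the entry referencing the
    // oldest message once the cache is at capacity.
    fn insert(&mut self, root: SigningRoot, slot: u64, sig: Signature) {
        if self.proofs.len() >= MAX_CACHE_SIZE && !self.proofs.contains_key(&root) {
            if let Some(oldest) = self
                .proofs
                .iter()
                .min_by_key(|(_, (entry_slot, _))| *entry_slot)
                .map(|(entry_root, _)| *entry_root)
            {
                self.proofs.remove(&oldest);
            }
        }
        self.proofs.insert(root, (slot, sig));
    }
}

The real implementation may key and evict differently; the point is just that a lookup by signing root lets duplicate requests be served without re-signing.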
WIP: Still need to remove the pre-computation of selection proofs if it is no longer desired.
Thinking about it some more, I think it may be useful to keep the pre-computed sync selection proofs, because they also have the advantage of being computed in the background, off the hot path of trying to sign and publish a selection proof.
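For context, a hedged sketch of what keeping that background pre-computation could look like, reusing the hypothetical cache sketch above; the task structure and the compute_selection_proof stub are illustrative assumptions, not Lighthouse's actual API:

// Illustrative only: a background task that pre-computes sync selection
// proofs ahead of time so the hot sign-and-publish path can be served from
// the cache. `compute_selection_proof` is a hypothetical stand-in for the
// real signing call (local keystore or a remote signer such as web3signer).
use std::sync::Arc;
use tokio::sync::Mutex;

async fn compute_selection_proof(_root: &SigningRoot) -> Signature {
    // Stand-in for the real (potentially slow) signing operation.
    Signature([0u8; 96])
}

async fn precompute_selection_proofs(
    cache: Arc<Mutex<SelectionProofCache>>,
    upcoming: Vec<(SigningRoot, u64)>,
) {
    for (root, slot) in upcoming {
        // Skip anything that has already been computed and cached.
        if cache.lock().await.get(&root).is_some() {
            continue;
        }
        let sig = compute_selection_proof(&root).await;
        cache.lock().await.insert(root, slot, sig);
    }
}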
Good work guys, this feature gives a massive improvement in memory footprint compared to v2.2.1.
@jmcruz1983 what about the total number of signing requests?
We're curious whether there are fewer signing requests with this PR, and whether the increases you were seeing previously were evenly spaced or due to big spikes.
@michaelsproul
the number of signing requests didn't change much; we are seeing similar numbers.
Before screenshot:
and after screenshot:

Everything is working, with some decrease in Web3SignerRequestFailed errors.
Thanks
Thanks for the info @jmcruz1983!
How many validators are on the machine from which that graph was drawn?
You mentioned previously that Lighthouse was doing a lot more signing than Teku, would you be able to send a similar graph for Teku? Or let us know Teku's average number of signatures per validator per epoch so we can compare.
Thanks! :pray:
@michaelsproul The chart combines signing ops from the Teku & Lighthouse validators. In this chart, from our test environment, we have around 4k active keys.
I managed to isolate the Lighthouse signing requests, and the difference in the number of requests before & after is not significant:
Here we are talking about 2 validator clients (each with around 800 active keys).
Any idea when this PR will be released? Thanks
@jmcruz1983 has the memory usage remained lower even after running this PR for a few days? We haven't been able to reproduce that on our nodes and don't know why that would occur
@michaelsproul
yes, memory has stayed lower since the upgrade

Any update here?
@jmcruz1983 Would you mind trying the VC from v2.3.1 on your test system? I'm wondering if maybe the memory reduction was due to upgrading from v2.2.1
Hi @michaelsproul
I did the v2.3.1 upgrade and memory increased back to the previous level.
So from my metrics I can observe that the memory reduction is due to the VC from this PR.

@jmcruz1983 Thanks again!
I'd like to understand if there are any other differences between the binaries besides the version. My guess is that the memory difference might be due to the optimisation level used to compile our core crypto library (blst).
Are you using the Docker images from DockerHub for v2.3.1? Do you use the v2.3.1-modern tag, or just v2.3.1?
When you compile the Docker image yourself for this branch, I guess you're using a recently-released CPU? In which case you'll get the equivalent of v2.3.1-modern.
As a test, would you mind trying v2.3.1-modern on your infra?
Hi @michaelsproul
yes, I am using the images from DockerHub with the modern flavour,
so the VC from this PR still had a lower memory footprint than v2.3.1-modern.
It's a real mystery! Do you mind trying the unoptimised v2.3.1 tag instead then?
(edit: cc @jmcruz1983)
@michaelsproul
With unoptimised v2.3.1 the memory footprint is the same:

@jmcruz1983 Thanks! Can you please try building an image manually from the stable branch, following the same process you used to build this PR?
@michaelsproul
I upgraded the validator with the custom build from the stable branch and memory consumption did decrease:

FYI, this is the Dockerfile I use for the custom builds:
FROM library/rust:1.58.1-bullseye AS builder
RUN apt-get update && apt-get -y upgrade && apt-get install -y cmake libclang-dev
# Fetch the specific branch from https://github.com/sigp/lighthouse/pull/3223
# RUN git clone --quiet --depth 1 --branch vc-sig-cache https://github.com/macladson/lighthouse.git
# Fetch the main stable branch
RUN git clone --quiet --depth 1 --branch stable https://github.com/sigp/lighthouse.git
ARG FEATURES
ENV FEATURES $FEATURES
RUN cd lighthouse && make
FROM ubuntu:latest
RUN apt-get clean \
&& rm -rf /var/lib/apt/lists/*
COPY --from=builder /usr/local/cargo/bin/lighthouse /usr/local/bin/lighthouse
# Adding non-root user
RUN adduser --disabled-password --gecos "" --home /opt/lighthouse lighthouse
RUN chown lighthouse:lighthouse /opt/lighthouse
# USER lighthouse
WORKDIR /opt/lighthouse
Hi @jmcruz1983, I've opened a PR to help us debug the issue here: https://github.com/sigp/lighthouse/pull/3279. Once that lands in unstable we'll be able to compare the SigP-compiled :latest-unstable image against a manual build of unstable, by observing the malloc metrics. We have a handy Grafana dashboard for these metrics here: https://github.com/sigp/lighthouse-metrics/blob/master/dashboards/MallocInfoGNU.json
@jmcruz1983 the metrics have landed in unstable if you'd like to test manually-compiled unstable vs :latest-unstable :pray:
@michaelsproul
I just set up the dashboard and I am now running latest-unstable,
so which specifics in the dashboard do you want me to check?
@jmcruz1983 If my suspicions are correct, I expect a large fraction of your VC's memory usage will be freed memory that hasn't been reclaimed, which should show up under Free block memory. This is one of our Prater nodes with 5K validators:

The other main ones to keep an eye on are Mem-mapped bytes (which I think should be low) and Non-mmap memory (which should be the bulk of useful memory). We can compare these 3 counts (mmap, non-mmap, free) to the total as reported by the dash, or to htop's resident set size (which might be slightly larger).
@michaelsproul
I ran latest-unstable, and after that I upgraded to a custom build from the unstable branch. I captured some screenshots from the Malloc Info dashboard:
- The blue line belongs to the latest-unstable build pulled from Docker
- The orange line belongs to the custom build from the unstable branch
You can observe that the custom build shows a better memory footprint.

Thanks @jmcruz1983. We'll do some more investigations on our end to see what could be using so much memory when the VC uses web3signer.
As requested by @paulhauner, I took a snapshot of the vc_signature_cache_hits metric.
I was able to reproduce the memory issues locally and have written up an issue for it at #3302. Thanks for your help in discovering this @jmcruz1983 :pray:
Going to close this for now; it seems like we get little to no benefit from this particular implementation. If these issues come up again, we could try investigating different avenues.