lighthouse
single_block_lookups leak
Description
The sync_single_block_lookups metric on our nodes shows 100k to 150k active lookups. The metric itself is implemented correctly, so this is a leak. Each lookup is small (a few hundred bytes), so the leak grows slowly and its overall memory impact is small.
A possible explanation is:
- Create a new lookup for block A
- Block A is already in the da_checker
- The lookup skips sending a block request because the block is already in the da_checker
- No event for the lookup is ever received, so it is never removed
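The suspected sequence above can be reproduced in a toy model. This is only a sketch of the hypothesis, not Lighthouse's actual sync code: ToySync, create_lookup, on_response and the u64 block roots are all hypothetical names made up for illustration, and it assumes a lookup is only ever removed when a response for its own request arrives.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-ins; the real types in Lighthouse differ.
type BlockRoot = u64;
type LookupId = u32;

/// Toy model of the lookup map described in this issue.
#[derive(Default)]
struct ToySync {
    next_id: LookupId,
    /// Active lookups; its length plays the role of sync_single_block_lookups.
    lookups: HashMap<LookupId, BlockRoot>,
    /// Block roots the (toy) da_checker already holds.
    da_checker: HashSet<BlockRoot>,
    /// Lookups that actually sent a block request.
    in_flight: HashSet<LookupId>,
}

impl ToySync {
    /// Create a lookup; skip the request if the block is already in the da_checker.
    fn create_lookup(&mut self, root: BlockRoot) -> LookupId {
        let id = self.next_id;
        self.next_id += 1;
        self.lookups.insert(id, root);
        if !self.da_checker.contains(&root) {
            self.in_flight.insert(id);
        }
        id
    }

    /// Assumption: a response is the only event that removes a lookup.
    fn on_response(&mut self, id: LookupId) {
        if self.in_flight.remove(&id) {
            self.lookups.remove(&id);
        }
    }

    fn active_lookups(&self) -> usize {
        self.lookups.len()
    }
}
```

In this model, a lookup created for a root that is already in da_checker never enters in_flight, so no later event can remove it and active_lookups only grows.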
Version
stable
Steps to resolve
Fixed with
- https://github.com/sigp/lighthouse/pull/5583
- https://github.com/sigp/lighthouse/pull/5681
It seems the leak is at least partly due to https://github.com/sigp/lighthouse/pull/5680#issuecomment-2092577327
RPCError::Disconnect not propagating up to sync can leave lookups with awaiting_parent.is_some() unresolved, which means they are never removed from the lookups map.
I did some testing with propagating the disconnects to sync. With that change, lookups appear to get removed and the sync_single_block_lookups metric returns to 0 once the node is synced.
Without propagating the disconnects (as is currently the case in cut-5.2.0), the number of lookups increases consistently in local testing.
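A minimal sketch of why the disconnect event matters, under loud assumptions: ToyLookups, ToyLookup, on_peer_disconnect and drop_with_children are hypothetical names, peers and roots are plain integers, and a failed parent lookup is simply dropped together with its children (the real code may retry with another peer instead).

```rust
use std::collections::HashMap;

// Hypothetical stand-ins; not Lighthouse's real types.
type BlockRoot = u64;
type PeerId = u32;

struct ToyLookup {
    /// Peer currently serving this lookup's request, if any.
    peer: Option<PeerId>,
    /// Parent root this lookup is blocked on, if any.
    awaiting_parent: Option<BlockRoot>,
}

#[derive(Default)]
struct ToyLookups {
    lookups: HashMap<BlockRoot, ToyLookup>,
}

impl ToyLookups {
    fn add(&mut self, root: BlockRoot, peer: Option<PeerId>, awaiting_parent: Option<BlockRoot>) {
        self.lookups.insert(root, ToyLookup { peer, awaiting_parent });
    }

    /// Runs only if the network layer propagates the disconnect up to sync.
    fn on_peer_disconnect(&mut self, peer: PeerId) {
        let failed: Vec<BlockRoot> = self
            .lookups
            .iter()
            .filter(|(_, l)| l.peer == Some(peer))
            .map(|(r, _)| *r)
            .collect();
        for root in failed {
            self.drop_with_children(root);
        }
    }

    /// Remove a lookup and, recursively, every lookup awaiting it as parent.
    fn drop_with_children(&mut self, root: BlockRoot) {
        self.lookups.remove(&root);
        let children: Vec<BlockRoot> = self
            .lookups
            .iter()
            .filter(|(_, l)| l.awaiting_parent == Some(root))
            .map(|(r, _)| *r)
            .collect();
        for child in children {
            self.drop_with_children(child);
        }
    }

    fn len(&self) -> usize {
        self.lookups.len()
    }
}
```

If on_peer_disconnect is never called, a child with awaiting_parent set stays in the map indefinitely once its parent's peer goes away, which matches the growth seen when disconnects are not propagated.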