single_block_lookups leak

Open • dapplion opened this issue 1 year ago • 1 comment

Description

The sync_single_block_lookups metric on our nodes shows 100k to 150k active lookups. The metric is implemented correctly, so this is a leak. Each lookup is small (a few hundred bytes), so the leak is slow and its overall memory impact is small.

A possible explanation is:

  • Create a new lookup for block A
  • Block A is already in the da_checker
  • The lookup skips sending a block request because the block is already in the da_checker
  • No new event is ever received for the lookup, so it is never removed
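The sequence above can be sketched with a toy model. This is only an illustration under stated assumptions: `ToySync`, `Lookup`, `create_lookup`, `on_event`, and `stuck_lookups` are hypothetical names invented here, not Lighthouse's actual types or functions, and the sketch assumes a lookup is removed only when a response or error event arrives for it.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-ins for illustration; not Lighthouse's real types.
type BlockRoot = u64;

struct Lookup {
    /// Whether a block request was sent to the network for this lookup.
    request_sent: bool,
}

#[derive(Default)]
struct ToySync {
    /// Active single-block lookups, keyed by block root.
    lookups: HashMap<BlockRoot, Lookup>,
    /// Block roots already held by the data-availability checker.
    da_checker: HashSet<BlockRoot>,
}

impl ToySync {
    /// Creates a lookup, skipping the request if the block is already cached.
    fn create_lookup(&mut self, root: BlockRoot) {
        let request_sent = !self.da_checker.contains(&root);
        self.lookups.insert(root, Lookup { request_sent });
    }

    /// Assumption: a lookup is removed only when an event arrives for it.
    fn on_event(&mut self, root: BlockRoot) {
        self.lookups.remove(&root);
    }

    /// Lookups with no request in flight will never receive an event.
    fn stuck_lookups(&self) -> usize {
        self.lookups.values().filter(|l| !l.request_sent).count()
    }
}
```

In this model, a lookup created for a root that is already in `da_checker` has no request in flight, so `on_event` is never called for it and it stays in `lookups` indefinitely.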

Version

stable

Steps to resolve

Fixed with

  • https://github.com/sigp/lighthouse/pull/5583
  • https://github.com/sigp/lighthouse/pull/5681

dapplion, May 02 '24 16:05

It seems the leak is at least partly due to https://github.com/sigp/lighthouse/pull/5680#issuecomment-2092577327

If RPCError::Disconnect does not propagate up to sync, lookups with awaiting_parent.is_some() may never get resolved, which means they are never removed from the lookups map.

I did some testing with propagating the disconnects to sync. With this change, lookups get removed and the sync_single_block_lookups metric returns to 0 once the node is synced. Without propagating the disconnects (as currently happens in cut-5.2.0), the number of lookups grows consistently in local testing.
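A minimal sketch of why propagating disconnects matters, again under assumptions: `ToyLookups`, `add`, `drop_with_children`, and `on_peer_disconnected` are hypothetical names, and the sketch simply drops a failed lookup together with every lookup awaiting it as a parent. The real sync code may retry with other peers instead.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for illustration; not Lighthouse's real types.
type BlockRoot = u64;
type PeerId = u32;

struct Lookup {
    /// Peer the block request was sent to.
    peer: PeerId,
    /// Set while this lookup waits for its parent lookup to resolve.
    awaiting_parent: Option<BlockRoot>,
}

#[derive(Default)]
struct ToyLookups {
    lookups: HashMap<BlockRoot, Lookup>,
}

impl ToyLookups {
    fn add(&mut self, root: BlockRoot, peer: PeerId, awaiting_parent: Option<BlockRoot>) {
        self.lookups.insert(root, Lookup { peer, awaiting_parent });
    }

    /// Drops `root` and, transitively, every lookup awaiting it as a parent.
    fn drop_with_children(&mut self, root: BlockRoot) {
        let mut pending = vec![root];
        while let Some(r) = pending.pop() {
            self.lookups.remove(&r);
            pending.extend(
                self.lookups
                    .iter()
                    .filter(|(_, l)| l.awaiting_parent == Some(r))
                    .map(|(k, _)| *k),
            );
        }
    }

    /// Called only if the disconnect is propagated up to sync.
    fn on_peer_disconnected(&mut self, peer: PeerId) {
        let failed: Vec<BlockRoot> = self
            .lookups
            .iter()
            .filter(|(_, l)| l.peer == peer)
            .map(|(k, _)| *k)
            .collect();
        for root in failed {
            self.drop_with_children(root);
        }
    }
}
```

If `on_peer_disconnected` is never called, a parent lookup whose peer went away stays pending, and so does the whole chain of lookups whose `awaiting_parent` points at it. Calling it clears that chain while leaving unrelated lookups untouched.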

pawanjay176, May 05 '24 01:05