Audit prometheus histogram buckets

Open michaelsproul opened this issue 3 years ago • 1 comments

Description

Presently our metrics always use the default buckets for histograms, which work OK for times of a few milliseconds to a second, but not so well for shorter/longer times, or large integer values.

I propose we:

  • Introduce a new primitive in the lighthouse_metrics crate for defining histograms together with their buckets. The underlying machinery already exists in the prometheus crate; we just aren't using it: https://docs.rs/prometheus/0.13.1/prometheus/struct.HistogramOpts.html
  • Use this new primitive to overhaul existing histograms for which the buckets are poorly sized.
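
As a rough illustration of the proposal above, here is a minimal std-only sketch of what "a histogram defined together with its buckets" could look like. `HistogramSpec` and `exponential_bounds` are hypothetical names, not existing items in lighthouse_metrics; the resulting `Vec<f64>` is the kind of value that `HistogramOpts::buckets` in the prometheus crate accepts.

```rust
/// Hypothetical primitive: a histogram's name, help text and bucket
/// upper bounds kept together in one definition.
#[derive(Debug, Clone, PartialEq)]
pub struct HistogramSpec {
    pub name: &'static str,
    pub help: &'static str,
    pub buckets: Vec<f64>,
}

impl HistogramSpec {
    /// Returns `None` unless the bounds are non-empty, finite and
    /// strictly increasing.
    pub fn new(name: &'static str, help: &'static str, buckets: Vec<f64>) -> Option<Self> {
        let finite = buckets.iter().all(|b| b.is_finite());
        let increasing = buckets.windows(2).all(|w| w[0] < w[1]);
        if buckets.is_empty() || !finite || !increasing {
            return None;
        }
        Some(Self { name, help, buckets })
    }
}

/// `count` upper bounds starting at `start`, each `factor` times the
/// previous one. Suited to values spanning several orders of magnitude,
/// such as sizes in bytes.
pub fn exponential_bounds(start: f64, factor: f64, count: usize) -> Vec<f64> {
    let mut bounds = Vec::with_capacity(count);
    let mut next = start;
    for _ in 0..count {
        bounds.push(next);
        next *= factor;
    }
    bounds
}
```

Keeping the bounds next to the metric name makes a poorly sized histogram easy to spot in review, and rejecting unsorted bounds up front avoids a failure at registration time.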

Metrics in need of change

  • beacon_block_total_size (thanks @dapplion for flagging this)

Version

Lighthouse v2.3.1

michaelsproul avatar Jun 25 '22 12:06 michaelsproul

Buckets of whole numbers; the lowest bucket should be 1:

  • beacon_operations_per_block_attestation_total_bucket
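
One possible way to build whole-number buckets whose lowest bound is 1 is to double from 1 up to a chosen maximum. This is only a sketch: `whole_number_bounds` is a hypothetical helper, and the doubling scheme is an assumption rather than something decided in this issue.

```rust
/// Whole-number upper bounds 1, 2, 4, ... that never exceed `max`,
/// with `max` itself appended if the doubling sequence skips it.
/// Returns an empty vector when `max` is 0.
pub fn whole_number_bounds(max: u64) -> Vec<f64> {
    let mut bounds = Vec::new();
    let mut next: u64 = 1;
    while next <= max {
        bounds.push(next as f64);
        // Stop cleanly instead of overflowing for very large `max`.
        match next.checked_mul(2) {
            Some(doubled) => next = doubled,
            None => break,
        }
    }
    if let Some(&last) = bounds.last() {
        if last < max as f64 {
            bounds.push(max as f64);
        }
    }
    bounds
}
```

For a count that never exceeds a small limit, a linear sequence 1, 2, 3, ... would work just as well; doubling simply keeps the number of buckets low.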

Misc:

  • beacon_block_total_size_bucket: the buckets should cover the range around the average block size

These buckets could be proportional to slot time, and extend beyond 1x SECONDS_PER_SLOT to capture really bad network conditions:

  • beacon_block_gossip_propagation_verification_delay_time
  • beacon_block_gossip_slot_start_delay_time
  • beacon_block_head_imported_delay_time
  • beacon_block_head_slot_start_delay_time
  • beacon_block_imported_observed_delay_time
  • beacon_block_observed_slot_start_delay_time
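
A sketch of how slot-proportional buckets for the delay metrics above might be derived. The fractions in `DEFAULT_SLOT_FRACTIONS` and the helper name `slot_delay_bounds` are hypothetical; the only idea taken from the comment is that the bounds scale with the slot time and extend past one full slot.

```rust
/// Hypothetical fractions of a slot used as bucket upper bounds.
/// The values above 1.0 capture delays longer than a whole slot.
pub const DEFAULT_SLOT_FRACTIONS: [f64; 8] = [0.125, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 4.0];

/// Scales each fraction by the slot duration to get bounds in seconds.
pub fn slot_delay_bounds(seconds_per_slot: f64, fractions: &[f64]) -> Vec<f64> {
    fractions.iter().map(|f| f * seconds_per_slot).collect()
}
```

With a 12 second slot, `slot_delay_bounds(12.0, &DEFAULT_SLOT_FRACTIONS)` yields bounds from 1.5 s up to 48 s, so a block arriving several slots late still lands in a finite bucket.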

dapplion avatar Jun 25 '22 12:06 dapplion