Audit prometheus histogram buckets
Description
Presently our metrics always use the default buckets for histograms, which work OK for times of a few milliseconds to a second, but not so well for shorter/longer times, or large integer values.
I propose we:
- Introduce a new primitive in the lighthouse_metrics crate for defining histograms together with their buckets. The underlying machinery already exists in the prometheus crate, we just aren't using it: https://docs.rs/prometheus/0.13.1/prometheus/struct.HistogramOpts.html
- Use this new primitive to overhaul existing histograms for which the buckets are poorly sized.
Metrics in need of change
- beacon_block_total_size (thanks @dapplion for flagging this)
Version
Lighthouse v2.3.1
Buckets of whole numbers, lowest bucket should be 1
- beacon_operations_per_block_attestation_total_bucket
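One possible way to get whole-number buckets starting at 1 is to use powers of two up to the largest count expected. The function name `power_of_two_buckets` and the choice of doubling are assumptions made for this sketch; a linear layout would work just as well for small ranges.

```rust
/// Hypothetical helper: whole-number bucket bounds 1, 2, 4, ... ending at
/// the first power of two that is greater than or equal to `max`.
pub fn power_of_two_buckets(max: u64) -> Vec<f64> {
    let mut buckets = Vec::new();
    let mut bound: u64 = 1;
    loop {
        buckets.push(bound as f64);
        if bound >= max {
            break;
        }
        // saturating_mul avoids overflow for very large `max` values.
        bound = bound.saturating_mul(2);
    }
    buckets
}
```

For a per-block attestation count capped at 128, `power_of_two_buckets(128)` gives eight buckets with the lowest at 1, and the resulting `Vec<f64>` can be handed to `HistogramOpts::buckets`.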
Misc:
- beacon_block_total_size_bucket: buckets should span the range of typical block sizes
These buckets could be proportional to slot time and extend beyond 1x SECONDS_PER_SLOT to capture really bad network conditions:
- beacon_block_gossip_propagation_verification_delay_time
- beacon_block_gossip_slot_start_delay_time
- beacon_block_head_imported_delay_time
- beacon_block_head_slot_start_delay_time
- beacon_block_imported_observed_delay_time
- beacon_block_observed_slot_start_delay_time
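A rough sketch of slot-proportional buckets for the delay metrics listed above. The function name `slot_delay_buckets` and the specific fractions are assumptions picked for illustration, not values proposed in this issue; the only requirement taken from the text is that the bounds scale with slot time and extend past one full slot.

```rust
/// Hypothetical helper: bucket bounds in seconds, expressed as multiples
/// of the slot duration so they scale on networks with other slot times.
pub fn slot_delay_buckets(seconds_per_slot: f64) -> Vec<f64> {
    // Assumed fractions of a slot; the last three reach beyond 1x the slot
    // time so that very late blocks still land in a finite bucket.
    const SLOT_FRACTIONS: [f64; 8] = [0.125, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 4.0];
    SLOT_FRACTIONS
        .iter()
        .map(|fraction| fraction * seconds_per_slot)
        .collect()
}
```

With a 12 second slot this yields bounds from 1.5 s up to 48 s, which could be passed to `HistogramOpts::buckets` for each of the delay histograms.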