optimism Add prometheus metrics collection to cannon

We need to collect richer metrics for threading-related behavior, including the number of steps between LL/SC instructions, the time spent between context switches, etc. These metrics are collected during a VM run, and ideally we would ship them to Prometheus as soon as they're collected.

The alternative is to keep these metrics in memory and write them out to DebugInfo. However, this may create really large debug files for the op-challenger to ingest.

Sep 23 '24 17:09 Inphi

Prometheus isn't very good at pulling metrics from short-lived processes like cannon. The normal pull model assumes that it can periodically request the metrics from a long-running server. There is a Pushgateway that lets things like batch jobs push metrics when they do run, and cannon could use that, but I'm not sure what support we have for it in Grafana Cloud. Ultimately, that's why DebugInfo was introduced to report the memory usage.

Sep 24 '24 01:09 ajsutton
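
For reference, the Pushgateway route mentioned above could look roughly like the sketch below. It uses only the Go standard library and the Pushgateway's documented `PUT /metrics/job/<job>` endpoint with the Prometheus text exposition format. The gateway address, job name, metric names, and the helper names (`formatExposition`, `pushURL`, `pushMetrics`) are all assumptions made for illustration; nothing here is existing cannon code.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"net/url"
	"sort"
	"strings"
)

// formatExposition renders gauges in the Prometheus text exposition format.
// Names are sorted so the output is deterministic.
func formatExposition(gauges map[string]float64) string {
	names := make([]string, 0, len(gauges))
	for name := range gauges {
		names = append(names, name)
	}
	sort.Strings(names)

	var b strings.Builder
	for _, name := range names {
		fmt.Fprintf(&b, "# TYPE %s gauge\n%s %g\n", name, name, gauges[name])
	}
	return b.String()
}

// pushURL builds the Pushgateway grouping URL for a job name.
func pushURL(gateway, job string) string {
	return strings.TrimRight(gateway, "/") + "/metrics/job/" + url.PathEscape(job)
}

// pushMetrics sends all gauges in a single PUT, which replaces any metrics
// previously pushed under the same job.
func pushMetrics(client *http.Client, gateway, job string, gauges map[string]float64) error {
	body := bytes.NewBufferString(formatExposition(gauges))
	req, err := http.NewRequest(http.MethodPut, pushURL(gateway, job), body)
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "text/plain; version=0.0.4")

	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("pushgateway returned %s", resp.Status)
	}
	return nil
}
```

A short-lived process would call something like `pushMetrics(http.DefaultClient, "http://localhost:9091", "cannon", gauges)` once, just before exiting. Whether Grafana Cloud can scrape such a gateway is the open question raised in the comment above.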

What about OpenTelemetry? It has the advantage of a push model and could be easy to integrate with Prometheus.

Sep 25 '24 04:09 GrapeBaBa

Good points on the short life of cannon. We can push the metrics to Influx, which is already integrated with Grafana Cloud. It's not exactly as open as Prometheus, but it gets the job done.

Sep 25 '24 16:09 Inphi

Discussed offline w/ mbaxter on some interesting metrics to collect:

maxStepsBetweenLLAndSC - maximum number of steps between LL and SC.
numReservationInvalidations - number of invalidated reservations (failed SCs).
numForcedPreemptions - number of forced preemptions, that is, the number of times a thread consumes its SCHED_QUANTUM budget.
numWakeupTraversalFail - number of times a wakeup traversal found no thread waiting on the futex.
numStepsWhileIdle - total number of steps a live thread was left in an idle state.
- This metric can be limited to the main goroutine, which I believe runs on thread 0. It would be useful to compare this value with the total number of steps needed to generate a trace, as it's a measure of the overhead of the Go runtime.

Oct 08 '24 18:10 Inphi
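
A minimal sketch of how the metrics listed above might be accumulated in memory during a run. The `ThreadingStats` type and its `On...` hooks are hypothetical names, not part of cannon; the sketch assumes the VM can call a hook at each relevant event and that step counts only increase. It tracks a single outstanding LL reservation and, following the list above, counts failed SCs as reservation invalidations.

```go
package main

// ThreadingStats is a hypothetical accumulator for the metrics listed above.
type ThreadingStats struct {
	MaxStepsBetweenLLAndSC      uint64
	NumReservationInvalidations uint64
	NumForcedPreemptions        uint64
	NumWakeupTraversalFail      uint64
	NumStepsWhileIdle           uint64

	llStep   uint64 // step at which the most recent LL executed
	llActive bool   // true while an LL is waiting for its SC
}

// OnLL records the step at which an LL instruction executed.
func (s *ThreadingStats) OnLL(step uint64) {
	s.llStep = step
	s.llActive = true
}

// OnSC updates the LL-to-SC distance and counts failed SCs.
func (s *ThreadingStats) OnSC(step uint64, success bool) {
	if s.llActive {
		if d := step - s.llStep; d > s.MaxStepsBetweenLLAndSC {
			s.MaxStepsBetweenLLAndSC = d
		}
	}
	if !success {
		s.NumReservationInvalidations++
	}
	s.llActive = false
}

// OnForcedPreemption is called when a thread exhausts its quantum.
func (s *ThreadingStats) OnForcedPreemption() { s.NumForcedPreemptions++ }

// OnWakeupTraversalFail is called when a wakeup traversal finds no waiter.
func (s *ThreadingStats) OnWakeupTraversalFail() { s.NumWakeupTraversalFail++ }

// OnIdleStep is called for each step a live thread spends idle.
func (s *ThreadingStats) OnIdleStep() { s.NumStepsWhileIdle++ }
```

Each hook is a constant-time counter update, so the per-step overhead stays small and the totals are only read once the run has finished.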

Also worth noting that all of these metrics can easily be accumulated and pushed after Cannon finishes running. So we can just add these metrics to DebugInfo and record them via Prometheus metrics.

Oct 15 '24 14:10 mbaxter
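
To illustrate the DebugInfo route: since the metrics are a handful of integers accumulated over the run, they serialize to a small JSON object that a consumer can decode and then record as Prometheus metrics. The struct name, field names, and JSON keys below are assumptions made for this sketch and do not reflect the actual DebugInfo layout.

```go
package main

import "encoding/json"

// threadingDebugInfo is a hypothetical set of fields that could be added
// to the debug output; the JSON keys are illustrative only.
type threadingDebugInfo struct {
	MaxStepsBetweenLLAndSC      uint64 `json:"max_steps_between_ll_and_sc"`
	NumReservationInvalidations uint64 `json:"num_reservation_invalidations"`
	NumForcedPreemptions        uint64 `json:"num_forced_preemptions"`
	NumWakeupTraversalFail      uint64 `json:"num_wakeup_traversal_fail"`
	NumStepsWhileIdle           uint64 `json:"num_steps_while_idle"`
}

// encodeDebugInfo serializes the accumulated totals once the run is over.
func encodeDebugInfo(info threadingDebugInfo) ([]byte, error) {
	return json.Marshal(info)
}

// decodeDebugInfo is what a consumer would call before recording the
// values as metrics.
func decodeDebugInfo(data []byte) (threadingDebugInfo, error) {
	var info threadingDebugInfo
	err := json.Unmarshal(data, &info)
	return info, err
}
```

Because only totals are written, the file size does not grow with the length of the run, which sidesteps the concern about large debug files raised at the top of the thread.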
