optimism
Add prometheus metrics collection to cannon
We need to collect richer metrics for threading-related behavior, including steps between LL/SC instructions, time spent between context switches, etc. These metrics are collected during a VM run, and ideally we would ship them to Prometheus as soon as they're collected.
The alternative is to keep these metrics in memory and write them out to DebugInfo. However, this may create really large debug files for the op-challenger to ingest.
Prometheus isn't very good at pulling metrics from short-lived processes like cannon. The normal pull model assumes that it can just periodically request the metrics from a long-running server. There is a Pushgateway that lets things like batch jobs push metrics when they do run, and cannon could use that, but I'm not sure what support we have for that in Grafana Cloud. Ultimately, that's why DebugInfo was introduced to report the memory usage.
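For reference, a minimal sketch of what the push route could look like from a short-lived process, using only the Go standard library. It assumes a Pushgateway reachable at a base URL and relies on its documented `/metrics/job/<job>` endpoint, which accepts the Prometheus text exposition format. The function names, the metric name, and the `localhost:9091` address are hypothetical and not part of cannon.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"sort"
	"strings"
)

// formatExposition renders metrics in the Prometheus text exposition format,
// one "name value" line per metric, sorted by name for stable output.
func formatExposition(metrics map[string]float64) string {
	names := make([]string, 0, len(metrics))
	for name := range metrics {
		names = append(names, name)
	}
	sort.Strings(names)
	var b strings.Builder
	for _, name := range names {
		fmt.Fprintf(&b, "%s %g\n", name, metrics[name])
	}
	return b.String()
}

// pushURL builds the Pushgateway grouping URL for a job name.
func pushURL(base, job string) string {
	return strings.TrimRight(base, "/") + "/metrics/job/" + url.PathEscape(job)
}

// pushMetrics sends all metrics in a single POST once the run has finished.
// (Hypothetical helper: cannon does not currently do this.)
func pushMetrics(base, job string, metrics map[string]float64) error {
	body := strings.NewReader(formatExposition(metrics))
	resp, err := http.Post(pushURL(base, job), "text/plain; version=0.0.4", body)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("pushgateway returned %s", resp.Status)
	}
	return nil
}

func main() {
	// Print the payload and target URL instead of pushing, so the sketch
	// runs without a Pushgateway being available.
	metrics := map[string]float64{"cannon_max_steps_between_ll_and_sc": 12}
	fmt.Print(formatExposition(metrics))
	fmt.Println(pushURL("http://localhost:9091/", "cannon"))
}
```

Whether Grafana Cloud can scrape a Pushgateway in our setup is still the open question; the sketch only shows that the client side is small.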
What about OpenTelemetry? It supports a push model and could be easy to integrate with Prometheus.
Good points on the short life of cannon. We can push the metrics to InfluxDB, which is already integrated with Grafana Cloud. It's not exactly as open as Prometheus, but it gets the job done.
Discussed offline w/ mbaxter on some interesting metrics to collect:
- maxStepsBetweenLLAndSC - maximum steps between LL and SC.
- numReservationInvalidations - number of invalid reservations (failed SCs).
- numForcedPreemptions - number of forced preemptions. That is, when a thread consumes its SCHED_QUANTUM budget.
- numWakeupTraversalFail - number of times when no thread was found waiting on a futex from wakeup traversal.
- numStepsWhileIdle - total number of steps a live thread was left in an idle state.
  - This metric can be limited to the main goroutine, which I believe runs on thread 0. It'll be useful to compare this value with the total number of steps to generate a trace, as it's a measure of the overhead of the Go runtime.
Also worth noting that all of these metrics can easily be accumulated in memory and pushed after Cannon finishes running. So we can just add these metrics to DebugInfo and record them through Prometheus metrics.