Useful metrics
Operationally, there are some obvious things to measure per flow node. These should be exposed via /metrics if they aren't already:
DB connectivity:
- number of active pool connections (vs. idle)
- sql span histograms for journalling
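One upper bound on how many stage operations we can sustain per second is (max pool connections) / <sql query span>, with the span measured in seconds and assuming each operation holds one pooled connection for the length of one query.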
Executor connectivity:
- number of active fn invocations the executor is waiting on.
Error counts:
- fn failures
- db errors
- lower-level errors, e.g. socket availability (we might conceivably bump into this if we have a naive http/1.1 connection to the fn api).
For DB connectivity we have spans of the time taken by sql persistence operations (by operation). These are then collected in a histogram by the prometheus mapper. I would argue that we also need quantiles (which means using prometheus Summaries instead of Histograms), so I can add those.
Counters (number of connections, number of fn invocations, ...) are supported in prometheus but I'm not sure if opentracing has a concept for those (still, I'm a bit ignorant when it comes to opentracing).
I don't mind using raw prometheus (or something wrapped around it) if it means we can get counters out for useful things.
I don't think opentracing concerns itself with metrics/gauge stuff, and retconning numbers from the events is a bad idea. I assume we'll need to generate prometheus metrics from internal gauges/counters alongside the event metrics.
Note: https://github.com/fnproject/flow/pull/114 adds a few of the mentioned metrics:
- DB timings (already there)
- API call timings
- Number of currently active Flows
- Number of currently active Fn invocations
- Duration of individual flows (aggregated as histogram / quantiles)
#114 is closed now; the work will be done as part of #84 instead, since that changes the api.