flow Useful metrics

Operationally, there are some obvious things to measure per flow node. These should be exposed via /metrics if they aren't already:

DB connectivity:

number of active pool connections (vs. idle)
sql span histograms for journalling

One upper bound on the number of stage operations we can sustain per second is (max pool connections) / (sql query span, in seconds), assuming each operation holds one pooled connection for the duration of its query.
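To make that bound concrete, here is a minimal Go sketch of the arithmetic. The function name `MaxStageOpsPerSecond` and the figures used in `main` are hypothetical and are not taken from the flow codebase. It also assumes each stage operation occupies exactly one pooled connection for the whole SQL span.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// MaxStageOpsPerSecond returns the rough upper bound
// (max pool connections) / (SQL query span in seconds).
// Assumption: each stage operation holds exactly one pooled connection
// for the whole span and does nothing else that limits throughput.
func MaxStageOpsPerSecond(maxPoolConns int, sqlSpan time.Duration) (float64, error) {
	if maxPoolConns <= 0 || sqlSpan <= 0 {
		return 0, errors.New("pool size and SQL span must both be positive")
	}
	return float64(maxPoolConns) / sqlSpan.Seconds(), nil
}

func main() {
	// Hypothetical figures: a pool of 20 connections and a 50ms journalling query.
	bound, err := MaxStageOpsPerSecond(20, 50*time.Millisecond)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("at most ~%.0f stage operations per second\n", bound)
}
```

With those illustrative numbers the bound works out to roughly 400 operations per second. In practice the span would probably be a high percentile from the SQL span metrics rather than a fixed value.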

Executor connectivity:

number of active fn invocations the executor is waiting on.

Error counts:

fn failures
db errors
lower-level errors: e.g., socket availability (we might conceivably run into this if we have a naive HTTP/1.1 connection to the fn api).
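As a rough illustration of the executor gauge and the error counters listed above, the following Go sketch keeps them as atomic integers and renders them in the Prometheus text exposition format. Everything here is an assumption for illustration: the `Metrics` type, its methods, and the metric names (`flow_active_fn_invocations` and so on) are hypothetical. A real service would more likely use the Prometheus Go client library than hand-rolled output; the standard library is used only to keep the sketch self-contained.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Metrics holds hypothetical internal gauges and counters for one flow node.
type Metrics struct {
	activeFnInvocations int64 // gauge: goes up and down
	fnFailures          int64 // counter: only goes up
	dbErrors            int64 // counter: only goes up
}

// InvocationStarted records that the executor is now waiting on one more fn call.
func (m *Metrics) InvocationStarted() {
	atomic.AddInt64(&m.activeFnInvocations, 1)
}

// InvocationFinished records completion and, if the call failed, a fn failure.
func (m *Metrics) InvocationFinished(failed bool) {
	atomic.AddInt64(&m.activeFnInvocations, -1)
	if failed {
		atomic.AddInt64(&m.fnFailures, 1)
	}
}

// DBError records one database error.
func (m *Metrics) DBError() {
	atomic.AddInt64(&m.dbErrors, 1)
}

// Render returns the current values in Prometheus text exposition format,
// i.e. what a /metrics handler could write to its response.
func (m *Metrics) Render() string {
	return fmt.Sprintf(
		"# TYPE flow_active_fn_invocations gauge\n"+
			"flow_active_fn_invocations %d\n"+
			"# TYPE flow_fn_failures_total counter\n"+
			"flow_fn_failures_total %d\n"+
			"# TYPE flow_db_errors_total counter\n"+
			"flow_db_errors_total %d\n",
		atomic.LoadInt64(&m.activeFnInvocations),
		atomic.LoadInt64(&m.fnFailures),
		atomic.LoadInt64(&m.dbErrors),
	)
}

func main() {
	m := &Metrics{}
	m.InvocationStarted()
	m.InvocationStarted()
	m.InvocationFinished(true) // one of the two calls fails
	m.DBError()
	fmt.Print(m.Render())
}
```

The split matters for the consumer: the active-invocation value can decrease, so it is a gauge, while the failure and error values only ever increase and are therefore counters.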

Nov 09 '17 13:11 jan-g

For DB connectivity we already have spans for the time taken by sql persistence operations (per operation). These are then collected into histograms by the prometheus mapper. I would argue that we also need quantiles (and should therefore use prometheus Summaries instead of Histograms), so I can add those.
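To illustrate the difference between the two options, here is a small Go sketch comparing an exact quantile over raw span durations with an estimate interpolated from cumulative histogram buckets. The helper names, the sample spans, and the bucket bounds are all hypothetical. The interpolation is a simplified version of the linear interpolation Prometheus applies to histogram buckets at query time, and the exact computation stands in for what a Summary approximates on the client side.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Bucket is one cumulative histogram bucket: Count observations were <= UpperBound.
type Bucket struct {
	UpperBound float64 // seconds
	Count      uint64
}

// Bucketize counts samples into cumulative buckets for sorted upper bounds.
// Samples above the last bound are ignored (this sketch has no +Inf bucket).
func Bucketize(samples, bounds []float64) []Bucket {
	buckets := make([]Bucket, len(bounds))
	for i, ub := range bounds {
		buckets[i].UpperBound = ub
		for _, s := range samples {
			if s <= ub {
				buckets[i].Count++
			}
		}
	}
	return buckets
}

// ExactQuantile returns the q-quantile (0 <= q <= 1) of the raw samples
// using the nearest-rank method.
func ExactQuantile(samples []float64, q float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	s := append([]float64(nil), samples...) // copy so the caller's slice is untouched
	sort.Float64s(s)
	rank := int(math.Ceil(q * float64(len(s))))
	if rank < 1 {
		rank = 1
	}
	return s[rank-1]
}

// EstimateQuantile estimates the q-quantile from cumulative buckets by
// interpolating linearly inside the bucket that contains the target rank.
func EstimateQuantile(buckets []Bucket, q float64) float64 {
	if len(buckets) == 0 || buckets[len(buckets)-1].Count == 0 {
		return 0
	}
	target := q * float64(buckets[len(buckets)-1].Count)
	lowerBound, lowerCount := 0.0, 0.0
	for _, b := range buckets {
		c := float64(b.Count)
		if c >= target {
			if c == lowerCount {
				return lowerBound // empty bucket, nothing to interpolate
			}
			return lowerBound + (b.UpperBound-lowerBound)*(target-lowerCount)/(c-lowerCount)
		}
		lowerBound, lowerCount = b.UpperBound, c
	}
	return buckets[len(buckets)-1].UpperBound
}

func main() {
	// Hypothetical SQL span durations in seconds and hypothetical bucket bounds.
	spans := []float64{0.01, 0.02, 0.03, 0.04, 0.06, 0.07, 0.08, 0.09, 0.2, 0.4}
	buckets := Bucketize(spans, []float64{0.05, 0.1, 0.5})
	for _, q := range []float64{0.5, 0.9} {
		fmt.Printf("q=%.2f exact=%.4fs bucket-estimate=%.4fs\n",
			q, ExactQuantile(spans, q), EstimateQuantile(buckets, q))
	}
}
```

With this made-up data the median estimate stays close to the exact value, while the 0.9 quantile lands in the wide 0.1 to 0.5 second bucket and the estimate (about 0.3s) drifts well away from the exact value (0.2s). How much this matters depends on how well the bucket bounds match the real SQL span distribution.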

Counters (number of connections, number of fn invocations, ...) are supported in prometheus but I'm not sure if opentracing has a concept for those (still, I'm a bit ignorant when it comes to opentracing).

Nov 09 '17 13:11 hhexo

I don't mind using raw prometheus (or something wrapped around it) if it means we can get counters out for useful things.

Nov 10 '17 09:11 jan-g

I don't think opentracing concerns itself with metrics/gauge stuff, and retconning numbers from the events is a bad idea. I assume we'll need to generate prometheus metrics from internal gauges/counters alongside the event metrics.

Nov 10 '17 12:11 zootalures

Note: https://github.com/fnproject/flow/pull/114 adds a few of the mentioned metrics:

DB timings (already there)
API call timings
Number of currently active Flows
Number of currently active Fn invocations
Duration of individual flows (aggregated as histogram / quantiles)

Nov 20 '17 18:11 hhexo

#114 is closed now because the work will be done as part of #84, which changes the api.

Nov 21 '17 16:11 hhexo