Useful metrics

Open jan-g opened this issue 8 years ago • 5 comments

Operationally, there are some obvious things to measure per flow node. These should be exposed via /metrics if they aren't already:

DB connectivity:

  • number of active pool connections (vs. idle)
  • sql span histograms for journalling

One upper limit on the number of stage operations we can sustain per second is (max pool connections) / (SQL query span).

Executor connectivity:

  • number of active fn invocations the executor is waiting on.

Error counts:

  • fn failures
  • db errors
  • lower-level errors: e.g., socket availability (we might conceivably bump into this if we have a naive HTTP/1.1 connection to the fn API).
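One way to keep these error counts is a single counter family labelled by error kind. The following is a hand-rolled, standard-library-only sketch that renders the Prometheus text exposition format; it is not the Prometheus client library, and the type `errorCounts`, the metric name `flow_errors_total`, and the `kind` label are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// errorCounts holds monotonically increasing error counters keyed by kind.
type errorCounts struct {
	mu     sync.Mutex
	counts map[string]uint64
}

func newErrorCounts() *errorCounts {
	return &errorCounts{counts: map[string]uint64{}}
}

// Inc records one error of the given kind, e.g. "fn", "db" or "socket".
func (e *errorCounts) Inc(kind string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.counts[kind]++
}

// Render emits the counters in the Prometheus text exposition format,
// sorted by kind so the output is stable between scrapes.
func (e *errorCounts) Render() string {
	e.mu.Lock()
	defer e.mu.Unlock()
	kinds := make([]string, 0, len(e.counts))
	for k := range e.counts {
		kinds = append(kinds, k)
	}
	sort.Strings(kinds)
	var b strings.Builder
	b.WriteString("# TYPE flow_errors_total counter\n")
	for _, k := range kinds {
		// %q is adequate for simple label values like these.
		fmt.Fprintf(&b, "flow_errors_total{kind=%q} %d\n", k, e.counts[k])
	}
	return b.String()
}

func main() {
	errs := newErrorCounts()
	errs.Inc("db")
	errs.Inc("fn")
	errs.Inc("fn")
	fmt.Print(errs.Render())
}
```

Keeping one labelled family rather than three separate metrics makes it easy to sum across kinds or break them out in a query.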

jan-g avatar Nov 09 '17 13:11 jan-g

For DB connectivity we have spans for the time taken by SQL persistence operations (per operation). The Prometheus mapper then collects these into histograms. I would argue that we also need quantiles (and should therefore use Prometheus Summaries instead of Histograms), so I can add those.

Counters (number of connections, number of fn invocations, ...) are supported in Prometheus, but I'm not sure whether OpenTracing has a concept for them (then again, I'm a bit ignorant when it comes to OpenTracing).

hhexo avatar Nov 09 '17 13:11 hhexo

I don't mind using raw prometheus (or something wrapped around it) if it means we can get counters out for useful things.

jan-g avatar Nov 10 '17 09:11 jan-g

I don't think OpenTracing concerns itself with metrics/gauges, and retconning numbers from the events is a bad idea. I assume we'll need to generate Prometheus metrics from internal gauges/counters alongside the event metrics.

zootalures avatar Nov 10 '17 12:11 zootalures

Note: https://github.com/fnproject/flow/pull/114 adds a few of the mentioned metrics.

  • DB timings (already there)
  • API call timings
  • Number of currently active Flows
  • Number of currently active Fn invocations
  • Duration of individual flows (aggregated as histogram / quantiles)

hhexo avatar Nov 20 '17 18:11 hhexo

#114 is now closed; it will be done as part of #84, since that changes the API.

hhexo avatar Nov 21 '17 16:11 hhexo