veneur icon indicating copy to clipboard operation
veneur copied to clipboard

Move runtime metrics reporting to a separate timer routine.

Open cory-stripe opened this issue 7 years ago • 5 comments

Summary

Move runtime metric reporting to a separate ticker and goroutine.

Motivation

At present our emission of runtime metrics — gc, heap etc — is tied directly to the flusher. This seems like a bad mix of concerns. It also means that our runtime metrics are tied directly to our flush "tick". In a future where this is adjustable or non-existent we need a plan!

Notes

I also added the current goroutine count, which seems like good information in case we leak goroutines.

Test plan

Existing tests. We don't test these now and this doesn't change that.

r? @stripe/observability

cory-stripe avatar Oct 19 '17 15:10 cory-stripe

Gerald Rule: Copy Observability on Veneur and Unilog pull requests

cc @stripe/observability cc @stripe/observability-stripe

stripe-ci avatar Oct 19 '17 15:10 stripe-ci

r? @stripe/observability

I don't think I have the state + focus to do this during /dev/start, sorry!

an-stripe avatar Oct 19 '17 18:10 an-stripe

r? @stripe/observability

an-stripe avatar Oct 19 '17 18:10 an-stripe

Hey @aditya-stripe, this one is ready. Note that the intervals for this and the flush are ~synchronized and it should do what you suggested in a separate PR with regards to being at a high water mark. I would argue that it's important it be separated so as not to just give us the skewed, flush-time numbers. Having a separate ticker without the "bucketing" behavior should give us something that has a bit of periodicity to it which I think is an improvement.

cory-stripe avatar Oct 20 '17 22:10 cory-stripe

At the moment, we're collecting these metrics immediately after each flush is initiated. This may or may not actually be the exact peak (or nadir) for metrics like memory usage, because of timing jitter as goroutines are scheduled. Though it is likely to be close, and at the very least, highly-correlated with the higher end of the distribution.

In reality, what we want is a measurement of the full range of the distribution, because these will change over the span of the 10-second flush cycle. We'll want to choose something that's less than 5 seconds, like 3s or 4s.

And if we're emitting these metrics multiple times per flush cycle, we'll also want to change them to be histograms, not gauges.

So, lgtm for now, but let's file a ticket to update this as well.

ChimeraCoder avatar Oct 23 '17 20:10 ChimeraCoder