
Implement a prometheus output for test-metrics scraping

Open petemounce opened this issue 3 years ago • 43 comments

Checklist
  • [x] make test-all (UNIX) passes. CI will also test this
  • [x] unit and/or integration tests are included (if applicable)
  • [x] documentation is changed or added (if applicable)

Description of change

Fixes #175, #362, #363.

Adds a prometheus output format. I want to be able to scrape my goss serve endpoints so that I can

  • get metrics for the overall suite-run duration (summary equivalent)
    • ... so I can adjust the timeout my health checks use against cloud instances to be comfortably larger than the goss run duration
    • ... so I can see overall outcomes over time
  • get outcome metrics
    • ... so I can alert on fails based on when they start to appear, triggering an investigation into what caused my state to drift from "good"
      • note - it is an anti-goal to alert on specific assertion IDs because of the metrics cardinality that would result. See the description below and the discussion in the comments
    • ... so I can understand on a more granular level whether there are duration fluctuations or outcome spikes across assertion types

This records

  • overall test-suite run outcome & duration
  • test outcomes
  • test durations
  • ... where both test outcomes and durations can be grouped by outcome and assertion-type.

Sample (from unit-test log):

# TYPE goss_tests_outcomes_duration_seconds counter
goss_tests_outcomes_duration_seconds{outcome="fail",type="command"} 0.01
goss_tests_outcomes_duration_seconds{outcome="pass",type="command"} 0.02
goss_tests_outcomes_duration_seconds{outcome="skip",type="file"} 0.01
# HELP goss_tests_outcomes_total The number of test-outcomes from this run.
# TYPE goss_tests_outcomes_total counter
goss_tests_outcomes_total{outcome="fail",type="command"} 1
goss_tests_outcomes_total{outcome="pass",type="command"} 2
goss_tests_outcomes_total{outcome="skip",type="file"} 1
# HELP goss_tests_run_duration_seconds The end-to-end duration of this run.
# TYPE goss_tests_run_duration_seconds counter
goss_tests_run_duration_seconds 60

Implementation details and notes

I expect to be using this most in the context of goss serve --format prometheus, so that I can configure my prometheus servers to scrape servers that I'm running goss endpoints on. Therefore https://prometheus.io/docs/practices/instrumentation/#online-serving-systems was the most appropriate model to choose.

I followed the advice on metric and label naming (https://prometheus.io/docs/practices/naming/) (~~er, actually I didn't initially and will push a fix for that~~ fixed in 1f2cd75).

I followed the advice on Labels (https://prometheus.io/docs/practices/instrumentation/#use-labels) and not to overuse them (https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels) - I could have added a label for each field of a resource.TestResult, but I didn't think I'd need that ever, and I definitely didn't think it would be worth the cardinality explosion that would have been at risk if I had.

~~I chose to not use the prometheus Default Registry, because this also adds in process and golang metrics that I don't think I need. This is the registry = prometheus.NewRegistry() and .With(registry) bits inside the var block.~~ It's easy to swap back to the defaults again, and the change in doing so would be additive, not breaking, to the output.

  • I chose to reverse that, since it was simply easier to then also have an always-on /metrics endpoint. That's what that ecosystem expects and uses by default.
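
For illustration, a minimal sketch of how counters with just those two labels might be declared with client_golang against the default registry; the metric names follow the sample above, but the PR's actual code may be organised differently:

```go
package outputs

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Per-test outcome counts, grouped only by outcome and assertion type
	// to keep label cardinality small.
	testOutcomes = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "goss_tests_outcomes_total",
		Help: "The number of test-outcomes from this run.",
	}, []string{"outcome", "type"})

	// Accumulated per-test durations under the same two labels.
	testDurations = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "goss_tests_outcomes_duration_seconds",
		Help: "The durations of tests from this run.",
	}, []string{"outcome", "type"})

	// End-to-end duration of the whole run.
	runDuration = promauto.NewCounter(prometheus.CounterOpts{
		Name: "goss_tests_run_duration_seconds",
		Help: "The end-to-end duration of this run.",
	})
)
```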

petemounce avatar Aug 24 '20 22:08 petemounce

@aelsabbahy this is ready for review, I think. If it's a welcome PR, then the following people might be interested too based on the previous tickets I've seen. As with #585, I'll let you decide whether/when to do that.

  • #175 pysysops beevelop alexrudd
  • #362 & #363 Smithx10 pysysops karimiehsan90 harre-orz timeu timbirk berney SuperQ mleyden

petemounce avatar Aug 24 '20 23:08 petemounce

@petemounce I think it's important for visualization over time that the operator can discover which goss check failed and when.

The format in PR https://github.com/aelsabbahy/goss/pull/175 looks like it would provide the operator with all the information they need.

Smithx10 avatar Aug 26 '20 20:08 Smithx10

@Smithx10 I agree that would be useful, but I don't think it would outweigh the storage costs and operational risk it would incur on the prometheus server.

I followed the advice on Labels (https://prometheus.io/docs/practices/instrumentation/#use-labels) and not to overuse them (https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels) - I could have added a label for each field of a resource.TestResult, but I didn't think I'd need that ever, and I definitely didn't think it would be worth the cardinality explosion that would have been at risk if I had.

The pertinent point from the 2nd link:

As a general guideline, try to keep the cardinality of your metrics below 10, and for metrics that exceed that, aim to limit them to a handful across your whole system. The vast majority of your metrics should have no labels. If you have a metric that has a cardinality over 100 or the potential to grow that large, investigate alternate solutions such as reducing the number of dimensions or moving the analysis away from monitoring and to a general-purpose processing system. To give you a better idea of the underlying numbers, let's look at node_exporter. node_exporter exposes metrics for every mounted filesystem. Every node will have in the tens of timeseries for, say, node_filesystem_avail. If you have 10,000 nodes, you will end up with roughly 100,000 timeseries for node_filesystem_avail, which is fine for Prometheus to handle. If you were to now add quota per user, you would quickly reach a double digit number of millions with 10,000 users on 10,000 nodes. This is too much for the current implementation of Prometheus. Even with smaller numbers, there's an opportunity cost as you can't have other, potentially more useful metrics on this machine any more. If you are unsure, start with no labels and add more labels over time as concrete use cases arise.

Remember also that prometheus doesn't currently have downsampling/compaction (https://github.com/prometheus-junkyard/tsdb/issues/56 and https://github.com/prometheus-junkyard/tsdb/issues/313) - so if a single value is exported that is never seen again, the space on disk for that time series is only reclaimed after the global retention period elapses.

Adding the resource_id to the label set introduces literally unbounded cardinality, and therefore risk, to the monitoring storage.

My working assumption has been the standard "monitoring metrics allow alerts on symptoms, which trigger investigation via metrics dashboards and reading centralised logs (which contain more context)".

Can you suggest any way(s) to compromise?

petemounce avatar Aug 27 '20 09:08 petemounce

@petemounce How about making it configurable? By default the endpoint only returns metrics with the type label, but the user could configure what additional labels to return by specifying them as CLI arguments? So users that might not have that many goss tests could still decide to accept the higher storage costs for better insight into which tests failed?

timeu avatar Aug 27 '20 13:08 timeu

@petemounce goss is a test tool; I hope it can output success or failure results for each test case...

harre-orz avatar Aug 27 '20 16:08 harre-orz

@timeu What would it look like to you to have it configurable? Introduce format-options for prometheus, where valid ones would be

  • resource-type
  • resource-id
  • outcome

?

Problem is, at present, there's no way I can see to easily

  • give an outputer default format-options
  • remove default format-options from an outputer
  • (bonus) list outputer-specific format-options in the command-line usage help, tied together

at the existing command-line option (a string slice) without a rather larger-scope change. How would the command-line interface need to change to support this?

@harre-orz

  • goss is a test tool, and the test outcomes are visible and scrapable within its stdout. I think it's very likely that anyone that works in a place that is going to the trouble of setting up goss to do machine-state assertions, and then the trouble to collect metrics about those centrally, also works in a place where logs are shipped centrally.
    • As well as following the prometheus team's own guidance for best practices, I'm using guidance from
      • https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/
      • https://paulbellamy.com/2017/08/symptoms-not-causes
  • I think it's a much safer pattern to not risk the integrity of monitoring's storage as a whole by introducing a large blast radius like including resource-id would. If monitoring performance degrades, or worse fails catastrophically (OOMs or runs out of storage space), that's a wake-someone-up-severity incident almost everywhere, whereas machine-state assertions going bad on some subset of hosts is more likely a bug-severity ticket.
    • Cardinality explosion is a well-documented pitfall of using a metrics/monitoring stack for something it's not intended for. Googling "prometheus cardinality explosion" got me 79k hits. Choice reading (the prometheus official doc that I've already referenced was 3rd for me):
      • https://www.robustperception.io/cardinality-is-key

        If you can't get such granular data in a metrics based monitoring system, doesn't that make it kinda useless? Not really, metrics based monitoring is one set of tradeoffs. It gives you real time information (and real time is expensive as a general rule) about your system in broad strokes, but has cardinality limitations. Complementing metrics are event logs, which tend to be more delayed and have fewer pieces of information, but no cardinality limitations.

      • https://blog.freshtracks.io/bomb-squad-automatic-detection-and-suppression-of-prometheus-cardinality-explosions-62ca8e02fa32

That said - would either of you (or someone else that might be reading) be up for making a pull-request into my branch that adds the ability to customise the set of labels that would be applied to the metrics being output?

Or, do that as a follow-up PR, treating this one as an MVP? This PR meets, in my opinion (and hope :) ), the criteria for being merged in that it adds well-tested, non-breaking functionality and moves towards what we all want, which is to monitor the outcomes and durations of our machine-state assertions so that we can

  • alert on assertions starting to fail, triggering an investigation
  • notice when durations start to drag out beyond whatever our scrape-interval is

(Those are my aims with this PR, at any rate)

petemounce avatar Aug 27 '20 23:08 petemounce

@harre-orz, @Smithx10 & @timeu - friendly ping for feedback? I would say that if this merges as-is (it's quite a large PR, after all), we can iterate towards something that satisfies us all? What do you think?

petemounce avatar Sep 10 '20 23:09 petemounce

@petemounce : Sorry for the lack of feedback. I think that sounds like a plan.

timeu avatar Sep 10 '20 23:09 timeu

@harre-orz @Smithx10 - do you have any feedback?

petemounce avatar Sep 18 '20 21:09 petemounce

@petemounce Sorry for the late feedback. I understand that cardinality explosion is a risk; I think your code is good.

harre-orz avatar Sep 19 '20 12:09 harre-orz

@harre-orz no worries; thanks for getting back to me. :)

petemounce avatar Sep 19 '20 13:09 petemounce

@Smithx10 wondering if you have any feedback on this?

aelsabbahy avatar Oct 03 '20 16:10 aelsabbahy

Note - I just added overall outcome counting, and labels to the overall outcome/duration metrics.

petemounce avatar Oct 07 '20 12:10 petemounce

Note to watchers - the change inside #611 would, I think, facilitate a follow-on that makes the prometheus output deliver resource-ids, if someone wants that to be possible.

I'd prefer that to be non-default for the reasoning above.

That would look roughly like the following (a sketch appears after the list):

  • make prometheus support the verbose format-option within ValidOptions()
  • make the counters be part of the Prometheus struct rather than package variables
  • if that option is set via the command-line:
    • adjust the NewCounterVec calls to add a labelTestId which receives the resource's ID
    • adjust Prometheus.init to initialise the counters with that extra label
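
A rough Go sketch of those steps; the goss-side names here (the Prometheus outputer struct, its init, the verbose option plumbing, the resource_id label name) are assumptions for illustration, and only the client_golang calls are real API:

```go
package outputs

import "github.com/prometheus/client_golang/prometheus"

// Prometheus is a sketch of the outputer owning its counters rather than
// using package variables.
type Prometheus struct {
	verbose      bool // would be set when the "verbose" format-option is passed
	testOutcomes *prometheus.CounterVec
}

func (p *Prometheus) init() {
	labels := []string{"outcome", "type"}
	if p.verbose {
		// Opt-in only: this adds per-resource cardinality, for the reasons above.
		labels = append(labels, "resource_id")
	}
	p.testOutcomes = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "goss_tests_outcomes_total",
		Help: "The number of test-outcomes from this run.",
	}, labels)
	prometheus.MustRegister(p.testOutcomes)
}
```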

petemounce avatar Oct 18 '20 10:10 petemounce

@aelsabbahy I think this is ready to land now, assuming it passes re-review and CI. If it fails CI I'll fix that, obviously, but later today.

petemounce avatar Oct 18 '20 10:10 petemounce

Since there are no objections from the community, I guess this is close to being merged.

@petemounce, my only question/concern is that this PR adds prometheus as an always-on endpoint anytime goss serve is called. This behavior doesn't feel like a default I would expect from a tool as an end-user.

aelsabbahy avatar Oct 27 '20 05:10 aelsabbahy

My mental model here is that goss serve is -- in contrast to the other commands that are point-in-time attended invocations -- a persistent server process. So I don't think of goss serve as a tool, but as a server embedded into a tool.

I think of persistent servers as unattended processes, and therefore I need a way to know that they are functional and healthy without investing personal attention to them.

For me, that's where alerting based on metrics comes in; that relieves me of the duty of watching the processes in favour of delegating that to a computer. An alert will interrupt me to tell me something interesting is happening, rather than me needing to poll.

So, that's why the metrics endpoint is always active; it's the endpoint that a Prometheus instance will regularly come to and scrape in its pull-mode default setup (https://prometheus.io/docs/prometheus/latest/getting_started/#configuring-prometheus-to-monitor-itself). Pull mode is the default because it is more defensive and therefore both safer and easier to scale as the monitored fleet increases in number (https://prometheus.io/docs/introduction/faq/#why-do-you-pull-rather-than-push).

In the same way, my fleet of build-farm instances all use goss serve to expose a persistent health check endpoint so that when one gets unhealthy, it is automatically killed then replaced.

Now for the "and healthy" part from above: I need insight into the processes that I'm hosting to be able to prioritise work to optimise my system. So, that's why I have retained the process and go metrics that the Prometheus default registry gathers.

In practice, the metrics are gathered inside the process when either the metrics endpoint is requested, or the health endpoint is requested with the Prometheus accept header.
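
A minimal sketch of that wiring with the standard promhttp handler; the route and handler shapes here are illustrative, not necessarily the PR's exact code:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	mux := http.NewServeMux()
	// /healthz would run the goss suite on demand (subject to caching);
	// a stub stands in for it here.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})
	// /metrics exposes whatever the default registry has gathered so far:
	// process metrics, Go runtime metrics, and the goss counters.
	mux.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", mux))
}
```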

The load that generates is, I predict, low - however, it will now also be quantifiable.

As well as the utility to customers, we could use this inside a load test and fail builds if we regress.

petemounce avatar Oct 27 '20 08:10 petemounce

Ok, so trying this PR a bit more and I'm a bit unclear on a few things:

  1. What's the best way to test this black box (using curl)? What should I be expecting to see? I tried testing it, but the output didn't make sense to me: no goss results, requires restarting goss serve, etc.

  2. My question with regards to goss serve and always-on is more along the lines that this PR makes it so you can't run goss serve without getting the Prometheus /metrics endpoint; there's no way to disable this. Also, the /metrics endpoint seems to be hardcoded for only Prometheus; how will this scale to other pull-model systems in the future?

  3. Logging and behavior are inconsistent between /healthz and /metrics; not sure whether this was intended or not.

  4. I assume caching is disabled for /metrics intentionally. (I would assume this is desired, given the nature of metrics.)

Regarding item 1 above, can you provide different examples of multiple runs of curl against goss serve with tests passing and failing and what the expected results are? It seems buggy to me, but it could just be a misunderstanding on my part.

aelsabbahy avatar Nov 02 '20 16:11 aelsabbahy

@aelsabbahy thanks for looking into this.

What's the best way to test this black box (using curl)? What should I be expecting to see? I tried testing it, but the output didn't make sense to me: no goss results, requires restarting goss serve, etc.

Just curl http://localhost:8080/metrics should give you back some prometheus text format metrics - the process ones, the golang-internal ones, and the goss-specific ones.

My question with regards to goss serve and always-on is more along the lines that this PR makes it so you can't run goss serve without getting the Prometheus /metrics endpoint; there's no way to disable this.

Oh, I see. Yes, there's no way to disable this, since I thought it was more trouble to allow that than was worthwhile, because:

  • if someone isn't using /metrics, they won't request it
  • ... and so no load relating to it will be generated

Happy to revisit if you feed back that you'd prefer an off switch. If so, would that be OK to do in a follow-up, where we can decide on things like flag naming (prometheus-specific vs. more generalised)?

Also, the /metrics endpoint seems to be hardcoded for only Prometheus; how will this scale to other pull-model systems in the future?

It is, and I don't know how it will scale to other pull-model systems in the future - I don't know what those are, so I have tried to keep what I do know tightly scoped and simple. For example, it'd be pretty easy to later do all or some of

  • supply a --metrics-path or --prometheus-metrics-path flag to define what's currently hardcoded to /metrics

In fact, the Accept header handling is also different on the /metrics endpoint because prometheus itself handles different formats if wanted.

Logging and behavior are inconsistent between /healthz and /metrics; not sure whether this was intended or not.

I actually haven't looked at consistency there at all, so this was unintentional. I was focused on "works" so far. Which inconsistencies would you like adjusting, if any?

I assume caching is disabled for /metrics intentionally. (I would assume this is desired, given the nature of metrics.)

Yes, that's right - since /metrics doesn't cause any tests to be run, I didn't think caching there was necessary. The expected use case here is

  • automated cloud-provider health check pings /healthz regularly with Accept: ...prometheus - causing tests to be run. On the order of 1 per minute.
    • (dammit - now I've checked the docs, I can't do that! I'll submit a PR around the accept header behaviour being query-string-possible too)
  • prometheus scrapes /metrics regularly - on the order of 12 per minute

Regarding item 1 above, can you provide different examples of multiple runs of curl against goss serve with tests passing and failing and what the expected results are? It seems buggy to me, but it could just be a misunderstanding on my part.

  • curl: https://github.com/aelsabbahy/goss/pull/607/files#diff-5c297f761921b9217528855917c89263456bb6091464f9ee401f297693561162R100-R103
  • unit-tests: https://github.com/aelsabbahy/goss/pull/607/files#diff-f6722a25b82fd84cd295dd8f5b535dc09524d4755a3a11ef4d30c9d62c2602ffR15-R60

The endpoint keeps track of the history of the increments to the counters since the start of goss serve.

The curl tests don't make assertions precise enough to capture that fact: at any point someone might add a new test ahead of the metrics pair there, and since the current implementation of the goss serve integration tests uses a single process for all of the tests, there's coupling.

Would you like more detail / changes to make that clearer? Totally happy to if so.

petemounce avatar Nov 02 '20 20:11 petemounce

Ok, now I have a much better understanding of this PR and how it works. Your last post clarified a lot, thank you for that. Hopefully, this leads to more constructive feedback from me.

My thoughts:

So, at a high level, this PR introduces a new concept of "metrics", which is a great new feature/paradigm, but I think it's added at the wrong level. It is currently being added as an output type... this works and requires minimal change, but feels wrong.

My suggestion:

Metrics should be its own concept. As tests run, regardless of output type, goss records metrics internally. Basically, the flow should be (a rough sketch follows below):

  • stream of test results -> metrics_recorder() -> outputter()
  • metrics can then be rendered by the "/metrics" endpoint.

This would provide the following benefits:

  1. Decouple metrics gathering from the output
  2. Makes it easier in the future to either toggle metrics or change metrics formatting
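
A rough Go sketch of that flow, with hypothetical type and function names; results tee through a recorder that updates the counters before the chosen outputter sees them:

```go
package outputs

import "github.com/prometheus/client_golang/prometheus"

// TestResult is a stand-in for goss's real result type.
type TestResult struct {
	ResourceType string
	Outcome      string
	Duration     float64 // seconds
}

// recordMetrics consumes the stream of test results, updates the counters,
// and passes every result on unchanged, decoupling metrics gathering from
// output formatting. The CounterVecs are assumed to be labelled by
// outcome and type, in that order.
func recordMetrics(in <-chan TestResult, outcomes, durations *prometheus.CounterVec) <-chan TestResult {
	out := make(chan TestResult)
	go func() {
		defer close(out)
		for r := range in {
			outcomes.WithLabelValues(r.Outcome, r.ResourceType).Inc()
			durations.WithLabelValues(r.Outcome, r.ResourceType).Add(r.Duration)
			out <- r // the outputter consumes this channel as before
		}
	}()
	return out
}
```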

aelsabbahy avatar Nov 03 '20 18:11 aelsabbahy

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 02 '21 20:01 stale[bot]

Not so stale. Or maybe it is, but if it really is, that's very disappointing.

freeseacher avatar Jan 03 '21 02:01 freeseacher

I intend to continue this, but I don't have the brain space quite at the moment due to some personal life things.

@aelsabbahy does it make sense to increment the outcome metric at the point it becomes known, within the channel (if that's the right term)? If so, then I think adding a start time to a test result struct might really be all that's needed; when a metrics request arrives, we render whatever is current.

petemounce avatar Jan 03 '21 21:01 petemounce

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Mar 04 '21 21:03 stale[bot]

Dear watchers; apologies for the latency. I've currently got some quite impactful stuff going on in my life, and while I do intend to finish this, I cannot commit to a timeline. (If someone else would like to build on top of this and finish it earlier along the lines discussed, I'd have no issue with that either.)

petemounce avatar Mar 05 '21 07:03 petemounce

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar May 04 '21 08:05 stale[bot]

not stale

freeseacher avatar May 04 '21 09:05 freeseacher

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 03 '21 12:07 stale[bot]

:( not stale

freeseacher avatar Jul 03 '21 19:07 freeseacher

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 02 '21 02:09 stale[bot]

:(

freeseacher avatar Sep 03 '21 16:09 freeseacher

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Nov 02 '21 18:11 stale[bot]

disappointed

freeseacher avatar Nov 10 '21 19:11 freeseacher

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 12 '22 00:01 stale[bot]

more disappointed

freeseacher avatar Jan 12 '22 19:01 freeseacher

@aelsabbahy I think it would be great to have prometheus output. If I understand correctly, this working PR was stopped by your comment https://github.com/aelsabbahy/goss/pull/607#issuecomment-721302850 about the /metrics endpoint.

I don't think I can do the proposed refactoring, but I think I could just rip out the metrics and make this PR smaller. Would this help to get it merged?

t2d avatar Apr 28 '22 15:04 t2d

The intent of this PR is to add metrics support. I haven't understood what you propose to remove from this to enable its merge?

petemounce avatar Apr 28 '22 16:04 petemounce

First, thanks for your work @petemounce ! Second, I don't need /metrics today. But I'm trying to replace nagios with prometheus and goss is one of the last issues. It would be enough if I could just collect the info from /healthz in a manner that prometheus understands.

t2d avatar Apr 28 '22 16:04 t2d

It's been so long... my memory is a bit fuzzy on this.

What's the current status of the PR and what remains?

@petemounce Your last two comments seem to indicate there was some remaining work outstanding?

aelsabbahy avatar Apr 28 '22 16:04 aelsabbahy

First, thanks for your work @petemounce ! Second, I don't need /metrics today. But I'm trying to replace nagios with prometheus and goss is one of the last issues. It would be enough if I could just collect the info from /healthz in a manner that prometheus understands.

Both endpoints are served via the same implementation as far as I recall.

@aelsabbahy - basically, the remaining work is what you described within https://github.com/aelsabbahy/goss/pull/607#issuecomment-721302850

petemounce avatar Apr 28 '22 19:04 petemounce

If this is good to go, feel free to rebase and I can merge it.

aelsabbahy avatar May 20 '22 13:05 aelsabbahy

I don't think it is, because of the still-outstanding work to make metrics a concept introduced at the level you directed in https://github.com/aelsabbahy/goss/pull/607#issuecomment-721302850

petemounce avatar May 20 '22 14:05 petemounce