Add Prometheus HTTP Endpoint

Open Smithx10 opened this issue 7 years ago • 25 comments

It would be nice for goss validations to be consumed by Prometheus (prometheus.io). I will work on this a little.

Smithx10 avatar Jun 19 '18 22:06 Smithx10

I know there was a previous effort to implement this in PR https://github.com/aelsabbahy/goss/pull/175

Might be helpful

aelsabbahy avatar Jun 19 '18 22:06 aelsabbahy

@aelsabbahy,

Currently I have something basic working with the prometheus client.

https://github.com/Smithx10/goss/commit/b336a7e146843393f4a5266495cf62525f2ae131

I'd like the metric name to be based on the resource type, but couldn't figure out how to vary the Name attribute passed to the NewGaugeVec constructor without causing Prometheus to panic because of duplicates.

// One gauge vector for all assertions; the labels identify the resource and property under test.
gossGauge = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "goss",
		Help: "Lets you know if goss assertions were true 0, or false 1",
	},
	[]string{"resource_type", "resource_id", "property", "title"},
)

Do you think these gauges should be specific to each resource type?

Here is the output:

# HELP goss Lets you know if goss assertions were true 0, or false 1
# TYPE goss gauge
goss{property="exists",resource_id="bruce.smith",resource_type="User",title=""} 1
goss{property="exists",resource_id="jim",resource_type="User",title=""} 1
goss{property="exists",resource_id="smith",resource_type="User",title=""} 1
goss{property="ip",resource_id="tcp:8080",resource_type="Port",title=""} 0
goss{property="listening",resource_id="tcp:8080",resource_type="Port",title=""} 1
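On the question of per-resource-type metric names: assuming the panic came from registering the same metric name more than once (`prometheus.MustRegister` panics on a duplicate), one option is to create each gauge once and reuse it on later lookups. The sketch below uses a plain map as a stand-in for the Prometheus registry, so `registry`, `gauge`, `gaugeFor`, and the `goss_<type>` naming are all hypothetical, not part of goss or the client library.

```go
package main

import (
	"fmt"
	"strings"
)

// gauge is a hypothetical stand-in for a labelled Prometheus gauge.
type gauge struct {
	name   string
	values map[string]float64 // keyed by "resource_id/property"
}

// registry hands out exactly one gauge per metric name, so a name is
// never created (or registered) twice.
type registry struct {
	gauges map[string]*gauge
}

// gaugeFor derives a metric name from the resource type and returns the
// cached gauge if one already exists for that name.
func (r *registry) gaugeFor(resourceType string) *gauge {
	name := "goss_" + strings.ToLower(resourceType)
	if g, ok := r.gauges[name]; ok {
		return g
	}
	g := &gauge{name: name, values: map[string]float64{}}
	r.gauges[name] = g
	return g
}

func main() {
	r := &registry{gauges: map[string]*gauge{}}
	r.gaugeFor("User").values["jim/exists"] = 1
	r.gaugeFor("User").values["smith/exists"] = 1 // reuses goss_user
	r.gaugeFor("Port").values["tcp:8080/listening"] = 1
	fmt.Println(len(r.gauges), "metric names created")
}
```

With the real client library the same idea would mean building and registering one GaugeVec per resource type up front, then looking it up by type while iterating over results.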

Smithx10 avatar Jun 21 '18 01:06 Smithx10

Hey @Smithx10 👋 I once had a go at this (the original #175 attempt) but wasn't really sure what the expected output or use should be at the time. I'd be glad to collaborate and work out how goss and Prometheus could work happily together.

pysysops avatar Jun 23 '18 19:06 pysysops

I have a question about the actual value / metric produced: should it be a binary 1 / 0, or would the time it took to complete the check be useful for identifying slow DNS or HTTP responses, or unusual slowness of any resource such as the filesystem?

I had thought that a simple true or false value would be better suited to a metric label. I don't use Prometheus, so maybe you have some experience or opinions you can share about its usage and what good looks like in the Prometheus world?

pysysops avatar Jun 23 '18 20:06 pysysops

I haven't used Prometheus myself, so I can't really comment on this.

If you guys come to an agreement, it makes sense to add this to goss. If it ends up having many different opinions I'm wondering if it makes sense as a sample script in extra/ that parses the Goss json output and reformats it?

What do you guys think?

aelsabbahy avatar Jun 24 '18 00:06 aelsabbahy

Thanks @aelsabbahy I made use of this quick docker-compose prometheus stack: https://github.com/vegasbrianc/prometheus and came up with #363 which is similar to the above idea / output but formats results to Prometheus text format output rather than using the client library.

pysysops avatar Jun 24 '18 09:06 pysysops

I'm thinking of creating a Prometheus collector instead, in order to get metric names similar to the ones @pysysops has. Hopefully I'll get some time and figure it out.

Smithx10 avatar Jun 24 '18 12:06 Smithx10

@pysysops ,

I'm not really using goss as a tool for health checking applications, because I really don't believe that's the ethos of the project. All I, as a Prometheus / Grafana user, am interested in is finding out when my configuration drifted, for how long it drifted, and whether it came back from the drift. This is valuable if multiple actors are acting upon an infrastructure. Ex. someone logged in and changed a configuration, an event handler, etc.

Does that make sense?

Smithx10 avatar Jul 06 '18 14:07 Smithx10

Isn't one possible way of getting prometheus and goss to work together to use goss serve and the blackbox_exporter (https://github.com/prometheus/blackbox_exporter) for checking the http status code?

mgier avatar Aug 23 '19 06:08 mgier

It seems work on this started and stopped multiple times with no agreement on a solution.

Anyone here know if: A. This is possible or is everyone's needs unique? B. If it makes sense to be a goss output format?

aelsabbahy avatar Dec 18 '19 15:12 aelsabbahy

@aelsabbahy Hi, I am using goss and need this feature. Please add a Prometheus exporter to goss.

karimiehsan90 avatar Dec 19 '19 13:12 karimiehsan90

@karimiehsan90 This will have to be submitted by a contributor, since I have personally never used Prometheus. I'm fine with it as long as a PR is submitted that is agreed upon.

The one limitation I will say is that goss can have a Prometheus output format, but it should not be pushing results over the network.

aelsabbahy avatar Dec 19 '19 15:12 aelsabbahy

Hi

Please excuse my poor English.

I also implemented Prometheus output for goss: https://github.com/harre-orz/goss/commit/608323699f39d0c4823e7dff6d932e74fc8e758b

The output format is as follows.

  1. Collect the result as success = 0, failure = 1, skipped = 2 in goss_result (like https://github.com/Smithx10/goss/commit/b336a7e146843393f4a5266495cf62525f2ae131)
# HELP goss_result Lets you know if goss assertions were true 0, or false 1, or skip 2
# TYPE goss_result gauge
goss_result{property="enabled",resource_id="sshd",resource_type="Service",title=""} 1
goss_result{property="running",resource_id="sshd",resource_type="Process",title=""} 0
goss_result{property="running",resource_id="sshd",resource_type="Service",title=""} 2
  2. Collect the execution time in goss_duration
# HELP goss_duration Lets you know duration of goss execution
# TYPE goss_duration gauge
goss_duration{property="enabled",resource_id="sshd",resource_type="Service",title=""} 0.007284487
goss_duration{property="running",resource_id="sshd",resource_type="Process",title=""} 1.125e-05
goss_duration{property="running",resource_id="sshd",resource_type="Service",title=""} 0

I think it is necessary to collect the goss_result and goss_duration metrics separately, and to give both the same set of labels.

Considering different formats

If a result label is appended as shown below, the dimensions differ between series and I cannot write efficient PromQL queries.

# HELP goss_result bad example 1
# TYPE goss_result gauge
goss_result{property="enabled",resource_id="sshd",resource_type="Service",title="", result="success"} 1
goss_result{property="running",resource_id="sshd",resource_type="Process",title="", result="failure"} 1
goss_result{property="running",resource_id="sshd",resource_type="Service",title="", result="skipped"} 1

If the dimensions are made to match, three times as many series are needed, as shown below.

# HELP goss_result bad example 2
# TYPE goss_result gauge
goss_result{property="enabled",resource_id="sshd",resource_type="Service",title="", result="success"} 1
goss_result{property="enabled",resource_id="sshd",resource_type="Service",title="", result="failure"} 0
goss_result{property="enabled",resource_id="sshd",resource_type="Service",title="", result="skipped"} 0
goss_result{property="running",resource_id="sshd",resource_type="Process",title="", result="success"} 0
goss_result{property="running",resource_id="sshd",resource_type="Process",title="", result="failure"} 1
goss_result{property="running",resource_id="sshd",resource_type="Process",title="", result="skipped"} 0
goss_result{property="running",resource_id="sshd",resource_type="Service",title="", result="success"} 0
goss_result{property="running",resource_id="sshd",resource_type="Service",title="", result="failure"} 0
goss_result{property="running",resource_id="sshd",resource_type="Service",title="", result="skipped"} 1

Therefore, I represent the goss_result metric numerically: success = 0, failure = 1, skipped = 2.
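The series-count argument can be checked with a small sketch. This is not the code from the linked commit; the `Test` type and both functions are hypothetical and only reproduce the two layouts discussed above for the same three sshd checks.

```go
package main

import "fmt"

// Outcome values follow the numeric encoding proposed above.
const (
	Success = 0
	Failure = 1
	Skipped = 2
)

// Test is a hypothetical stand-in for one evaluated goss assertion.
type Test struct {
	ResourceType string
	ResourceID   string
	Property     string
	Outcome      int
}

// labels renders the shared label set; %q is adequate here because the
// example values are plain ASCII without special characters.
func labels(t Test) string {
	return fmt.Sprintf(`property=%q,resource_id=%q,resource_type=%q,title=""`,
		t.Property, t.ResourceID, t.ResourceType)
}

// numericSeries emits one series per test; the sample value carries the outcome.
func numericSeries(tests []Test) []string {
	out := make([]string, 0, len(tests))
	for _, t := range tests {
		out = append(out, fmt.Sprintf("goss_result{%s} %d", labels(t), t.Outcome))
	}
	return out
}

// labelledSeries emits one series per (test, outcome) pair, as in "bad example 2".
func labelledSeries(tests []Test) []string {
	names := []string{"success", "failure", "skipped"}
	out := make([]string, 0, len(tests)*len(names))
	for _, t := range tests {
		for i, name := range names {
			v := 0
			if t.Outcome == i {
				v = 1
			}
			out = append(out, fmt.Sprintf("goss_result{%s,result=%q} %d", labels(t), name, v))
		}
	}
	return out
}

func main() {
	tests := []Test{
		{"Service", "sshd", "enabled", Failure},
		{"Process", "sshd", "running", Success},
		{"Service", "sshd", "running", Skipped},
	}
	fmt.Println(len(numericSeries(tests)), "series with numeric values")
	fmt.Println(len(labelledSeries(tests)), "series with a result label")
}
```

The numeric layout keeps one series per assertion, which is the property the comment above relies on; the cost is that the value has to be interpreted as an enum rather than a boolean.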

harre-orz avatar May 24 '20 14:05 harre-orz

@harre-orz Looks good. I would have 2 suggestions:

  1. Add the unit to the goss_duration metric (i.e. goss_duration_seconds) according to the best practices of Prometheus metric naming
  2. Maybe add 4 more metrics for: (goss_tests_total, goss_tests_failed_total and goss_tests_skipped_total, goss_test_duration_seconds) so it might be easier to get those numbers although potentially some could be retrieved with promql

timeu avatar May 28 '20 20:05 timeu

Hi @timeu, Thank you for your good suggestions.

  1. goss_duration_seconds is a good name. Changed to it. https://github.com/harre-orz/goss/commit/ab578eaa03616b196f5772f66662a596c0ad69ca

  2. I think summary metrics are not needed, because they can be calculated with PromQL:

# Ex1: get test total count by PromQL
count(goss_result{}) or on() vector(0)

# Ex2: get failed test count by PromQL 
count(goss_result{} == 1) or on() vector(0)

But this PromQL is difficult. Do you think it is better to create summary metrics?

If you want to include summary metrics, my suggestion for the output format is below:

# HELP goss_tests_count Test count of goss assertions
# TYPE goss_tests_count gauge
goss_tests_count 3
# HELP goss_tests_failed_count Test failed count of goss assertions
# TYPE goss_tests_failed_count gauge
goss_tests_failed_count 1
# HELP goss_tests_skipped_count Test skipped count of goss assertions
# TYPE goss_tests_skipped_count gauge
goss_tests_skipped_count 1

https://github.com/harre-orz/goss/commit/721e1ab0ff842033bddd464f1dbe5e999f54c83e
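As an illustration of what the summary metrics amount to, the counts can be derived from the per-test outcomes in a single pass. This is only a sketch and not the code from the commit above; `summarize` and `renderSummary` are hypothetical names, and the outcomes use the 0 / 1 / 2 encoding from the earlier comment.

```go
package main

import (
	"fmt"
	"strings"
)

// summarize counts all tests plus the failed (1) and skipped (2) ones.
func summarize(outcomes []int) (total, failed, skipped int) {
	for _, o := range outcomes {
		total++
		switch o {
		case 1:
			failed++
		case 2:
			skipped++
		}
	}
	return total, failed, skipped
}

// renderSummary writes the three summary gauges in Prometheus text format.
func renderSummary(outcomes []int) string {
	total, failed, skipped := summarize(outcomes)
	var b strings.Builder
	fmt.Fprintf(&b, "goss_tests_count %d\n", total)
	fmt.Fprintf(&b, "goss_tests_failed_count %d\n", failed)
	fmt.Fprintf(&b, "goss_tests_skipped_count %d\n", skipped)
	return b.String()
}

func main() {
	// Outcomes of the three sshd checks shown earlier: failure, success, skipped.
	fmt.Print(renderSummary([]int{1, 0, 2}))
}
```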

harre-orz avatar May 29 '20 14:05 harre-orz

@harre-orz : You might be right regarding the summary statistics. I am just wondering if the other output formats (json, etc) output those summary statistics as well and if it makes sense to have the prometheus output aligned like that ? I am not sure what the best practice is regarding prometheus and those summary statistics.

If you output the summary statistics, then I would recommend using goss_tests_skipped_total instead of goss_tests_skipped_count (and likewise for the other ones). According to https://prometheus.io/docs/practices/naming/ and https://prometheus.io/docs/instrumenting/writing_exporters/, _count is for summaries and _total is for regular counters.

Edit: If you output the summary statistics, I would also output goss_tests_duration_seconds

Edit2: Thinking a bit more about the _total vs _count suffix, I think in the case of total test results _count could also be correct.

timeu avatar May 29 '20 14:05 timeu

Thank you @timeu, I think it's better to include the execution time of goss.

The execution time of goss (goss_tests_duration_seconds) is not equal to the sum of goss_duration_seconds. Other formats (Ex: json) have similar specifications.

The goss_tests_duration_seconds metric is as follows:

# HELP goss_tests_duration_seconds Execution time of goss assertions
# TYPE goss_tests_duration_seconds gauge
goss_tests_duration_seconds 0.013728257

https://github.com/harre-orz/goss/commit/9c947dadab1a52af441980217139c042d963f743

I think _total is not the proper suffix for an accumulated count. I read https://prometheus.io/docs/instrumenting/writing_exporters/, and maybe _sum would fit better than _count.

harre-orz avatar May 30 '20 14:05 harre-orz

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 29 '20 15:07 stale[bot]

For those who commented on this issue, please see PR #607. I would be interested in a review from the community and don't want to merge something that's not agreed upon.

aelsabbahy avatar Aug 25 '20 16:08 aelsabbahy

Last call for feedback on #607 from @petemounce

I will most likely merge it in a week or so if no one has objections to the current implementation.

aelsabbahy avatar Oct 03 '20 16:10 aelsabbahy

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Dec 02 '20 17:12 stale[bot]

up. not stale.

freeseacher avatar Dec 02 '20 19:12 freeseacher

Marked as approved so stale bot leaves it alone.

PR/implementation still being worked on, but at this point metrics endpoint (prometheus) is approved.

aelsabbahy avatar Dec 02 '20 22:12 aelsabbahy

If somebody needs this, I created a sidecar container that does the exporting: https://github.com/DracoBlue/goss-metrics-exporter

DracoBlue avatar Feb 22 '21 07:02 DracoBlue

Hello all, saw a new attempt at this here: #771

I would love some feedback/reviews from those interested here if this is the preferred approach for the community over #607

aelsabbahy avatar Sep 01 '22 16:09 aelsabbahy

#607 has been merged, marking this as closed since it will be in the next release.

Thank you all for your time, opinions, and contributions!

aelsabbahy avatar Oct 07 '22 15:10 aelsabbahy