Expose metrics

Open lkysow opened this issue 6 years ago • 25 comments

Via @mechastorm, they would like Atlantis to expose metrics around:

  • number of plan/applys
  • number of errors encountered
  • time until a plan ran successfully after an error was detected (that would be our MTTR - mean time to recover)
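
As a rough illustration of what those three numbers could look like, here is a minimal Python sketch. It is not how Atlantis computes anything: the `summarize` function and the `(timestamp, command, succeeded)` tuple shape are hypothetical, and it assumes a single project whose runs are already sorted by time, with recovery measured from the first failure to the next success.

```python
from datetime import datetime, timedelta


def summarize(runs):
    """Count plans/applies/errors and compute mean time to recover.

    runs: list of (timestamp, command, succeeded) tuples sorted by time,
    where command is "plan" or "apply" (hypothetical shape, for illustration).
    """
    counts = {"plan": 0, "apply": 0, "errors": 0}
    recoveries = []        # durations from first failure to next success
    failing_since = None   # timestamp of the first failure in the current streak

    for ts, command, ok in runs:
        counts[command] += 1
        if not ok:
            counts["errors"] += 1
            if failing_since is None:
                failing_since = ts
        elif failing_since is not None:
            recoveries.append(ts - failing_since)
            failing_since = None

    mttr = sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
    return counts, mttr
```

Repeated failures inside one streak count as separate errors but as a single outage for the MTTR, which seems closest to the intent here.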

lkysow avatar Sep 07 '18 18:09 lkysow

Prometheus please

psalaberria002 avatar Sep 17 '18 20:09 psalaberria002

> Prometheus please

I'd like to focus on providing an endpoint, /metrics, that returns a JSON response. This allows it to be scraped and mutated by monitoring solutions rather than saying you need to run Prometheus to get metrics out. In the long term, building in more native support for Prometheus, Graphite, StatsD, etc. might be nice, but I think that to get the most bang for your buck the initial implementation should be inclusive rather than rely on a single common piece of tech. Just my $0.02.

majormoses avatar Sep 17 '18 22:09 majormoses

Any progress on this one?

psalaberria002 avatar Nov 01 '18 08:11 psalaberria002

Nope!

lkysow avatar Nov 01 '18 16:11 lkysow

Here's a basic RFC for this.
https://docs.google.com/document/d/1GwCvqEzQx0B-tEtq4T4H_LJ_7IddIP_ItmlM1zUTG2I/edit

kent-b avatar Apr 09 '19 12:04 kent-b

https://openmetrics.io/ could be an option, although it's still in its infancy

gwkunze avatar May 08 '19 11:05 gwkunze

@lkysow How do you think metrics should be collected and exposed? Any preference?

I think we should use an existing library for collecting metrics (Prometheus, or perhaps OpenMetrics in the future) and not reinvent the wheel. There are hundreds of Prometheus exporters, so you just need a sidecar to expose the metrics in your preferred format or to send them to your metrics store.

psalaberria002 avatar Sep 25 '19 10:09 psalaberria002

> Prometheus please

> I'd like to focus on providing an endpoint, /metrics, that returns a JSON response. This allows it to be scraped and mutated by monitoring solutions rather than saying you need to run Prometheus to get metrics out. In the long term, building in more native support for Prometheus, Graphite, StatsD, etc. might be nice, but I think that to get the most bang for your buck the initial implementation should be inclusive rather than rely on a single common piece of tech. Just my $0.02.

You could default to exposing as JSON and give the option (URL parameter) to change the format to something else, i.e. Prometheus. Consul and Nomad allow for this.
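
A minimal sketch of that idea in Python, just to show the shape: JSON by default, Prometheus text format when the query string carries `format=prometheus`. The `render_metrics` function and the `atlantis_plans_total` style names are made up for illustration, and it only handles plain counters without labels.

```python
import json
from urllib.parse import parse_qs


def render_metrics(query_string, counters):
    """Return (content_type, body) for a metrics request.

    query_string: raw query string, e.g. "format=prometheus" (may be empty).
    counters: dict of counter name -> numeric value (hypothetical names).
    """
    fmt = parse_qs(query_string).get("format", ["json"])[0]

    if fmt == "prometheus":
        # Prometheus text exposition format: a TYPE line, then "name value".
        lines = []
        for name, value in sorted(counters.items()):
            lines.append(f"# TYPE {name} counter")
            lines.append(f"{name} {value}")
        return "text/plain; version=0.0.4", "\n".join(lines) + "\n"

    # Default: plain JSON that any monitoring tool can scrape and reshape.
    return "application/json", json.dumps(counters, sort_keys=True)
```

The rendering is kept separate from the HTTP handler so the same counters can feed either format without the collection code knowing which one was asked for.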

xbglowx avatar Sep 25 '19 13:09 xbglowx

@xbglowx How do they do metric collection internally? Have they reimplemented Counters, Gauges, Histograms, etc?

Edit: Ok, they are using https://github.com/armon/go-metrics which could be an option. I am gonna give that a try.

psalaberria002 avatar Sep 25 '19 14:09 psalaberria002

That library only supports Gauges and Counters. And personally I don't like that it tries to deal with all kinds of sinks. I don't think that logic should be built into Atlantis. Sidecar extractors solve the issue in a much cleaner manner.

psalaberria002 avatar Sep 25 '19 15:09 psalaberria002

@caryyu please use the reactions on the post rather than adding comments.

lkysow avatar Nov 01 '19 16:11 lkysow

Datadog integration?

waltervargas avatar Apr 09 '20 09:04 waltervargas

In lieu of metrics support, how are people currently monitoring their atlantis deploy to make sure it's healthy?

cep21 avatar Aug 30 '20 08:08 cep21

Another option could be to log metrics in some structured format like EMF: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Specification.html

At least in AWS this would be easy to parse out into Cloudwatch metrics. Not sure if any other tools have added support for the spec.
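
For a sense of what such a log line looks like, here is a hedged Python sketch that builds one EMF record as a JSON string. The `emf_record` helper, the `Atlantis` namespace, and the metric and dimension names are all hypothetical; only the `_aws` / `CloudWatchMetrics` envelope follows the spec linked above.

```python
import json
import time


def emf_record(namespace, metric, value, unit="Count",
               dimensions=None, timestamp_ms=None):
    """Build one CloudWatch Embedded Metric Format log line.

    dimensions: dict of dimension name -> value (hypothetical names).
    timestamp_ms: epoch milliseconds; defaults to the current time.
    """
    dimensions = dimensions or {}
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)

    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                # One dimension set listing the dimension keys used below.
                "Dimensions": [list(dimensions)],
                "Metrics": [{"Name": metric, "Unit": unit}],
            }],
        },
        # The metric value lives at the top level, keyed by its name.
        metric: value,
    }
    # Dimension values also live at the top level of the record.
    record.update(dimensions)
    return json.dumps(record)
```

Printing `emf_record("Atlantis", "PlanErrors", 1, dimensions={"Repo": "example/repo"})` to stdout would give one such structured line per event.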

mwarkentin avatar Aug 30 '20 17:08 mwarkentin

It's going to be tough to get approval to use Atlantis without a prometheus metrics endpoint. I'm wondering how others are monitoring Atlantis uptime?

tewing avatar Oct 13 '20 19:10 tewing

Absolutely agree that we should add this, it's just not something that IMO ranks as the most important problem for atlantis to solve at this moment. That is not to say my opinion is important :wink: . If you or your org feel it is, this is OSS and someone (props to @psalaberria002 for taking a swing at it) can invest developer time or hire a contractor to build the feature. As Luke said, always vote with :+1:/:-1: on the main comment to show your support for or opposition to an issue, as that is what github lets you sort on.

I know everyone (myself included) loves data, but in the absence of data I can offer anecdotal advice from real usage. I have run atlantis at multiple orgs for years and have had 0 problems from an uptime perspective. We ran it on fairly typical ec2 instances (something like a t or m class, medium/large). If we have a lot going on then we see some elevated cpu, but every time terraform was the cause, and the resources are released after terraform finishes executing. I have not observed any memory, file descriptor, or other resource leaks in a number of years. I can't say that about many projects that do offer such metrics :laughing:. Standard resource monitoring and an http health check have so far worked out pretty well for me. I mostly used cloudwatch on the (E|A)LB (which also offloaded TLS) and sensu for the standard resources (disk, memory, cpu, network, etc.), but those could be just about anything. I found it to be much more CPU bound, so if you really wanted to tune it I would stick with a c class instance instead. I think if I had to pick one metric to wish for, it would be the longest running plan, to catch times where we have been rate limited (I am looking at you, Github). We do have plans to move our atlantis instance into k8s next quarter; I will let you know what we end up changing, if anything.

Personal Plea/Rant to the Industry: there is no single monitoring system in my experience that covers everything and does it the best. They all have their strengths and weaknesses. Saying someone can't use a solution because it is not supported by a specific monitoring product is ludicrous at an engineering organization. There are always options; while it might not be sexy, running a sidecar for something like atlantis can work just fine for many use cases. I had to build monitoring for production docker setups before there were projects like prometheus, docker monitoring apis, docker exec, etc. We always found clever ways to meet the needs of our customers regardless of where their apps were at. Eventually the solutions mature over time and we replace the clever workaround as it is no longer needed.

majormoses avatar Oct 14 '20 02:10 majormoses

We have metrics support in our fork, however it uses statsd in the form of github.com/lyft/gostats.

Here is the commit: https://github.com/lyft/atlantis/commit/37c200fee9f3dcd30471a99045a87e9ec902b275

If there are enough likes on this, seeing as it's already implemented, I can just upstream it for others to build upon/use. If it helps, I can also write a tutorial on how to set up statsd with atlantis. I know people were expressing their desire for prometheus, but this is already done and used in production, so it could be a starting point at least.

nishkrishnan avatar Mar 31 '21 21:03 nishkrishnan

Awesome work. Thanks @nishkrishnan

cep21 avatar Apr 02 '21 20:04 cep21

@nishkrishnan would be great to see your work here as PR ;)

haarchri avatar Jun 22 '21 18:06 haarchri

@nishkrishnan any plans to open a PR?

smitthakkar96 avatar Mar 14 '22 17:03 smitthakkar96

Yeah, will do. Sorry about that, I must have missed all this stuff.

nishkrishnan avatar Mar 14 '22 18:03 nishkrishnan

https://github.com/runatlantis/atlantis/pull/2147

nishkrishnan avatar Mar 16 '22 23:03 nishkrishnan

I have a PR open to support Prometheus metrics: https://github.com/runatlantis/atlantis/pull/2204

yoonsio avatar Apr 17 '22 16:04 yoonsio

#2204 is now merged, so I believe this can be closed :) (thanks @yoonsio )

nuno-silva avatar Jul 18 '22 10:07 nuno-silva

Thanks for the work @yoonsio. However, in 0.19.8, when trying to implement

  metrics:
    prometheus:
      endpoint: /metrics

we get an error of

Error: initializing server: parsing /etc/atlantis/repos.yaml file: yaml: unmarshal errors:
  line 87: field prometheus not found in type raw.Metrics

which is strange because in #2204 we can clearly see prometheus being added to metrics here

ekhaydarov avatar Sep 13 '22 08:09 ekhaydarov

@ekhaydarov can you try 0.19.9 ?

Also, the whitespace in your yaml sample seems off. Just like policies and repos, metrics is also a root-level key. It should probably be documented as such.

https://github.com/runatlantis/atlantis/blob/d1d1539ced062ccb8dc2a0368b4e4c802b6799b6/server/core/config/raw/global_cfg.go#L14-L19

metrics:
  prometheus:
    endpoint: /metrics
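
Once the server starts cleanly with that config, one rough way to confirm the endpoint is serving is to fetch it and parse the samples. This is only a sketch: the `parse_samples` and `scrape` helpers are hypothetical, the URL assumes Atlantis is listening on localhost port 4141, and the parser assumes each sample line ends with its value (no trailing timestamp).

```python
from urllib.request import urlopen


def parse_samples(text):
    """Parse Prometheus text output into {sample name (with labels): value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Split on the last space so label values containing spaces survive.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


def scrape(url="http://localhost:4141/metrics", timeout=5):
    """Fetch the endpoint and return its parsed samples (URL is an assumption)."""
    with urlopen(url, timeout=timeout) as resp:
        return parse_samples(resp.read().decode("utf-8"))
```

Calling `scrape()` against a running instance should return a non-empty dict if the `metrics` block was picked up.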

nitrocode avatar Oct 06 '22 19:10 nitrocode