Expose metrics
Via @mechastorm, they would like Atlantis to expose metrics around:
- number of plans/applies
- number of errors encountered
- time until a plan runs successfully after an error was detected (that would be our MTTR - mean time to recover)
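As a rough illustration of how these could be modeled, here is a minimal, hypothetical sketch using Prometheus's Go client (one of the options discussed below); the metric names, the port, and the MTTR-as-histogram choice are assumptions, not an agreed design:

```go
// Hypothetical sketch, not Atlantis code: the three requested metrics
// expressed with prometheus/client_golang.
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	plansTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "atlantis_plans_total",
		Help: "Number of plan operations executed.",
	})
	appliesTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "atlantis_applies_total",
		Help: "Number of apply operations executed.",
	})
	errorsTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "atlantis_errors_total",
		Help: "Number of plan/apply errors encountered.",
	})
	// Seconds from an error to the next successful plan; MTTR could be
	// derived from this distribution.
	recoverySeconds = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "atlantis_recovery_seconds",
		Help: "Seconds between an error and the next successful plan.",
	})
)

func main() {
	prometheus.MustRegister(plansTotal, appliesTotal, errorsTotal, recoverySeconds)
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":4141", nil) // illustrative port
}
```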
Prometheus please
> Prometheus please
I'd like to focus on providing an endpoint /metrics that provides a JSON response. This allows it to be scraped and mutated by monitoring solutions rather than saying you need to run Prometheus to get metrics out. In the long term, building in specific support for Prometheus, Graphite, StatsD, etc. more natively might be nice, but I think in order to get the most bang for your buck the initial implementation should be inclusive rather than rely on a single common piece of tech. Just my $0.02.
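A minimal sketch of what such an endpoint could look like (assuming a plain net/http server; this is not Atlantis's actual implementation and the counter names are illustrative):

```go
// Sketch of a /metrics endpoint returning a JSON snapshot that any
// monitoring agent can scrape and reshape.
package main

import (
	"encoding/json"
	"net/http"
	"sync/atomic"
)

// Illustrative counters; real code would increment these from the
// plan/apply handlers.
var plans, applies, errCount int64

func metricsHandler(w http.ResponseWriter, r *http.Request) {
	snapshot := map[string]int64{
		"plans":   atomic.LoadInt64(&plans),
		"applies": atomic.LoadInt64(&applies),
		"errors":  atomic.LoadInt64(&errCount),
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(snapshot)
}

func main() {
	http.HandleFunc("/metrics", metricsHandler)
	http.ListenAndServe(":4141", nil)
}
```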
Any progress on this one?
Nope!
Here's a basic RFC for this.
https://docs.google.com/document/d/1GwCvqEzQx0B-tEtq4T4H_LJ_7IddIP_ItmlM1zUTG2I/edit
https://openmetrics.io/ could be an option, although it's still in its infancy
@lkysow How do you think metrics should be collected and exposed? Any preference?
I think we should use an existing library for collecting metrics (Prometheus, OpenMetrics in the future, ...) and not reinvent the wheel. There are hundreds of Prometheus exporters, so you just need a sidecar to expose them in your preferred format or to send them to your metrics store.
> Prometheus please
>
> I'd like to focus on providing an endpoint /metrics that provides a JSON response. This allows it to be scraped and mutated by monitoring solutions rather than saying you need to run Prometheus to get metrics out. In the long term, building in specific support for Prometheus, Graphite, StatsD, etc. more natively might be nice, but I think in order to get the most bang for your buck the initial implementation should be inclusive rather than rely on a single common piece of tech. Just my $0.02.
You could default to exposing as JSON and give the option (URL parameter) to change the format to something else, e.g. Prometheus. Consul and Nomad allow for this.
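A hedged sketch of that idea, defaulting to JSON and switching to the Prometheus text format via a ?format query parameter (all names here are illustrative, not Atlantis internals):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"sync/atomic"
)

var planCount int64 // illustrative counter

func metricsHandler(w http.ResponseWriter, r *http.Request) {
	plans := atomic.LoadInt64(&planCount)
	switch r.URL.Query().Get("format") {
	case "prometheus":
		// Prometheus text exposition format.
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		fmt.Fprintf(w, "# TYPE atlantis_plans_total counter\natlantis_plans_total %d\n", plans)
	default:
		// JSON is the default.
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(map[string]int64{"plans": plans})
	}
}

func main() {
	http.HandleFunc("/metrics", metricsHandler)
	http.ListenAndServe(":4141", nil)
}
```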
@xbglowx How do they do metric collection internally? Have they reimplemented Counters, Gauges, Histograms, etc?
Edit: Ok, they are using https://github.com/armon/go-metrics which could be an option. I am gonna give that a try.
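For reference, wiring in go-metrics might look roughly like this (an untested sketch; the metric names and the in-memory sink choice are assumptions, not a committed design):

```go
package main

import (
	"time"

	metrics "github.com/armon/go-metrics"
)

func main() {
	// Aggregate in memory at 10s intervals, retained for one minute;
	// other sinks (statsd, statsite, ...) could be swapped in here.
	sink := metrics.NewInmemSink(10*time.Second, time.Minute)
	metrics.NewGlobal(metrics.DefaultConfig("atlantis"), sink)

	// Illustrative instrumentation; real code would emit these where
	// plans/applies run and where errors are caught.
	metrics.IncrCounter([]string{"plans"}, 1)
	metrics.IncrCounter([]string{"errors"}, 1)
	metrics.SetGauge([]string{"locks", "held"}, 0)
}
```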
That library only supports Gauges and Counters. And personally, I don't like that it tries to deal with all kinds of sinks; I don't think that logic should be built into Atlantis. Sidecar extractors solve the issue in a much cleaner manner.
@caryyu please use the reactions on the post rather than adding comments.
Datadog integration?
In lieu of metrics support, how are people currently monitoring their atlantis deploy to make sure it's healthy?
Another option could be to log metrics in some structured format like EMF: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Specification.html
At least in AWS this would be easy to parse out into Cloudwatch metrics. Not sure if any other tools have added support for the spec.
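To make that concrete, a hypothetical log line in EMF shape (the namespace, dimension, and metric name are made up for illustration) could be emitted like this:

```go
// Sketch of emitting one metric as an EMF-formatted log line, following the
// structure of the CloudWatch Embedded Metric Format spec linked above.
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

func main() {
	event := map[string]interface{}{
		"_aws": map[string]interface{}{
			"Timestamp": time.Now().UnixMilli(),
			"CloudWatchMetrics": []map[string]interface{}{{
				"Namespace":  "Atlantis",
				"Dimensions": [][]string{{"Repo"}},
				"Metrics":    []map[string]string{{"Name": "PlanErrors", "Unit": "Count"}},
			}},
		},
		"Repo":       "org/infra",
		"PlanErrors": 1,
	}
	line, _ := json.Marshal(event)
	// Written to stdout/logs; CloudWatch can parse this into a metric.
	fmt.Println(string(line))
}
```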
It's going to be tough to get approval to use Atlantis without a prometheus metrics endpoint. I'm wondering how others are monitoring Atlantis uptime?
Absolutely agree that we should add this, it's just not something that IMO ranks as the most important problem for atlantis to solve at this moment. That is not to say my opinion is important :wink:. If you or your org feel it is, this is OSS and someone (props to @psalaberria002 for taking a swing at it) can invest developer time or hire a contractor to build the feature. As Luke said, always vote with :+1:/:-1: on the main comment to show your support/opposition to an issue, as that is what GitHub lets you sort on.
I know everyone (self included) loves data, but in the absence of data I can offer anecdotal advice on real usage. I have run Atlantis at multiple orgs for years and have had 0 problems from an uptime perspective. We ran it on fairly typical EC2 instances (something like a t- or m-class medium/large). If we have a lot going on then we see some elevated CPU (terraform), but every time terraform was the cause and resources are released after terraform finishes executing. I have not observed any memory, file descriptor, or other resource leaks in a number of years. I can't say that about many projects that do offer such metrics :laughing:. Standard resource monitoring and an HTTP health check have so far worked out pretty well for me. I mostly used CloudWatch on the (E|A)LB (which also offloaded TLS) and Sensu for your standard resources (disk, memory, cpu, network, etc), but those could be just about anything. I found it to be much more CPU bound, so if you really wanted to tune it I would stick with a c-class instance instead. I think if I had to pick one metric, I would wish for the longest-running plan, to catch times where we have been rate limited (I am looking at you, GitHub). We do have plans to move our Atlantis instance into k8s next quarter; I will let you know what we end up changing, if anything.
Personal Plea/Rant to the Industry: in my experience there is no single monitoring system that covers everything and does it best. They all have their strengths and weaknesses. Saying someone can't use a solution because it is not supported by a specific monitoring product is ludicrous at an engineering organization. There are always options; while it might not be sexy, running a sidecar for something like Atlantis can work just fine for many use cases. I had to build monitoring for production Docker setups before there were projects like Prometheus, Docker monitoring APIs, docker exec, etc. We always found clever ways to meet the needs of our customers regardless of where their apps are. Eventually the solutions mature over time and we replace the clever as it is no longer needed.
We have metrics support in our fork, however it uses statsd in the form of github.com/lyft/gostats.
Here is the commit: https://github.com/lyft/atlantis/commit/37c200fee9f3dcd30471a99045a87e9ec902b275
If there are enough likes on this, seeing as it's already implemented, I can just upstream it for others to build upon/use. If it helps, I can also put together a tutorial on how to set up statsd with Atlantis. I know people were expressing their desire for Prometheus, but this is already done and used in production, so it could be a starting point at least.
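For anyone unfamiliar with statsd: this is not the gostats code from the commit above, just a rough sketch of the wire format a statsd client emits to a local statsd (or statsd_exporter) sidecar; the address and metric name are assumptions.

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Local statsd/statsd_exporter sidecar listening on the default port.
	conn, err := net.Dial("udp", "127.0.0.1:8125")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// statsd line protocol: <name>:<value>|<type> ("c" = counter).
	fmt.Fprint(conn, "atlantis.plans:1|c")
}
```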
Awesome work. Thanks @nishkrishnan
@nishkrishnan would be great to see your work here as PR ;)
@nishkrishnan any plans to open a PR?
yeah will do, sorry about that i must have missed all this stuff.
https://github.com/runatlantis/atlantis/pull/2147
I have PR open to support Prometheus metrics: https://github.com/runatlantis/atlantis/pull/2204
#2204 is now merged, so I believe this can be closed :) (thanks @yoonsio )
Thanks for the work @yoonsio. However, in 0.19.8, trying to implement
metrics:
prometheus:
endpoint: /metrics
we get an error of
Error: initializing server: parsing /etc/atlantis/repos.yaml file: yaml: unmarshal errors:
line 87: field prometheus not found in type raw.Metrics
which is strange because in #2204 we can clearly see prometheus being added to metrics here
@ekhaydarov can you try 0.19.9?
Also, the whitespace in your yaml sample seems off. Just like `policies` and `repos`, `metrics` is also a root-level key. It should probably be documented as such.
https://github.com/runatlantis/atlantis/blob/d1d1539ced062ccb8dc2a0368b4e4c802b6799b6/server/core/config/raw/global_cfg.go#L14-L19
metrics:
  prometheus:
    endpoint: /metrics