Switch to Prometheus for monitoring and alerting
We currently use Munin for monitoring system metrics, along with Munin's built-in alerting. Munin has very limited functionality and is difficult to extend further. The alerting framework in Munin is very poor.
We should consider migrating to Prometheus and Alertmanager.
This issue is to start a discussion and work out what planning may be required.
I'm pretty sure I looked at this before when you mentioned it, and it's a mind-boggling amount of work that would cover far less than we are currently able to monitor.
Plus the whole thing is written in the abomination that is Go.
Looking at it again, the really big problem, besides the limited set of exporters (i.e. things you can monitor), is that just exporting statistics to the database doesn't get you anything. Unlike Munin, there is no default view of the statistics, and you have to use the web UI to build custom dashboards for everything you want to monitor.
Basically it all sounds incredibly painful to use.
Another problem, if I'm understanding things right, is that each exporter runs as a separate service which means that you have to somehow manage allocating a unique port to each one. That's doable, if annoying, but it will mean us needing to get various places to open up a bunch of new firewall holes.
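To make the port-allocation concern concrete, here is a minimal sketch of keeping one static exporter-to-port table and generating Prometheus file-based service discovery target groups from it. The host names, the `EXPORTER_PORTS` table, the `exporter` label, and the `file_sd_targets` helper are all assumptions for illustration; the port numbers are only examples, not a statement of what we would actually run.

```python
import json

# Hypothetical port table: one fixed port per exporter type, the same on
# every host. The numbers are illustrative only.
EXPORTER_PORTS = {"node": 9100, "apache": 9117, "postgres": 9187}


def file_sd_targets(hosts, exporter_ports):
    """Build one target group per exporter type.

    The result follows the JSON shape used by Prometheus file-based
    service discovery: a list of {"targets": [...], "labels": {...}}.
    """
    groups = []
    for exporter, port in sorted(exporter_ports.items()):
        groups.append({
            "targets": [f"{host}:{port}" for host in hosts],
            "labels": {"exporter": exporter},
        })
    return groups


if __name__ == "__main__":
    # Placeholder host names, not real machines.
    hosts = ["host1.example.org", "host2.example.org"]
    print(json.dumps(file_sd_targets(hosts, EXPORTER_PORTS), indent=2))
```

With a fixed table like this, the firewall request becomes one list of ports per host class rather than an ad hoc hole per service, though it does not remove the need for the holes.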
How about Zabbix maybe? (php+pgsql)
After years of Zabbix I am still unable to like it. I was begging them for years to turn the time-series backend into an abstraction, so that you could use whatever fast and suitable TS system you like (there are plenty, from ClickHouse through Carbon to TSDB and even Prometheus itself), but they are still kind of trying to get it done, and they chose Elastic as the first PoC target, which is a bit disappointing. Without it the TS handling is slow and clunky. (Disclaimer: And yeah, I'm racist and it's in php, eeek.)
I have tried to reply to this ticket meaningfully, but I had to realise I can't really offer The Solution. Most people are using Prometheus, indeed. I fancy modern, horizontally and vertically scalable TS backends (I'm playing with ClickHouse, but even Elasticsearch, as wasteful as it is, is pretty good), but that's mainly for monitoring, not for alerting. I am still hesitating between Nagios and Prometheus for alerting, but I just saw netdata the other day and it's pretty convincing realtime stuff with optional TS backend support and automagic clustering AND alerting. For the last week or so I have been playing with netdata, testing the collection and trying to squeeze it into ClickHouse. I'm still in the middle of it, so I can't suggest anything as of now.
Right now all of our monitoring is very server-oriented. This fits with Munin's model of querying a server every 5 minutes. Whatever we pick should support more service-based metrics that could come from sources like logfiles. We could also use integration with metrics from our cloud providers, like S3 usage, as well as sampling at different rates for metrics that are slow to generate.
I know all of the above is possible with Munin, but it's not a natural fit.
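As a rough illustration of what a logfile-derived, service-level metric could look like, here is a small sketch that counts HTTP status codes in access-log lines and renders them in the Prometheus text exposition format. The log format (a common-log-style line with the status after the quoted request), the metric name `http_requests_total`, and both helper functions are assumptions made for this example only.

```python
import re
from collections import Counter

# Assumes a common-log-style line where the status code follows the quoted
# request, e.g. ... "GET / HTTP/1.1" 200 512
STATUS_RE = re.compile(r'" (\d{3}) ')


def count_statuses(lines):
    """Count occurrences of each HTTP status code, skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def render_metrics(counts, name="http_requests_total"):
    """Render the counts as Prometheus text exposition format."""
    out = [f"# TYPE {name} counter"]
    for status in sorted(counts):
        out.append(f'{name}{{status="{status}"}} {counts[status]}')
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = [
        '192.0.2.1 - - [01/Jan/2020:00:00:00 +0000] "GET / HTTP/1.1" 200 512',
        '192.0.2.2 - - [01/Jan/2020:00:00:01 +0000] "GET /x HTTP/1.1" 404 0',
    ]
    print(render_metrics(count_statuses(sample)), end="")
```

Output in this shape could be served over HTTP by a small service, or written to a file for something like node_exporter's textfile collector to pick up. Which of those fits our setup is an open question.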
Is Prometheus set to replace Munin, and if so, are there plans to provide a public dashboard with similar coverage in Grafana?
It is being evaluated. Decisions on these questions will naturally follow once the evaluation is complete.