Switch to Prometheus for monitoring and alerting

Open Firefishy opened this issue 4 years ago • 9 comments

We currently use Munin for monitoring system metrics, along with Munin's built-in alerting. Munin has very limited functionality and is difficult to extend further. The alerting framework in Munin is very poor.

We should consider migrating to Prometheus and Alertmanager.

This issue is to start a discussion and work out what planning may be required.

Firefishy avatar Feb 04 '20 15:02 Firefishy

I'm pretty sure I looked at this before when you mentioned it, and it's a mind-boggling amount of work that would cover far less than we are currently able to monitor.

tomhughes avatar Feb 04 '20 15:02 tomhughes

Plus the whole thing is written in the abomination that is Go.

tomhughes avatar Feb 04 '20 15:02 tomhughes

Looking at it again, the really big problem, besides the limited set of exporters (i.e. things you can monitor), is that just exporting statistics to the database doesn't get you anything. Unlike Munin, there is no default view of the statistics, and you have to use the web UI to build custom dashboards for everything you want to monitor.

Basically it all sounds incredibly painful to use.

tomhughes avatar Feb 04 '20 15:02 tomhughes

Another problem, if I'm understanding things right, is that each exporter runs as a separate service, which means that you have to somehow manage allocating a unique port to each one. That's doable, if annoying, but it will mean we need to get various places to open up a bunch of new firewall holes.

tomhughes avatar Feb 04 '20 17:02 tomhughes
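
To make the "one service, one port" model concrete, here is a minimal sketch of what an exporter amounts to: a small HTTP server that answers `/metrics` with plain-text samples for Prometheus to scrape. It uses only the Python standard library. The names `render_metrics`, `make_handler`, and `serve` are hypothetical and chosen for illustration; a real deployment would more likely use an existing exporter or a client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples: dict[str, float]) -> str:
    """Render gauges in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(samples.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def make_handler(collect):
    """Build a request handler that serves collect() output on /metrics."""

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            body = render_metrics(collect()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return MetricsHandler


def serve(port: int, collect) -> None:
    """Listen on one TCP port; each exporter process needs its own port."""
    HTTPServer(("", port), make_handler(collect)).serve_forever()
```

Calling something like `serve(9999, lambda: {"demo_queue_length": 3.0})` would expose a single gauge on port 9999 (an arbitrary port picked for the example). Every additional exporter on the same host would need a different port, which is where the firewall concern comes from.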

How about Zabbix maybe? (php+pgsql)

hbogner avatar Feb 10 '20 12:02 hbogner

> How about Zabbix maybe? (php+pgsql)

After years of Zabbix I am still unable to like it. I was begging them for years to turn the time-series backend into an abstraction, so that it could use whatever fast and suitable TS system is available (there are plenty, from clickhouse through carbon to tsdb and even prometheus itself), but they are still kind of trying to get it done and chose elastic as the first PoC target, which is a bit disappointing. Without it the TS handling is slow and clunky. (Disclaimer: And yeah, I'm racist and it's in php, eeek.)

I have tried to reply to this ticket meaningfully, but I had to realise I can't really offer The Solution. Most people are using prometheus, indeed. I fancy modern, horizontally and vertically scalable TS backends (I'm playing with clickhouse, but even elasticsearch, as wasteful as it is, is pretty good), but those are mainly for monitoring, not for alerting. I am still hesitating between nagios and prometheus for alerting, but I just saw netdata the other day and it's pretty convincing realtime stuff with optional TS backend support and automagic clustering AND alerting. For the last week or so I have been playing with netdata, testing the collection and trying to squeeze it into clickhouse. I'm still in the middle of it, so I can't suggest anything as of now.

grinapo avatar Feb 11 '20 22:02 grinapo

Right now all of our monitoring is very server-oriented. This fits with Munin's model of querying a server every 5 minutes. Whatever we pick should have support for more service-based metrics that could come from sources like logfiles. We could also use integration with metrics from our cloud providers, like S3 usage, as well as sampling at different rates for metrics that are slow to generate.

I know all of the above is possible with Munin, but it's not a natural fit.

pnorman avatar Feb 23 '20 04:02 pnorman
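
As a rough illustration of a service-based metric derived from a logfile, the sketch below counts HTTP status codes in access-log lines and renders them as a labelled counter in the Prometheus text exposition format. The log layout (a quoted request followed by a three-digit status), the metric name `demo_http_requests_total`, and both function names are assumptions made for the example, not anything taken from the existing setup.

```python
import re
from collections import Counter

# Assumed layout: ... "GET /path HTTP/1.1" 200 512
STATUS_RE = re.compile(r'" (\d{3}) ')


def count_statuses(lines) -> Counter:
    """Count occurrences of each HTTP status code; skip unparseable lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def render_counter(name: str, counts: Counter) -> str:
    """Render the counts as one counter with a `status` label per code."""
    out = [f"# TYPE {name} counter"]
    for status in sorted(counts):
        out.append(f'{name}{{status="{status}"}} {counts[status]}')
    return "\n".join(out) + "\n"
```

A collector built this way measures the service rather than the host, and it could run on whatever schedule suits the log volume. A production version would also need to handle log rotation so that the counter never goes backwards.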

Is Prometheus set to replace Munin, and if so, are there plans to provide a public dashboard with similar coverage in Grafana?

mmd-osm avatar Sep 26 '20 11:09 mmd-osm

It is being evaluated. Decisions on these questions will naturally follow once the evaluation is complete.

tomhughes avatar Sep 26 '20 11:09 tomhughes