
Scale of consul-alerts

Open ronmb opened this issue 9 years ago • 8 comments

Does anyone have numbers on how many servers and service check alerts consul-alerts can handle? We have an infrastructure with a four-digit number of servers, each running about 8-10 service checks.

I was curious whether anyone has experienced latency in processing and receiving notifications for either a down host or a down service. Currently consul-alerts takes up to half an hour to an hour just to start up. Is this normal?

Does consul-alerts provide any metrics that we can extract to show its performance?

ronmb avatar Apr 28 '15 20:04 ronmb

Haven't tried consul-alerts at this scale before. How many instances of consul-alerts are running? I think the slow-down might be caused by the sheer volume of checks being processed. There are no metrics at the moment, but I'm keen on finding out how to scale it to such a size.

darkcrux avatar May 18 '15 08:05 darkcrux

After messing around with consul-alerts, it seems like https://github.com/AcalephStorage/consul-alerts/blob/master/consul/client.go#L223-L257 would severely limit performance. In a smaller datacenter with a low four-digit number of checks, the loop appears to take a minute. This causes https://github.com/AcalephStorage/consul-alerts/blob/master/check-handler.go#L57-L61 to take longer than expected and generally slows everything down.

macb avatar Oct 14 '15 00:10 macb

@macb , I have a similar situation: I'm running consul-alerts on ~70 servers with ~5 checks each. https://github.com/AcalephStorage/consul-alerts/blob/master/consul/client.go#L223-L257 is taking ~1 minute, so it takes ~7 minutes for https://github.com/AcalephStorage/consul-alerts/blob/master/check-handler.go#L57-L61 to run. Any ideas to improve this? Thanks.

akmalabbasov avatar Nov 30 '15 11:11 akmalabbasov

I am using this for about 100 servers in AWS and have definitely noticed some inefficiencies and high CPU demands. I am being pretty ambitious and have servers with about 20 checks each. I have to use compute-optimized c4 instances to run at this scale, and I assume the strain is going to increase. However, I think this is a great project and I will try to dig into the code and help where I can.

mar-io avatar Dec 24 '15 06:12 mar-io

There are a few factors affecting performance.

  1. consul-alerts depends on the "watch" feature of Consul, which watches the health checks for changes. Any status change triggers Consul to send the entire list of health checks from all nodes, and that full list is what consul-alerts processes (e.g. 70 servers * 5 checks = 350 checks to process every time a change is detected).
  • still thinking of a way to get only the changed health checks instead of the full list
  2. The code processes the checks and sends the notifications sequentially.
  • goroutines might speed things up (a rough sketch of that idea follows below)
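For what it's worth, here is a minimal sketch of the goroutine idea from point 2: fanning the per-check work out to a bounded worker pool instead of handling every check in one sequential loop. The `Check` type and `notify` function below are simplified stand-ins for illustration, not the actual consul-alerts types.

```go
package main

import (
	"fmt"
	"sync"
)

// Check is a minimal stand-in for a Consul health check entry;
// the real consul-alerts type carries more fields.
type Check struct {
	Node, CheckID, Status string
}

// notify is a hypothetical stand-in for whatever notifier
// consul-alerts would invoke for a changed check.
func notify(c Check) {
	fmt.Printf("notify: %s/%s is %s\n", c.Node, c.CheckID, c.Status)
}

// processChecks fans the per-check work out to a bounded pool of
// goroutines instead of handling every check sequentially.
func processChecks(checks []Check, workers int) {
	jobs := make(chan Check)
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for c := range jobs {
				notify(c)
			}
		}()
	}

	for _, c := range checks {
		jobs <- c
	}
	close(jobs)
	wg.Wait()
}

func main() {
	checks := []Check{
		{"node1", "service:web", "critical"},
		{"node2", "service:db", "passing"},
	}
	processChecks(checks, 4) // e.g. 4 concurrent workers
}
```

A bounded pool keeps the number of concurrent notifier calls under control even when a single watch fires with thousands of checks.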

darkcrux avatar Jan 04 '16 09:01 darkcrux

Any updates on this? I have a large deployment that's becoming Consul-aware, and I'd love to use consul-alerts for notifications. Expected stats: ~200 servers, ~700 services, ~3000 total health checks.

ariscn avatar Jan 27 '16 22:01 ariscn

I feel like the real fix for this needs to come from upstream in Consul. There is a ticket, which I can't find right now, to have Consul return only the changed entries in watches. That would be the ideal fix, with perhaps a full comparison run occasionally as a sanity check.

fusiondog avatar Mar 01 '16 21:03 fusiondog

I noticed that consul-alerts takes considerable resources on our server, and Consul itself gets very busy with writes when consul-alerts is running, so I traced it for a minute in our test environment and made these observations:

The main issue seems to be all the writes it produces: it appears to loop over every check and re-write the contents each time, even though those contents likely didn't change:

count  URL prefix
680    PUT /v1/kv/consul-alerts/checks

count  URL prefix
136    PUT /v1/kv/consul-alerts/checks/ecs-1269316829
120    PUT /v1/kv/consul-alerts/checks/ecs-205916921
104    PUT /v1/kv/consul-alerts/checks/ecs-2743417484
104    PUT /v1/kv/consul-alerts/checks/ecs-3237410996
72     PUT /v1/kv/consul-alerts/checks/node1
72     PUT /v1/kv/consul-alerts/checks/node2
72     PUT /v1/kv/consul-alerts/checks/node3

At a minimum, it is already reading the contents of these keys on every loop, so it should know whether the content has changed. Doing these writes every time seems to be the largest source of overhead on Consul.
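A minimal sketch of that idea, using the standard `github.com/hashicorp/consul/api` client rather than the consul-alerts internals: read the key first and only PUT when the value actually differs. The key name and payload here are made up for illustration.

```go
package main

import (
	"bytes"
	"log"

	"github.com/hashicorp/consul/api"
)

// putIfChanged reads the current value of a key and only issues a PUT
// when the new value actually differs, avoiding redundant KV writes.
func putIfChanged(kv *api.KV, key string, value []byte) error {
	existing, _, err := kv.Get(key, nil)
	if err != nil {
		return err
	}
	if existing != nil && bytes.Equal(existing.Value, value) {
		return nil // nothing changed, skip the write
	}
	_, err = kv.Put(&api.KVPair{Key: key, Value: value}, nil)
	return err
}

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	// Hypothetical check-status key under the consul-alerts prefix.
	err = putIfChanged(client.KV(), "consul-alerts/checks/node1", []byte(`{"status":"passing"}`))
	if err != nil {
		log.Fatal(err)
	}
}
```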

Some other observations in this capture:

688 calls to the reminders prefix: GET /v1/kv/consul-alerts/reminders

Given we have nothing under that prefix, we could just do /v1/kv/consul-alerts/reminders?recurse in one shot.
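As a rough illustration of the one-shot approach (again using the standard `github.com/hashicorp/consul/api` client, not the consul-alerts code), a single recursive `List` call replaces the per-key GETs:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// One recursive read (equivalent to GET /v1/kv/consul-alerts/reminders?recurse)
	// instead of a separate GET per key under the prefix.
	pairs, _, err := client.KV().List("consul-alerts/reminders", nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, p := range pairs {
		fmt.Printf("%s = %s\n", p.Key, p.Value)
	}
}
```

The same pattern would presumably apply to the checks and blacklist prefixes mentioned below.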

consul-alerts is doing 1375 calls to the checks prefix: GET /v1/kv/consul-alerts/checks

We seem to do this in more than one place; it could make sense to do this in one shot as well.

We are doing 2674 calls into config: GET /v1/kv/consul-alerts/config/checks/blacklist

Again, one shot would be better, as we have nothing in the blacklist.

A couple of these changes could greatly reduce the overhead of running consul-alerts.

rhuddleston avatar Dec 22 '16 00:12 rhuddleston