
Optimise alert ingest path

Open gouthamve opened this issue 6 years ago • 13 comments

From @brancz:

> While benchmarking & profiling Alertmanager, it quickly became obvious that the ingest path blocks a lot.

Can you share some numbers about lock wait times?

gouthamve avatar Jan 16 '18 15:01 gouthamve

I need to run the benchmarks again for exact numbers. I remember that the channel that alerts are sent to in the dispatcher had the longest wait times. Increasing the channel buffer improved this a bit, but it obviously only pushes the issue further out.
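
As a rough sketch of that pattern (not Alertmanager's actual dispatcher code; the buffer size and timings are made up): a buffered channel between the API handlers and the dispatcher absorbs bursts, but once the consumer falls behind, sends block again no matter how large the buffer is.

```go
package main

import (
	"fmt"
	"time"
)

type alert struct{ name string }

func main() {
	// A larger buffer lets producers run ahead of a slow consumer,
	// but only until the buffer fills; after that, sends block again.
	ingest := make(chan alert, 200) // hypothetical buffer size

	// Slow consumer standing in for the dispatcher's processing loop.
	done := make(chan struct{})
	go func() {
		defer close(done)
		for a := range ingest {
			time.Sleep(time.Millisecond) // simulated per-alert work
			_ = a
		}
	}()

	start := time.Now()
	for i := 0; i < 1000; i++ {
		ingest <- alert{name: fmt.Sprintf("alert-%d", i)} // blocks once the buffer is full
	}
	close(ingest)
	<-done
	fmt.Printf("ingested 1000 alerts in %s\n", time.Since(start))
}
```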

A question we also need to ask ourselves is: what kind of load do we expect Alertmanager to handle? (Regardless, the limit should be resource-bound, not a technical limitation.)

brancz avatar Jan 16 '18 16:01 brancz

I think a million active alerts is a reasonable starting point. See also https://github.com/prometheus/prometheus/issues/2585

brian-brazil avatar Jan 16 '18 16:01 brian-brazil

Hey! I am a graduate student who wants to apply for GSoC this summer. I have some prior Go experience and I am looking for a Go performance-optimization project.

Is the problem here that the dispatcher cannot send out alert messages to all kinds of clients efficiently, or that data cannot be written into the dispatcher efficiently?

starsdeep avatar Mar 02 '18 05:03 starsdeep

It's everything before the dispatcher that needs optimisation.

brian-brazil avatar Mar 02 '18 06:03 brian-brazil

> I think a million active alerts is a reasonable starting point.

What sort of request rate are you expecting for these 1 million alerts? In what little testing I've done, AM had no issue maintaining several thousand active alerts (< 50,000), but it had issues processing incoming alert batches once the rate hit ~30 requests/sec (the issue could occur at a lower rate; this was just the rate at which I noticed AM locking up).

stuartnelson3 avatar Mar 02 '18 10:03 stuartnelson3

With a 10s eval interval that would be 100k alerts/s, which with a batch size of 64 would be ~1.5k requests/s. https://github.com/prometheus/prometheus/issues/2585 can bring that down to ~250/s.
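
For reference, a quick sketch of that arithmetic (the constants come straight from this thread, not from any Alertmanager configuration):

```go
package main

import "fmt"

func main() {
	const (
		activeAlerts = 1_000_000 // target number of active alerts
		evalInterval = 10.0      // seconds between rule evaluations
		batchSize    = 64.0      // alerts per request sent to Alertmanager
	)

	alertsPerSec := activeAlerts / evalInterval // 100,000 alerts/s
	requestsPerSec := alertsPerSec / batchSize  // ~1,562 requests/s

	fmt.Printf("%.0f alerts/s => ~%.0f requests/s\n", alertsPerSec, requestsPerSec)
}
```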

brian-brazil avatar Mar 02 '18 10:03 brian-brazil

Hey, I tried to use Go pprof to profile Alertmanager, but that only seems to help us locate inefficient function implementations. The whole Alertmanager workflow includes deduplicating, grouping, inhibition, and routing; if we want to find which stage is the bottleneck under high concurrency, it seems we need to manually write code to track and time functions?
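
For that manual-timing approach, here is a minimal sketch (the stage names and durations are illustrative placeholders, not Alertmanager's actual functions): wrap each pipeline stage and record how long it takes.

```go
package main

import (
	"log"
	"time"
)

// timeStage runs a pipeline stage and logs how long it took. In a real
// experiment these durations would go into a histogram (e.g. a Prometheus
// summary) rather than the log.
func timeStage(name string, stage func()) {
	start := time.Now()
	stage()
	log.Printf("stage=%s duration=%s", name, time.Since(start))
}

func main() {
	// Hypothetical stand-ins for the stages named above; not Alertmanager code.
	timeStage("dedup", func() { time.Sleep(2 * time.Millisecond) })
	timeStage("group", func() { time.Sleep(5 * time.Millisecond) })
	timeStage("inhibit", func() { time.Sleep(1 * time.Millisecond) })
	timeStage("route", func() { time.Sleep(3 * time.Millisecond) })
}
```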

[screenshot: pprof profile graph]

@brancz Do you have some benchmark code to share?

starsdeep avatar Mar 03 '18 23:03 starsdeep

@starsdeep I had previously built and used this. I remember taking mutex profiles and seeing a lot of lock contention starting at the dispatcher.

brancz avatar Mar 05 '18 09:03 brancz

@brancz I tried to use the ambench code to launch a benchmark. There are some goroutine block events, but no mutex contention events:

[screenshot: profiling output]

PS: I launched the Alertmanager instances using "goreman start" with the "DEBUG=true" environment variable, and ran the load test with ./ambench -alertmanagers=http://localhost:9093,http://localhost:9094,http://localhost:9095
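
One possible reason for an empty mutex profile is that Go only samples mutex contention when runtime.SetMutexProfileFraction is set to a non-zero value (it defaults to 0); block profiling likewise needs runtime.SetBlockProfileRate. A minimal sketch, assuming you can add this to the binary under test (the port here is a placeholder):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
	"runtime"
)

func main() {
	// Report every goroutine blocking event (1 = all events, <= 0 disables).
	runtime.SetBlockProfileRate(1)
	// Report every mutex contention event (0, the default, disables sampling).
	runtime.SetMutexProfileFraction(1)

	// Serve the profiles; Alertmanager would expose them on its own web port.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

The profiles can then be fetched with something like go tool pprof http://localhost:6060/debug/pprof/mutex, or from the corresponding path on the Alertmanager port if its pprof endpoints are enabled there.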

starsdeep avatar Mar 06 '18 21:03 starsdeep

what does the block profile look like? :slightly_smiling_face:

brancz avatar Mar 07 '18 06:03 brancz

@brancz

[screenshots: block profile output]

starsdeep avatar Mar 07 '18 18:03 starsdeep

I would look closer at that 90s in the Dispatcher, which is in the ingest path of alerts being added through the API.

brancz avatar Mar 08 '18 17:03 brancz

> I think a million active alerts is a reasonable starting point.
>
> What sort of request rate are you expecting for these 1 million alerts? In what little testing I've done, AM had no issue maintaining several thousand active alerts (< 50,000), but it had issues processing incoming alert batches once the rate hit ~30 requests/sec (the issue could occur at a lower rate; this was just the rate at which I noticed AM locking up).

@stuartnelson3 Hi, regarding "AM had no issue maintaining several thousand active alerts (< 50,000)": do you have the corresponding benchmark data?

glidea avatar Sep 01 '22 13:09 glidea