Alertmanager pod msg="dropping messages because too many are queued"
Hello,
What did you do? I'm using Alertmanager in a Kubernetes pod; it is connected to Prometheus, Karma and Kthnxbye to acknowledge alerts.
What did you expect to see?
Normal memory usage, as before.
What did you see instead?
Recently, the memory usage of Alertmanager has been increasing linearly.
In the Alertmanager logs, I see this message:
level=warn ts=2020-12-17T09:32:04.281Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4100 limit=4096
The code that emits this message:
// handleQueueDepth ensures that the queue doesn't grow unbounded by pruning
// older messages at regular interval.
func (d *delegate) handleQueueDepth() {
    for {
        select {
        case <-d.stopc:
            return
        case <-time.After(15 * time.Minute):
            n := d.bcast.NumQueued()
            if n > maxQueueSize {
                level.Warn(d.logger).Log("msg", "dropping messages because too many are queued", "current", n, "limit", maxQueueSize)
                d.bcast.Prune(maxQueueSize)
                d.messagesPruned.Add(float64(n - maxQueueSize))
            }
        }
    }
}
Please note that no action on our side explains this increase.
Environment: Alertmanager v0.21.0, Prometheus v2.18.2
It would mean that your instance can't keep up with replicating data to its peers. The alertmanager_cluster_health_score metric tells you about your cluster's health (the lower the better, 0 if everything is fine). You can also look at the alertmanager_cluster_messages_queued and alertmanager_cluster_messages_pruned_total metrics. You may have to tune the --cluster.* CLI flags.
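For reference, a minimal sketch of what tuning those flags could look like on the Alertmanager command line. The --cluster.* flag names below are real Alertmanager options, but the peer hostnames and the interval values are only illustrative placeholders, not recommendations for this specific issue:

    alertmanager \
      --config.file=/etc/alertmanager/alertmanager.yml \
      --cluster.listen-address=0.0.0.0:9094 \
      --cluster.peer=alertmanager-1.example:9094 \
      --cluster.peer=alertmanager-2.example:9094 \
      --cluster.gossip-interval=200ms \
      --cluster.pushpull-interval=1m \
      --cluster.peer-timeout=15s

Increasing --cluster.gossip-interval or --cluster.pushpull-interval trades slower propagation of silences and notification state for less gossip traffic; graphing rate(alertmanager_cluster_messages_pruned_total[5m]) in Prometheus shows whether messages keep being pruned after a change.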
Thank you @simonpasquier. alertmanager_cluster_health_score has been 0 the whole time since the cluster started. On the other hand, the alertmanager_cluster_messages_queued and alertmanager_cluster_messages_pruned_total metrics show a linear increase. Is this normal behavior?
I am noticing the same issue on a single-instance Alertmanager (version 0.23.0):
alertmanager_cluster_alive_messages_total{peer="01FKNQEQMADHPQF9HNAWV169DP"} 1
alertmanager_cluster_enabled 1
alertmanager_cluster_failed_peers 0
alertmanager_cluster_health_score 0
alertmanager_cluster_members 1
alertmanager_cluster_messages_pruned_total 973
alertmanager_cluster_messages_queued 4100
alertmanager_cluster_messages_received_size_total{msg_type="full_state"} 0
alertmanager_cluster_messages_received_size_total{msg_type="update"} 0
alertmanager_cluster_messages_received_total{msg_type="full_state"} 0
alertmanager_cluster_messages_received_total{msg_type="update"} 0
alertmanager_cluster_messages_sent_size_total{msg_type="full_state"} 0
alertmanager_cluster_messages_sent_size_total{msg_type="update"} 0
alertmanager_cluster_messages_sent_total{msg_type="full_state"} 0
alertmanager_cluster_messages_sent_total{msg_type="update"} 0
alertmanager_cluster_peer_info{peer="01FKNQEQMADHPQF9HNAWV169DP"} 1
alertmanager_cluster_peers_joined_total 1
alertmanager_cluster_peers_left_total 0
alertmanager_cluster_peers_update_total 0
alertmanager_cluster_reconnections_failed_total 0
alertmanager_cluster_reconnections_total 0
alertmanager_cluster_refresh_join_failed_total 0
alertmanager_cluster_refresh_join_total 0
Adding --cluster.listen-address= (an empty value) to the command line does work as a workaround.
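For a single-instance deployment, a minimal sketch of that workaround as a full command line (the flag names are real Alertmanager options; the config and storage paths are just placeholders):

    alertmanager \
      --config.file=/etc/alertmanager/alertmanager.yml \
      --storage.path=/alertmanager \
      --cluster.listen-address=

Passing an empty value to --cluster.listen-address disables the gossip/cluster layer entirely, so no messages accumulate in the broadcast queue; this only makes sense when you do not need Alertmanager high availability.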
Has any fix been provided for this? @simonpasquier