
Alertmanager pod msg="dropping messages because too many are queued"

Open nmizeb opened this issue 3 years ago • 4 comments

hello,

What did you do? I'm running Alertmanager in a Kubernetes pod; it's connected to Prometheus, Karma and Kthnxbye to ack alerts.

What did you expect to see?

normal memory usage as before

What did you see instead?

Recently, the memory usage of Alertmanager has been increasing linearly. In the Alertmanager logs I see this message:

level=warn ts=2020-12-17T09:32:04.281Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4100 limit=4096

The code that emits this message:

// handleQueueDepth ensures that the queue doesn't grow unbounded by pruning
// older messages at regular interval.
func (d *delegate) handleQueueDepth() {
	for {
		select {
		case <-d.stopc:
			return
		case <-time.After(15 * time.Minute):
			n := d.bcast.NumQueued()
			if n > maxQueueSize {
				level.Warn(d.logger).Log("msg", "dropping messages because too many are queued", "current", n, "limit", maxQueueSize)
				d.bcast.Prune(maxQueueSize)
				d.messagesPruned.Add(float64(n - maxQueueSize))
			}
		}
	}
}
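
Based on the pruning logic above (maxQueueSize = 4096, checked every 15 minutes), a Prometheus alerting rule on alertmanager_cluster_messages_queued can warn before messages start being dropped. A minimal sketch, assuming a standard Prometheus rule file; the rule name and threshold are illustrative, not taken from this issue:

groups:
  - name: alertmanager-cluster
    rules:
      # Hypothetical rule: fire when the gossip queue stays close to the
      # 4096 limit enforced by handleQueueDepth above.
      - alert: AlertmanagerClusterQueueNearLimit
        expr: alertmanager_cluster_messages_queued > 3500
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Alertmanager gossip message queue is close to the 4096 pruning limit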

Please note that no change was made on our side that would explain this increase.

Environment: Alertmanager v0.21.0, Prometheus v2.18.2

nmizeb avatar Dec 23 '20 11:12 nmizeb

It would mean that your instance can't keep up with replicating data to its peers. The alertmanager_cluster_health_score metric would tell you about your cluster's health (the lower the better; 0 if everything's fine). You can look at the alertmanager_cluster_messages_queued and alertmanager_cluster_messages_pruned_total metrics too. You may have to tune the --cluster.* CLI flags.
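
A minimal sketch of where those flags would go in a Kubernetes setup, assuming an Alertmanager StatefulSet with two replicas; service names, ports and values below are illustrative, not taken from this issue:

# Hypothetical excerpt of the Alertmanager container args in a StatefulSet.
args:
  - --config.file=/etc/alertmanager/alertmanager.yml
  - --cluster.listen-address=0.0.0.0:9094
  - --cluster.peer=alertmanager-0.alertmanager:9094
  - --cluster.peer=alertmanager-1.alertmanager:9094
  # Examples of the --cluster.* gossip knobs referred to above; adjust
  # these if peers cannot keep up with state replication.
  - --cluster.gossip-interval=200ms
  - --cluster.pushpull-interval=1m0s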

simonpasquier avatar Dec 23 '20 14:12 simonpasquier

Thank you @simonpasquier. The alertmanager_cluster_health_score value has been 0 the whole time since the cluster started. On the other hand, the alertmanager_cluster_messages_queued and alertmanager_cluster_messages_pruned_total metrics show a linear increase; is this normal behavior?

nmizeb avatar Jan 05 '21 12:01 nmizeb

I am noticing the same issue on a single-instance Alertmanager (version 0.23.0):

alertmanager_cluster_alive_messages_total{peer="01FKNQEQMADHPQF9HNAWV169DP"} 1
alertmanager_cluster_enabled 1
alertmanager_cluster_failed_peers 0
alertmanager_cluster_health_score 0
alertmanager_cluster_members 1
alertmanager_cluster_messages_pruned_total 973
alertmanager_cluster_messages_queued 4100
alertmanager_cluster_messages_received_size_total{msg_type="full_state"} 0
alertmanager_cluster_messages_received_size_total{msg_type="update"} 0
alertmanager_cluster_messages_received_total{msg_type="full_state"} 0
alertmanager_cluster_messages_received_total{msg_type="update"} 0
alertmanager_cluster_messages_sent_size_total{msg_type="full_state"} 0
alertmanager_cluster_messages_sent_size_total{msg_type="update"} 0
alertmanager_cluster_messages_sent_total{msg_type="full_state"} 0
alertmanager_cluster_messages_sent_total{msg_type="update"} 0
alertmanager_cluster_peer_info{peer="01FKNQEQMADHPQF9HNAWV169DP"} 1
alertmanager_cluster_peers_joined_total 1
alertmanager_cluster_peers_left_total 0
alertmanager_cluster_peers_update_total 0
alertmanager_cluster_reconnections_failed_total 0
alertmanager_cluster_reconnections_total 0
alertmanager_cluster_refresh_join_failed_total 0
alertmanager_cluster_refresh_join_total 0

Adding --cluster.listen-address= (an empty value) to the command line does work as a workaround; it disables clustering entirely.
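
For a single-instance deployment that looks something like the sketch below; the config path, storage path and image tag are illustrative, not taken from this issue:

# Hypothetical container args for a standalone Alertmanager; an empty
# --cluster.listen-address disables clustering, and with it the gossip
# queue that keeps filling up.
args:
  - --config.file=/etc/alertmanager/alertmanager.yml
  - --storage.path=/alertmanager
  - --cluster.listen-address=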

baryluk avatar Nov 23 '21 15:11 baryluk

Is there any fix for this? @simonpasquier

KeyanatGiggso avatar Jul 07 '22 05:07 KeyanatGiggso