AM: crash on handleOverSizedMessages
Describe the bug
Alertmanager pods keep getting restarted.
Full logs: https://gist.github.com/shuker85/b2c6eb98174ab56bb247bc757b0370c4
To Reproduce
Steps to reproduce the behavior:
- Start Cortex 1.8.0
- Perform operations (read/write/others)
Expected behavior
No restarts
Environment:
- Infrastructure: Kubernetes v1.20
- Deployment tool: kustomize
Storage Engine
- [ ] Blocks
- [x] Chunks
Additional Context
Pod config:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "80"
    prometheus.io/scrape: "true"
  creationTimestamp: "2022-01-25T13:46:34Z"
  generateName: alert-manager-
  labels:
    controller-revision-hash: alert-manager-89856fd9d
    name: alert-manager
    statefulset.kubernetes.io/pod-name: alert-manager-0
  name: alert-manager-0
  namespace: cortex
spec:
  containers:
  - args:
    - -alertmanager-storage.gcs.service-account=/var/secrets/google/credentials.json
    - -alertmanager.cluster.gossip-interval=500ms
    - -alertmanager.cluster.listen-address=0.0.0.0:9094
    - -alertmanager.cluster.peer-timeout=5s
    - -alertmanager.cluster.peers=alert-manager-0.alert-manager-headless:9094
    - -alertmanager.cluster.peers=alert-manager-1.alert-manager-headless:9094
    - -alertmanager.cluster.peers=alert-manager-2.alert-manager-headless:9094
    - -alertmanager.cluster.push-pull-interval=5s
    - -alertmanager.sharding-enabled=true
    - -alertmanager.sharding-ring.consul.hostname=consul:8500
    - -alertmanager.storage.gcs.bucketname=xxxyyyzzz
    - -alertmanager.storage.type=gcs
    - -alertmanager.web.external-url=/api/prom/alertmanager
    - -experimental.alertmanager.enable-api=true
    - -log.level=warn
    - -server.http-listen-port=80
    - -target=alertmanager
    - -tenant-federation.enabled=true
    env:
    - name: GOOGLE_APPLICATION_CREDENTIALS
      value: /var/secrets/google/credentials.json
    image: quay.io/cortexproject/cortex:v1.8.0
    imagePullPolicy: IfNotPresent
    name: alert-manager
    ports:
    - containerPort: 80
      protocol: TCP
    readinessProbe:
      failureThreshold: 1
      httpGet:
        path: /ready
        port: 80
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 2
      timeoutSeconds: 3
    resources:
      limits:
        memory: 3Gi
      requests:
        cpu: 100m
        memory: 3Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/secrets/google
      name: service-account-secret
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-jl2rf
      readOnly: true
  dnsConfig:
    options:
    - name: ndots
      value: "1"
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: alert-manager-0
  nodeName: eur-standard-5ah5-c535d04c-p4c4
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  subdomain: alert-manager-headless
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: service-account-secret
    secret:
      defaultMode: 420
      optional: false
      secretName: service-account-secret
  - name: default-token-jl2rf
    secret:
      defaultMode: 420
      secretName: default-token-jl2rf
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.
Revive pls
From the logs I cannot see exactly what caused the panic. Can you add the log line that shows what triggered it?
There are ~37k lines in the log, and I'm not sure it captures the beginning of the issue. Here is a fragment of the goroutine dump:
created by github.com/prometheus/alertmanager/cluster.NewChannel
/__w/cortex/cortex/vendor/github.com/prometheus/alertmanager/cluster/channel.go:92 +0x7d8
goroutine 17559 [select]:
github.com/prometheus/alertmanager/silence.(*Silences).Maintenance(0xc02c48a540, 0xd18c2e2800, 0xc02b506060, 0x19, 0xc0287acf00)
/__w/cortex/cortex/vendor/github.com/prometheus/alertmanager/silence/silence.go:374 +0x150
github.com/cortexproject/cortex/pkg/alertmanager.New.func1(0xc0263efe00, 0xc0286c7720, 0xc0287c1580, 0x14)
/__w/cortex/cortex/pkg/alertmanager/alertmanager.go:152 +0xb7
created by github.com/cortexproject/cortex/pkg/alertmanager.New
/__w/cortex/cortex/pkg/alertmanager/alertmanager.go:151 +0x9d3
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.
Still happening very frequently. Any workarounds?
We were able to find the first log lines leading up to the crash:
level=warn ts=2022-11-23T07:55:30.808902225Z caller=delegate.go:218 component=cluster received="unknown state key" len=73972 key=sil:MDBA0002299
level=warn ts=2022-11-23T07:55:30.808909008Z caller=delegate.go:218 component=cluster received="unknown state key" len=73972 key=sil:MDBA0001897
level=warn ts=2022-11-23T07:55:30.808918306Z caller=delegate.go:218 component=cluster received="unknown state key" len=73972 key=sil:MDBA0003337
level=warn ts=2022-11-23T07:55:30.808928138Z caller=delegate.go:218 component=cluster received="unknown state key" len=73972 key=nfl:MDBA0002024
level=warn ts=2022-11-23T07:55:30.808934729Z caller=delegate.go:218 component=cluster received="unknown state key" len=73972 key=sil:MDBA0003438
level=warn ts=2022-11-23T07:55:30.80894122Z caller=delegate.go:218 component=cluster received="unknown state key" len=73972 key=sil:MDBA0001870
fatal error: concurrent map read and map write
goroutine 8713 [running]:
runtime.throw(0x2640e9e, 0x21)
/usr/local/go/src/runtime/panic.go:1116 +0x72 fp=0xc012566a28 sp=0xc0125669f8 pc=0x435072
runtime.mapaccess2_faststr(0x21fc420, 0xc000a51170, 0xc00dda0730, 0xc, 0x0, 0x0)
/usr/local/go/src/runtime/map_faststr.go:116 +0x47c fp=0xc012566a98 sp=0xc012566a28 pc=0x4132fc
github.com/prometheus/alertmanager/cluster.(*delegate).MergeRemoteState(0xc00012eea0, 0xc012a3a000, 0x120f4, 0x120f4, 0x0)
/__w/cortex/cortex/vendor/github.com/prometheus/alertmanager/cluster/delegate.go:216 +0x2df fp=0xc012566c30 sp=0xc012566a98 pc=0xdbff1f
github.com/hashicorp/memberlist.(*Memberlist).mergeRemoteState(0xc0002ccdc0, 0x2cce400, 0xc0096c7180, 0x4, 0x4, 0xc012a3a000, 0x120f4, 0x120f4, 0xc012a3a000, 0x120f4)
/__w/cortex/cortex/vendor/github.com/hashicorp/memberlist/net.go:1174 +0x3dd fp=0xc012566d68 sp=0xc012566c30 pc=0xd7e5dd
github.com/hashicorp/memberlist.(*Memberlist).handleConn(0xc0002ccdc0, 0x2cce460, 0xc0064c4b70)
/__w/cortex/cortex/vendor/github.com/hashicorp/memberlist/net.go:277 +0xc6e fp=0xc012566fc8 sp=0xc012566d68 pc=0xd76aee
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc012566fd0 sp=0xc012566fc8 pc=0x467f31
created by github.com/hashicorp/memberlist.(*Memberlist).streamListen
/__w/cortex/cortex/vendor/github.com/hashicorp/memberlist/net.go:213 +0x6a
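For context, the fatal error above is Go's built-in detection of unsynchronized map access. The following standalone program (not Alertmanager code, just a minimal reproduction of the same failure class) shows how an unguarded map shared between goroutines produces the identical crash:

package main

// Standalone reproduction of the failure class above (not Alertmanager
// code): the Go runtime detects unsynchronized access to a plain map and
// aborts, typically with "fatal error: concurrent map read and map write".
func main() {
	states := map[string]int{"sil:demo": 0}

	// Writer goroutine mutates the map with no locking.
	go func() {
		for i := 0; ; i++ {
			states["sil:demo"] = i
		}
	}()

	// Reader in the main goroutine races with the writer; the runtime
	// throws instead of returning corrupted data.
	for {
		_ = states["sil:demo"]
	}
}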
Looking at the code, this relates to https://github.com/prometheus/alertmanager/blob/4c6c03ebfe21009c546e4d1e9b92c371d67c021d/cluster/cluster.go#L216, which has not changed between recent Alertmanager releases (v0.21 vs v0.24):
ml, err := memberlist.Create(cfg)
if err != nil {
	return nil, errors.Wrap(err, "create memberlist")
}
p.mlist = ml
return p, nil
I will try to take a look into this next week :)
This should already be fixed by upstream Alertmanager: https://github.com/prometheus/alertmanager/pull/2543. Cortex also includes this fix, since we now vendor Alertmanager v0.24. If you are still on the old Cortex version v1.8.0, please upgrade to a newer version and try again.
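The usual remedy for this class of race is to serialize access to the shared states map behind a mutex. The sketch below illustrates that pattern with a hypothetical, stripped-down delegate type and field names; it is not the actual diff from prometheus/alertmanager#2543, whose exact change may differ.

package cluster

import "sync"

// State is a placeholder for the per-key merge target behind a state key
// such as "sil:..." or "nfl:..." in the warnings above.
type State interface {
	Merge(b []byte) error
}

// delegate is a hypothetical, stripped-down stand-in for the real cluster
// delegate; only the map-guarding pattern matters here.
type delegate struct {
	mtx    sync.RWMutex
	states map[string]State
}

// AddState registers a new state key under the write lock, so registration
// can no longer race with readers of the map.
func (d *delegate) AddState(key string, s State) {
	d.mtx.Lock()
	defer d.mtx.Unlock()
	d.states[key] = s
}

// mergeState looks up a key under the read lock before merging; this lookup
// is the kind of access that the unguarded version raced on.
func (d *delegate) mergeState(key string, b []byte) error {
	d.mtx.RLock()
	s, ok := d.states[key]
	d.mtx.RUnlock()
	if !ok {
		// Corresponds to the "unknown state key" warnings in the log above.
		return nil
	}
	return s.Merge(b)
}

Either serializing the map accesses like this or registering every state key before the peer joins the gossip cluster would remove the unguarded concurrent access that the Go runtime was flagging.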
Ok, let's close this for now. Will plan an upgrade to Cortex 1.13.x.