AM: crash on handleOverSizedMessages
Describe the bug
Alertmanager pods keep getting restarted.
Full logs: https://gist.github.com/shuker85/b2c6eb98174ab56bb247bc757b0370c4
To Reproduce
Steps to reproduce the behavior:
- Start Cortex 1.8.0
- Perform operations (read/write/others)
Expected behavior
No restarts
Environment:
- Infrastructure: Kubernetes v1.20
- Deployment tool: kustomize
Storage Engine
- [ ] Blocks
- [x] Chunks
Additional Context
Pod config:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "80"
    prometheus.io/scrape: "true"
  creationTimestamp: "2022-01-25T13:46:34Z"
  generateName: alert-manager-
  labels:
    controller-revision-hash: alert-manager-89856fd9d
    name: alert-manager
    statefulset.kubernetes.io/pod-name: alert-manager-0
  name: alert-manager-0
  namespace: cortex
spec:
  containers:
  - args:
    - -alertmanager-storage.gcs.service-account=/var/secrets/google/credentials.json
    - -alertmanager.cluster.gossip-interval=500ms
    - -alertmanager.cluster.listen-address=0.0.0.0:9094
    - -alertmanager.cluster.peer-timeout=5s
    - -alertmanager.cluster.peers=alert-manager-0.alert-manager-headless:9094
    - -alertmanager.cluster.peers=alert-manager-1.alert-manager-headless:9094
    - -alertmanager.cluster.peers=alert-manager-2.alert-manager-headless:9094
    - -alertmanager.cluster.push-pull-interval=5s
    - -alertmanager.sharding-enabled=true
    - -alertmanager.sharding-ring.consul.hostname=consul:8500
    - -alertmanager.storage.gcs.bucketname=xxxyyyzzz
    - -alertmanager.storage.type=gcs
    - -alertmanager.web.external-url=/api/prom/alertmanager
    - -experimental.alertmanager.enable-api=true
    - -log.level=warn
    - -server.http-listen-port=80
    - -target=alertmanager
    - -tenant-federation.enabled=true
    env:
    - name: GOOGLE_APPLICATION_CREDENTIALS
      value: /var/secrets/google/credentials.json
    image: quay.io/cortexproject/cortex:v1.8.0
    imagePullPolicy: IfNotPresent
    name: alert-manager
    ports:
    - containerPort: 80
      protocol: TCP
    readinessProbe:
      failureThreshold: 1
      httpGet:
        path: /ready
        port: 80
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 2
      timeoutSeconds: 3
    resources:
      limits:
        memory: 3Gi
      requests:
        cpu: 100m
        memory: 3Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/secrets/google
      name: service-account-secret
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-jl2rf
      readOnly: true
  dnsConfig:
    options:
    - name: ndots
      value: "1"
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: alert-manager-0
  nodeName: eur-standard-5ah5-c535d04c-p4c4
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  subdomain: alert-manager-headless
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: service-account-secret
    secret:
      defaultMode: 420
      optional: false
      secretName: service-account-secret
  - name: default-token-jl2rf
    secret:
      defaultMode: 420
      secretName: default-token-jl2rf
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.
Revive pls
From the logs I cannot see exactly what caused the panic. Can you add the log line that shows what triggered it?
There are ~37k lines in the log, and I'm not sure it captures the beginning of the issue. Here is a fragment of the goroutine dump:
created by github.com/prometheus/alertmanager/cluster.NewChannel
/__w/cortex/cortex/vendor/github.com/prometheus/alertmanager/cluster/channel.go:92 +0x7d8
goroutine 17559 [select]:
github.com/prometheus/alertmanager/silence.(*Silences).Maintenance(0xc02c48a540, 0xd18c2e2800, 0xc02b506060, 0x19, 0xc0287acf00)
/__w/cortex/cortex/vendor/github.com/prometheus/alertmanager/silence/silence.go:374 +0x150
github.com/cortexproject/cortex/pkg/alertmanager.New.func1(0xc0263efe00, 0xc0286c7720, 0xc0287c1580, 0x14)
/__w/cortex/cortex/pkg/alertmanager/alertmanager.go:152 +0xb7
created by github.com/cortexproject/cortex/pkg/alertmanager.New
/__w/cortex/cortex/pkg/alertmanager/alertmanager.go:151 +0x9d3
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.
Still happening very frequently. Any workarounds?
We were able to find the first log lines leading up to the crash:
level=warn ts=2022-11-23T07:55:30.808902225Z caller=delegate.go:218 component=cluster received="unknown state key" len=73972 key=sil:MDBA0002299
level=warn ts=2022-11-23T07:55:30.808909008Z caller=delegate.go:218 component=cluster received="unknown state key" len=73972 key=sil:MDBA0001897
level=warn ts=2022-11-23T07:55:30.808918306Z caller=delegate.go:218 component=cluster received="unknown state key" len=73972 key=sil:MDBA0003337
level=warn ts=2022-11-23T07:55:30.808928138Z caller=delegate.go:218 component=cluster received="unknown state key" len=73972 key=nfl:MDBA0002024
level=warn ts=2022-11-23T07:55:30.808934729Z caller=delegate.go:218 component=cluster received="unknown state key" len=73972 key=sil:MDBA0003438
level=warn ts=2022-11-23T07:55:30.80894122Z caller=delegate.go:218 component=cluster received="unknown state key" len=73972 key=sil:MDBA0001870
fatal error: concurrent map read and map write
goroutine 8713 [running]:
runtime.throw(0x2640e9e, 0x21)
/usr/local/go/src/runtime/panic.go:1116 +0x72 fp=0xc012566a28 sp=0xc0125669f8 pc=0x435072
runtime.mapaccess2_faststr(0x21fc420, 0xc000a51170, 0xc00dda0730, 0xc, 0x0, 0x0)
/usr/local/go/src/runtime/map_faststr.go:116 +0x47c fp=0xc012566a98 sp=0xc012566a28 pc=0x4132fc
github.com/prometheus/alertmanager/cluster.(*delegate).MergeRemoteState(0xc00012eea0, 0xc012a3a000, 0x120f4, 0x120f4, 0x0)
/__w/cortex/cortex/vendor/github.com/prometheus/alertmanager/cluster/delegate.go:216 +0x2df fp=0xc012566c30 sp=0xc012566a98 pc=0xdbff1f
github.com/hashicorp/memberlist.(*Memberlist).mergeRemoteState(0xc0002ccdc0, 0x2cce400, 0xc0096c7180, 0x4, 0x4, 0xc012a3a000, 0x120f4, 0x120f4, 0xc012a3a000, 0x120f4)
/__w/cortex/cortex/vendor/github.com/hashicorp/memberlist/net.go:1174 +0x3dd fp=0xc012566d68 sp=0xc012566c30 pc=0xd7e5dd
github.com/hashicorp/memberlist.(*Memberlist).handleConn(0xc0002ccdc0, 0x2cce460, 0xc0064c4b70)
/__w/cortex/cortex/vendor/github.com/hashicorp/memberlist/net.go:277 +0xc6e fp=0xc012566fc8 sp=0xc012566d68 pc=0xd76aee
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc012566fd0 sp=0xc012566fc8 pc=0x467f31
created by github.com/hashicorp/memberlist.(*Memberlist).streamListen
/__w/cortex/cortex/vendor/github.com/hashicorp/memberlist/net.go:213 +0x6a
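For context, the fatal error above is Go's built-in detection of unsynchronized map access. The following standalone program (not Alertmanager code, just a minimal reproduction of the same failure class) shows how an unguarded map shared between goroutines produces the identical crash:

package main

// Standalone reproduction of the failure class above (not Alertmanager
// code): the Go runtime detects unsynchronized access to a plain map and
// aborts, typically with "fatal error: concurrent map read and map write".
func main() {
	states := map[string]int{"sil:demo": 0}

	// Writer goroutine mutates the map with no locking.
	go func() {
		for i := 0; ; i++ {
			states["sil:demo"] = i
		}
	}()

	// Reader in the main goroutine races with the writer; the runtime
	// throws instead of returning corrupted data.
	for {
		_ = states["sil:demo"]
	}
}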
Looking at the code, this relates to https://github.com/prometheus/alertmanager/blob/4c6c03ebfe21009c546e4d1e9b92c371d67c021d/cluster/cluster.go#L216, which has not changed between recent Alertmanager releases (v0.21 vs v0.24):
ml, err := memberlist.Create(cfg)
if err != nil {
	return nil, errors.Wrap(err, "create memberlist")
}
p.mlist = ml
return p, nil
I will try to take a look into this next week :)
This should already be fixed by upstream Alertmanager: https://github.com/prometheus/alertmanager/pull/2543. Cortex also includes this fix, since we now vendor Alertmanager v0.24. If you are still on the old Cortex version v1.8.0, please upgrade to a newer version and try again.
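The usual remedy for this class of race is to serialize access to the shared states map behind a mutex. The sketch below illustrates that pattern with a hypothetical, stripped-down delegate type and field names; it is not the actual diff from prometheus/alertmanager#2543, whose exact change may differ.

package cluster

import "sync"

// State is a placeholder for the per-key merge target behind a state key
// such as "sil:..." or "nfl:..." in the warnings above.
type State interface {
	Merge(b []byte) error
}

// delegate is a hypothetical, stripped-down stand-in for the real cluster
// delegate; only the map-guarding pattern matters here.
type delegate struct {
	mtx    sync.RWMutex
	states map[string]State
}

// AddState registers a new state key under the write lock, so registration
// can no longer race with readers of the map.
func (d *delegate) AddState(key string, s State) {
	d.mtx.Lock()
	defer d.mtx.Unlock()
	d.states[key] = s
}

// mergeState looks up a key under the read lock before merging; this lookup
// is the kind of access that the unguarded version raced on.
func (d *delegate) mergeState(key string, b []byte) error {
	d.mtx.RLock()
	s, ok := d.states[key]
	d.mtx.RUnlock()
	if !ok {
		// Corresponds to the "unknown state key" warnings in the log above.
		return nil
	}
	return s.Merge(b)
}

Either serializing the map accesses like this or registering every state key before the peer joins the gossip cluster would remove the unguarded concurrent access that the Go runtime was flagging.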
Ok, let's close this for now. Will plan an upgrade to Cortex 1.13.x.