cortex-helm-chart
change alertmanager-svc-headless from http to grpc port + expose 9094 TCP and UDP
What this PR does: This PR corrects the alertmanager-headless service by exposing the grpc port instead of the http port:
cortex-alertmanager-headless ClusterIP None <none> 9095/TCP 1d
cortex-distributor-headless ClusterIP None <none> 9095/TCP 1d
cortex-ingester-headless ClusterIP None <none> 9095/TCP 1d
cortex-query-frontend-headless ClusterIP None <none> 9095/TCP 1d
cortex-store-gateway-headless ClusterIP None <none> 9095/TCP 1d
This PR also exposes port 9094 TCP and UDP for the gossip cluster in the alertmanager statefulset.
# k describe pods/cortex-base-alertmanager-0 -n cortex-base
alertmanager:
Ports: 8080/TCP, 7946/TCP, 9095/TCP, 9094/TCP, 9094/UDP
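For context, a minimal sketch of the intended port layout after this change; the metadata and port names are illustrative assumptions, not the chart's actual templates:

# headless Service: expose grpc instead of http
apiVersion: v1
kind: Service
metadata:
  name: cortex-alertmanager-headless
spec:
  clusterIP: None
  ports:
    - name: grpc
      port: 9095
      targetPort: grpc
      protocol: TCP

# alertmanager StatefulSet container ports (excerpt)
ports:
  - name: http-metrics
    containerPort: 8080
    protocol: TCP
  - name: gossip-ring          # memberlist
    containerPort: 7946
    protocol: TCP
  - name: grpc
    containerPort: 9095
    protocol: TCP
  - name: gossip-cluster-tcp   # alertmanager HA cluster gossip, added by this PR
    containerPort: 9094
    protocol: TCP
  - name: gossip-cluster-udp   # added by this PR
    containerPort: 9094
    protocol: UDP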
Which issue(s) this PR fixes: This PR does not fix an issue BUT is related to the conversation in #420:
#2 - why is there no grpc port exposed? For example, ingesters do have grpc exposed.
Just missing. No real reason I guess. See explanation below.
Checklist
- [x] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
Hello, it would be great if we could add some logic for populating the peers list. I think it would look something like this (not tested):
args:
  - "-target=alertmanager"
  - "-config.file=/etc/cortex/cortex.yaml"
  {{- if gt (int .Values.alertmanager.replicas) 1 }}
  {{- $fullName := include "cortex.alertmanagerFullname" . }}
  {{- range $i := until (int .Values.alertmanager.replicas) }}
  - --cluster.peer={{ $fullName }}-{{ $i }}.{{ $fullName }}-headless:9094
  {{- end }}{{/* closes the range */}}
  {{- end }}{{/* closes the if */}}
This looks to be the way that the prometheus-community alertmanager is handling peers. Source
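For illustration, with alertmanager.replicas: 2 and a release named cortex (an assumption; cortex.alertmanagerFullname would then render as cortex-alertmanager), the loop above would render roughly as:

args:
  - "-target=alertmanager"
  - "-config.file=/etc/cortex/cortex.yaml"
  - --cluster.peer=cortex-alertmanager-0.cortex-alertmanager-headless:9094
  - --cluster.peer=cortex-alertmanager-1.cortex-alertmanager-headless:9094

Note that --cluster.peer mirrors the upstream Alertmanager flag; the Cortex equivalent used later in this thread is -alertmanager.cluster.peers with a comma-separated list.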
Is seeding the cluster peers necessary when we have memberlist enabled?
I believe gossip memberlist and gossip cluster are two separate things here. Please correct me if I am wrong, but I asked in the Slack channel about that specific question: https://cloud-native.slack.com/archives/CCYDASBLP/p1676555646043829
Pretty sure you are correct with this assessment. That's probably also why it's giving us so much grief.
I redeployed my alertmanagers by removing the peers and the result is as follows:
# values.yaml shipped to helm
config:
  alertmanager:
    cluster:
      listen_address: '0.0.0.0:9094'
alertmanager:
  replicas: 2
# /multitenant_alertmanager/status
Members
Name | Addr
-- | --
01GVK00A2JF0H9DK4FYXKTCGXR | 172.17.0.13
# k get pods
cortex-alertmanager-0 2/2 Running 0 6m35s 172.17.0.18 minikube <none> <none>
cortex-alertmanager-1 2/2 Running 0 6m47s 172.17.0.13 minikube <none> <none>
So we indeed need to seed peers (too many "ee" in that phrase), and the two available options, sketched below, are:
- an array of alertmanager pod addresses (the Prometheus recommendation)
- the alertmanager-headless service
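As a rough illustration of both options as Helm values, assuming a release named cortex and that Cortex's alertmanager cluster config accepts a peers key mirroring the -alertmanager.cluster.peers flag (addresses are hypothetical):

# Option 1: explicit per-pod addresses via the headless service DNS names
config:
  alertmanager:
    cluster:
      listen_address: '0.0.0.0:9094'
      peers: cortex-alertmanager-0.cortex-alertmanager-headless:9094,cortex-alertmanager-1.cortex-alertmanager-headless:9094

# Option 2: a single entry pointing at the headless service, relying on DNS to return the pod IPs
config:
  alertmanager:
    cluster:
      listen_address: '0.0.0.0:9094'
      peers: cortex-alertmanager-headless:9094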
To reply to @dpericaxon's comment:
Hello, it would be great if we could add some logic for populating the peers list. I think it would look something like this (not tested)
Isn't this a values.yaml templating problem rather than a pod/args templating problem, since we are passing the peers to the alertmanager via /etc/cortex/cortex.yaml?
Args:
-target=alertmanager
-config.file=/etc/cortex/cortex.yaml
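For comparison, a hedged sketch of how the same peers could end up in the rendered /etc/cortex/cortex.yaml instead of in the pod args; the DNS names are hypothetical, and the YAML key is assumed to mirror the -alertmanager.cluster.peers flag used later in this thread:

alertmanager:
  cluster:
    peers: cortex-alertmanager-0.cortex-alertmanager-headless:9094,cortex-alertmanager-1.cortex-alertmanager-headless:9094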
Where are we with this?
I've consolidated the proposed changes here #460
Feel free to try it out and let me know if that works.
I am planning to come back to this issue next week; I will try #460 and keep you posted.
#460 has been implemented with a comma-separated list in https://github.com/cortexproject/cortex-helm-chart/pull/435/commits/1d830da0f1d2e96d6074f9b8d07ca051b886dc08
@nschad / @kd7lxl - I thought I was mostly done here, but I noticed a very strange behavior:
each replica of my alertmanager cluster logs debug messages about not finding an old replica in cluster.go:
# k describe pods/cortex-base-alertmanager-0 -n cortex-base
level=debug ts=2023-07-03T16:13:40.509653841Z caller=cluster.go:441 component=cluster msg=reconnect result=failure peer=01H4E5FTYCW0FVZW78X181J77S addr=10.244.1.4:9094 err="1 error occurred:\n\t* Failed to join 10.244.1.4: dial tcp 10.244.1.4:9094: connect: no route to host\n\n"
level=debug ts=2023-07-03T16:15:20.50957654Z caller=cluster.go:339 component=cluster memberlist="2023/07/03 16:15:20 [DEBUG] memberlist: Failed to join 10.244.1.4: dial tcp 10.244.1.4:9094: connect: no route to host\n"
but my current peers are as follows:
# k describe pods/cortex-base-alertmanager-2 -n cortex-base | grep peer
-alertmanager.cluster.peers=cortex-base-alertmanager-0.cortex-base-alertmanager-headless.cortex-base.svc.cluster.local:9094,cortex-base-alertmanager-1.cortex-base-alertmanager-headless.cortex-base.svc.cluster.local:9094,cortex-base-alertmanager-2.cortex-base-alertmanager-headless.cortex-base.svc.cluster.local:9094
# http://<forwarded-am-service>:<port>/multitenant_alertmanager/status
01H4E5G6WY5AWVHT8AAE1GB31A 10.244.1.5
01H4E85YE1XWTA82HQ24T3733D 10.244.1.8
01H4E81T31RJ87MKW35J84RJSN 10.244.1.7
I don't know how to explain that :confused: (yet). I am also very confused about port 9094, which should be cluster-related only and not memberlist. I may post that on Slack, let's see.
Might be normal that the ring has old instances in its ring temporarily. Do these logs disappear after a few minutes?
In my case it seems to last abnormally long. I am curious to know if someone else has the same behavior. It is in the debug logs though; maybe it is something I just haven't noticed in the past. I am currently unsure.
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.