cortex-helm-chart

change alertmanager-svc-headless from http to grpc port + expose 9094 TCP and UDP

Open humblebundledore opened this issue 2 years ago • 13 comments

What this PR does: This PR corrects the alertmanager-headless service by exposing the grpc port instead of the http port.

cortex-alertmanager-headless     ClusterIP   None             <none>        9095/TCP    1d
cortex-distributor-headless      ClusterIP   None             <none>        9095/TCP    1d
cortex-ingester-headless         ClusterIP   None             <none>        9095/TCP    1d
cortex-query-frontend-headless   ClusterIP   None             <none>        9095/TCP    1d
cortex-store-gateway-headless    ClusterIP   None             <none>        9095/TCP    1d

This PR also exposes port 9094 on TCP and UDP for the gossip cluster in the alertmanager StatefulSet.

# k describe pods/cortex-base-alertmanager-0 -n cortex-base
  alertmanager:
    Ports:         8080/TCP, 7946/TCP, 9095/TCP, 9094/TCP, 9094/UDP
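
Not the exact template diff, just a minimal sketch of what the two changes amount to, assuming the gRPC listen port of 9095 shown in the service listing above; the port names are hypothetical:

# Headless Service: expose the gRPC port instead of the HTTP port
ports:
  - name: grpc
    port: 9095
    protocol: TCP
    targetPort: 9095

# Alertmanager StatefulSet container: expose the gossip cluster port on TCP and UDP
ports:
  - name: cluster-tcp
    containerPort: 9094
    protocol: TCP
  - name: cluster-udp
    containerPort: 9094
    protocol: UDP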

Which issue(s) this PR fixes: This PR does not fix an issue, but it is related to the conversation in #420:

#2 - Why is there no grpc port exposed? For example, ingesters do have grpc exposed.

Just missing. No real reason, I guess. See the explanation below.

Checklist

  • [x] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

humblebundledore avatar Feb 16 '23 11:02 humblebundledore

Hello, it would be great if we could add some logic for populating the peers list. I think it would look something like this (not tested):

          args:
            - "-target=alertmanager"
            - "-config.file=/etc/cortex/cortex.yaml"
            {{- if gt (int .Values.alertmanager.replicas) 1 }}
            {{- $fullName := include "cortex.alertmanagerFullname" . }}
            {{- range $i := until (int .Values.alertmanager.replicas) }}
            - --cluster.peer={{ $fullName }}-{{ $i }}.{{ $fullName }}-headless:9094
            {{- end }}
            {{- end }}

This looks to be the way that the prometheus-community alertmanager is handling peers. Source
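
For illustration only: assuming alertmanager.replicas is 3 and cortex.alertmanagerFullname renders to cortex-alertmanager, the loop above would produce roughly these rendered args:

          args:
            - "-target=alertmanager"
            - "-config.file=/etc/cortex/cortex.yaml"
            - --cluster.peer=cortex-alertmanager-0.cortex-alertmanager-headless:9094
            - --cluster.peer=cortex-alertmanager-1.cortex-alertmanager-headless:9094
            - --cluster.peer=cortex-alertmanager-2.cortex-alertmanager-headless:9094

Worth noting that the repeated --cluster.peer flag is the upstream Prometheus Alertmanager convention; Cortex itself takes a single comma-separated -alertmanager.cluster.peers flag, which is what the change discussed later in this thread (#460) ends up using.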

dpericaxon avatar Mar 01 '23 18:03 dpericaxon

Hello, it would be great if we could add some logic for populating the peers list. I think it would look something like this (not tested):

          args:
            - "-target=alertmanager"
            - "-config.file=/etc/cortex/cortex.yaml"
            {{- if gt (int .Values.alertmanager.replicas) 1 }}
            {{- $fullName := include "cortex.alertmanagerFullname" . }}
            {{- range $i := until (int .Values.alertmanager.replicas) }}
            - --cluster.peer={{ $fullName }}-{{ $i }}.{{ $fullName }}-headless:9094
            {{- end }}
            {{- end }}

This looks to be the way that the prometheus-community alertmanager is handling peers. Source

Is seeding the cluster peers necessary when we have memberlist enabled?

kd7lxl avatar Mar 08 '23 17:03 kd7lxl

Is seeding the cluster peers necessary when we have memberlist enabled?

I believe gossip memberlist and gossip cluster are two separate things here. Please correct me if I am wrong, but I asked that specific question in the Slack channel: https://cloud-native.slack.com/archives/CCYDASBLP/p1676555646043829
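
For context, a hedged sketch of the two separate gossip configurations in the Cortex config (defaults as I understand them; the join_members service name is hypothetical):

# Ring gossip via memberlist (port 7946, also visible in the container ports above)
memberlist:
  bind_port: 7946
  join_members:
    - cortex-memberlist

# Alertmanager replica gossip via cluster.go (port 9094) - separate from memberlist
alertmanager:
  cluster:
    listen_address: '0.0.0.0:9094'
    peers: ''   # has to be seeded explicitly, which is what this PR is about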

humblebundledore avatar Mar 15 '23 13:03 humblebundledore

Is seeding the cluster peers necessary when we have memberlist enabled?

I believe gossip memberlist and gossip cluster are two separate things here. Please correct me if I am wrong, but I asked that specific question in the Slack channel: https://cloud-native.slack.com/archives/CCYDASBLP/p1676555646043829

Pretty sure you are correct with this assessment. That's probably also why it's giving us so much grief.

nschad avatar Mar 15 '23 14:03 nschad

Is seeding the cluster peers necessary when we have memberlist enabled?

I believe gossip memberlist and gossip cluster are two separate things here. Please correct me if I am wrong, but I asked that specific question in the Slack channel: https://cloud-native.slack.com/archives/CCYDASBLP/p1676555646043829

Pretty sure you are correct with this assessment. That's probably also why it's giving us so much grief.

I redeployed my alertmanagers with the peers removed, and the result is as follows:

# values.yaml shipped to helm
config:
  alertmanager:
    cluster:
      listen_address: '0.0.0.0:9094'
alertmanager:
  replicas: 2
 
# /multitenant_alertmanager/status
Members
Name | Addr
-- | --
01GVK00A2JF0H9DK4FYXKTCGXR | 172.17.0.13

# k get pods
cortex-alertmanager-0                   2/2     Running   0             6m35s   172.17.0.18   minikube   <none>           <none>
cortex-alertmanager-1                   2/2     Running   0             6m47s   172.17.0.13   minikube   <none>           <none>

so we do indeed need to seed the peers (too many "ee" in that phrase), and the two available options, sketched below, are:

  • an array of alertmanager pod addresses (what Prometheus recommends)
  • the alertmanager-headless service
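
Sketching what each option would look like as the resulting flag (the -alertmanager.cluster.peers flag is the one the chart ends up using later in this thread; names taken from the pod and service listings above):

# Option 1: one DNS entry per pod, resolved through the headless service
- "-alertmanager.cluster.peers=cortex-alertmanager-0.cortex-alertmanager-headless:9094,cortex-alertmanager-1.cortex-alertmanager-headless:9094"

# Option 2: the headless service name itself; Kubernetes DNS for a headless service returns the pod IPs behind it
- "-alertmanager.cluster.peers=cortex-alertmanager-headless:9094"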

To reply to @dpericaxon's comment:

Hello, it would be great if we could add some logic for populating the peers list. I think it would look something like this (not tested)

Isn't this a values.yaml templating problem rather than a pod/args templating problem, since we are passing the peers to the alertmanager via /etc/cortex/cortex.yaml?

    Args:
      -target=alertmanager
      -config.file=/etc/cortex/cortex.yaml
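
One possible direction, sketched and untested: build the peers string where the cortex config is rendered (e.g. in the chart's config template) instead of adding pod args. The cortex.alertmanagerFullname helper is taken from the snippet above; everything else here is an assumption:

{{- $fullName := include "cortex.alertmanagerFullname" . }}
{{- $peers := list }}
{{- range $i := until (int .Values.alertmanager.replicas) }}
{{- $peers = append $peers (printf "%s-%d.%s-headless:9094" $fullName $i $fullName) }}
{{- end }}
alertmanager:
  cluster:
    listen_address: '0.0.0.0:9094'
    peers: {{ join "," $peers | quote }}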

humblebundledore avatar Mar 15 '23 16:03 humblebundledore

Where are we with this?

nschad avatar May 12 '23 06:05 nschad

I've consolidated the proposed changes here #460

Feel free to try it out and let me know if that works.

nschad avatar May 12 '23 07:05 nschad

I've consolidated the proposed changes here #460

Feel free to try it out and let me know if that works.

I am planning to come back to this issue next week; I will try #460 and keep you posted.

humblebundledore avatar Jun 15 '23 09:06 humblebundledore

I've consolidated the proposed changes here #460. Feel free to try it out and let me know if that works.

I am planning to come back to this issue next week; I will try #460 and keep you posted.

#460 has been implemented with a comma-separated peers list in https://github.com/cortexproject/cortex-helm-chart/pull/435/commits/1d830da0f1d2e96d6074f9b8d07ca051b886dc08

humblebundledore avatar Jul 03 '23 14:07 humblebundledore

@nschad / @kd7lxl - I thought we were mostly done here, but I noticed some very strange behavior:

each replica of my alertmanager cluster logs debug messages in cluster.go about failing to reconnect to an old replica:

# k logs cortex-base-alertmanager-0 -n cortex-base
level=debug ts=2023-07-03T16:13:40.509653841Z caller=cluster.go:441 component=cluster msg=reconnect result=failure peer=01H4E5FTYCW0FVZW78X181J77S addr=10.244.1.4:9094 err="1 error occurred:\n\t* Failed to join 10.244.1.4: dial tcp 10.244.1.4:9094: connect: no route to host\n\n"

level=debug ts=2023-07-03T16:15:20.50957654Z caller=cluster.go:339 component=cluster memberlist="2023/07/03 16:15:20 [DEBUG] memberlist: Failed to join 10.244.1.4: dial tcp 10.244.1.4:9094: connect: no route to host\n"

but my current peers are as follows:

# k describe pods/cortex-base-alertmanager-2 -n cortex-base | grep peer
      -alertmanager.cluster.peers=cortex-base-alertmanager-0.cortex-base-alertmanager-headless.cortex-base.svc.cluster.local:9094,cortex-base-alertmanager-1.cortex-base-alertmanager-headless.cortex-base.svc.cluster.local:9094,cortex-base-alertmanager-2.cortex-base-alertmanager-headless.cortex-base.svc.cluster.local:9094

# http://<forwarded-am-service>:<port>/multitenant_alertmanager/status
01H4E5G6WY5AWVHT8AAE1GB31A   10.244.1.5
01H4E85YE1XWTA82HQ24T3733D   10.244.1.8
01H4E81T31RJ87MKW35J84RJSN   10.244.1.7

I don't know how to explain that yet :confused:. I am also confused about port 9094 showing up in a memberlist log line, since it should only be cluster related and not memberlist. I may post that on Slack, let's see.

humblebundledore avatar Jul 03 '23 16:07 humblebundledore

@nschad / @kd7lxl - I thought we were mostly done here, but I noticed some very strange behavior:

each replica of my alertmanager cluster logs debug messages in cluster.go about failing to reconnect to an old replica:

# k logs cortex-base-alertmanager-0 -n cortex-base
level=debug ts=2023-07-03T16:13:40.509653841Z caller=cluster.go:441 component=cluster msg=reconnect result=failure peer=01H4E5FTYCW0FVZW78X181J77S addr=10.244.1.4:9094 err="1 error occurred:\n\t* Failed to join 10.244.1.4: dial tcp 10.244.1.4:9094: connect: no route to host\n\n"

level=debug ts=2023-07-03T16:15:20.50957654Z caller=cluster.go:339 component=cluster memberlist="2023/07/03 16:15:20 [DEBUG] memberlist: Failed to join 10.244.1.4: dial tcp 10.244.1.4:9094: connect: no route to host\n"

but my current peers are as follows:

# k describe pods/cortex-base-alertmanager-2 -n cortex-base | grep peer
      -alertmanager.cluster.peers=cortex-base-alertmanager-0.cortex-base-alertmanager-headless.cortex-base.svc.cluster.local:9094,cortex-base-alertmanager-1.cortex-base-alertmanager-headless.cortex-base.svc.cluster.local:9094,cortex-base-alertmanager-2.cortex-base-alertmanager-headless.cortex-base.svc.cluster.local:9094

# http://<forwarded-am-service>:<port>/multitenant_alertmanager/status
01H4E5G6WY5AWVHT8AAE1GB31A   10.244.1.5
01H4E85YE1XWTA82HQ24T3733D   10.244.1.8
01H4E81T31RJ87MKW35J84RJSN   10.244.1.7

I don't know how to explain that yet :confused:. I am also confused about port 9094 showing up in a memberlist log line, since it should only be cluster related and not memberlist. I may post that on Slack, let's see.

It might be normal for the ring to contain old instances temporarily. Do these logs disappear after a few minutes?

nschad avatar Jul 03 '23 17:07 nschad

Might be normal that the ring has old instances in its ring temporarily. Do these logs disappear after a few minutes?

In my case it seems to last abnormally long. I am curious to know whether someone else sees the same behavior. It only shows up in the debug logs though, so maybe it is something I simply hadn't noticed in the past. I am currently unsure.

humblebundledore avatar Jul 04 '23 14:07 humblebundledore

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 17 '23 01:09 stale[bot]