
ClusterMetadata: Default cluster potentially uses wrong frontend RPC address

Open thempatel opened this issue 4 years ago • 6 comments

We're working on standing up the Temporal service via Helm, and I noticed this while configuring the various YAML files. If a user configures a custom gRPC port for the frontend service, then the hardcoded default of 7933 in the cluster metadata will be incorrect.

https://github.com/temporalio/helm-charts/blob/2fb4639bec104734c0bfd50ff2832fced1772c3a/templates/server-configmap.yaml#L182

It also seems that the localhost address 127.0.0.1 would be incorrect in a deployed environment, assuming the various services (history, matching, frontend, worker) are deployed separately.

https://github.com/temporalio/temporal/blob/e2e26004552cbc0867afb342238bb3f9efeee6ce/client/clientBean.go#L87-L96
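For context, the linked template renders a cluster metadata block roughly like the following (paraphrased; exact fields vary by chart version), with the frontend rpcAddress fixed regardless of what port the frontend actually listens on:

```yaml
# Paraphrased excerpt of the rendered server config (field set may differ by chart version)
clusterMetadata:
  currentClusterName: "active"
  clusterInformation:
    active:
      enabled: true
      initialFailoverVersion: 1
      rpcName: "frontend"
      rpcAddress: "127.0.0.1:7933"   # hardcoded; ignores any custom frontend gRPC port
```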

thempatel avatar Mar 23 '21 16:03 thempatel

@thempatel did you happen to resolve this in your environment? I believe I'm running into a similar issue.

emmercm avatar Nov 17 '22 16:11 emmercm

@emmercm we ended up forking the Temporal helm chart to fix the various bugs in it. For this one, I did end up changing the hardcoded port to instead be sourced from user configuration (values.yaml), along the lines of the sketch below.
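A minimal sketch of that template change, assuming a values key like server.frontend.service.port (the key name may differ in your chart version):

```yaml
# In templates/server-configmap.yaml: keep localhost, but take the port from values
clusterInformation:
  active:
    rpcName: "frontend"
    # assumes a key like .Values.server.frontend.service.port; adjust to your values layout
    rpcAddress: "127.0.0.1:{{ .Values.server.frontend.service.port }}"
```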

note: it's been a really long time, so take this with a grain of salt:

IIRC, the localhost is OK because I think there's actually a proxy that listens on localhost for the frontend service, so connecting to localhost will just forward the connection to the internally configured frontend RPC service, which will then forward to the actual services. 🤷🏽‍♂️

thempatel avatar Nov 17 '22 20:11 thempatel

@thempatel we've also forked the chart, but more so we can better configure our unique Kubernetes environment and multi-cluster setup than anything else.

The proxy would make a ton of sense, but I didn't find any trace of it in GitHub: https://github.com/search?q=org%3Atemporalio+7933&type=code. I would think localhost in this case would be Kube node-local rather than container-local, right? I'm running into issues with multi-cluster where I believe I'm getting some cross-talk, and I've convinced myself it's this localhost config.

emmercm avatar Nov 18 '22 16:11 emmercm

@emmercm after reading #333, I noticed you're trying to run two distinct Temporal clusters. You cannot do this without isolating them: the services use a gossip protocol where they broadcast membership messages on a port. If your two clusters have services that are all broadcasting on the same ports, but you've configured two different storage instances (SQL, etc.), you're going to run into problems.

The reason I filed this issue (and subsequently forked the chart) was exactly so that we could run multiple clusters, each configured with different ports, so that the two clusters don't run into each other (roughly along the lines of the sketch below).
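A hedged sketch of what per-cluster port overrides can look like in each release's values.yaml (key names such as membershipPort are taken from newer chart versions and may differ in yours):

```yaml
# cluster-a values.yaml; give cluster-b a disjoint set of ports (e.g. 7243/6943, ...)
server:
  frontend:
    service:
      port: 7233
      membershipPort: 6933
  history:
    service:
      port: 7234
      membershipPort: 6934
  matching:
    service:
      port: 7235
      membershipPort: 6935
  worker:
    service:
      port: 7239
      membershipPort: 6939
```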

One thing you could try to see if it solves your problem (if you haven't already) is to configure each of those clusters in its own K8s namespace. If that works, then you'll just need to account for the namespace in the address your clients use to connect to the cluster.

thempatel avatar Nov 18 '22 16:11 thempatel

For some reason I swore the gossip behavior was deprecated/removed in Temporal as a step away from Cadence, but a quick tctl admin membership list_gossip shows all the pods that I would expect. Thank you for redirecting me on this one, this probably helps explain some of the behavior I'm seeing.

emmercm avatar Nov 21 '22 19:11 emmercm

This was super helpful! After a fresh deployment, no communication with the task queues was working. I was getting context deadline timeouts on tctl tq describe --taskqueue all.

Then following this thread I changed the rpcAddress to match the frontend service name and port in my cluster: rpcAddress: "temporal-frontend:7233"

Now I'm able to list task queues, and I can see in tctl admin membership list_gossip that the matching service joined.
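For anyone else hitting this, the change amounts to pointing the cluster metadata at the frontend Service instead of localhost; the service name and port below are from my deployment and may differ in yours:

```yaml
# Rendered cluster metadata after the change (sketch; exact surrounding fields vary)
clusterMetadata:
  clusterInformation:
    active:
      rpcName: "frontend"
      rpcAddress: "temporal-frontend:7233"   # was the hardcoded 127.0.0.1:7933
```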

dmateusp avatar Jan 29 '24 10:01 dmateusp