                        Wrong Instance Count because of Cluster Restart
Overview of the Issue
When an Azure cluster is restarted, the pods previously running inside the mesh are lost and new pods are created. After the restart, Consul doesn't pick up that the previous pods have been deleted, so it still tries to route to them, resulting in the following error when I try to consume the service via the UI/API.

Furthermore:
These are the active pods:

Whereas the Consul UI shows this:

Notice that only one frontend pod is running in AKS, whereas the UI shows two instances of the service.
Reproduction Steps
Steps to reproduce this issue:
- Create a service mesh within Azure Kubernetes Service.
- Stop and Start the AKS.
- Notice that the previous pod info still exists in the Consul UI even though the pod no longer exists in AKS (a check for this discrepancy is sketched below).
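
For anyone wanting to confirm the discrepancy from the last step, here is a minimal sketch (not from the original report) that compares the pod IPs Kubernetes sees with the instances Consul still advertises. The Consul address, service name, and namespace are assumptions to adjust to your setup; it needs the requests and kubernetes Python packages.

# Compare live pod IPs in AKS with the instances Consul still lists
# for a service. Address, service name, and namespace are assumptions.
import requests
from kubernetes import client, config

CONSUL = "http://localhost:8500"   # assumed Consul HTTP API address
SERVICE = "frontend"               # assumed service name
NAMESPACE = "default"              # assumed Kubernetes namespace

# Pod IPs that actually exist in the cluster after the restart.
config.load_kube_config()
pods = client.CoreV1Api().list_namespaced_pod(NAMESPACE)
pod_ips = {p.status.pod_ip for p in pods.items if p.status.pod_ip}

# Instances Consul still has registered for the service.
for inst in requests.get(f"{CONSUL}/v1/catalog/service/{SERVICE}").json():
    addr = inst.get("ServiceAddress") or inst.get("Address")
    state = "OK" if addr in pod_ips else "STALE (no matching pod)"
    print(f"{inst['ServiceID']} @ {addr}: {state}")

Any instance flagged STALE is one Consul kept after the restart even though no pod backs it anymore.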
Consul info for both Client and Server
Client info
/ $ consul info
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 7c04b6a0
    version = 1.15.1
    version_metadata =
consul:
    acl = disabled
    bootstrap = true
    known_datacenters = 1
    leader = true
    leader_addr = 10.244.0.12:8300
    server = true
raft:
    applied_index = 2213
    commit_index = 2213
    fsm_pending = 0
    last_contact = 0
    last_log_index = 2213
    last_log_term = 4
    last_snapshot_index = 0
    last_snapshot_term = 0
    latest_configuration = [{Suffrage:Voter ID:b9744a41-cccd-861f-eca2-f3b18496e5b4 Address:10.244.0.12:8300}]
    latest_configuration_index = 0
    num_peers = 0
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 4
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 259
    max_procs = 4
    os = linux
    version = go1.20.1
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 1
    event_time = 4
    failed = 0
    health_score = 0
    intent_queue = 1
    left = 0
    member_time = 4
    members = 1
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1
    members = 1
    query_queue = 0
    query_time = 1

Operating system and Environment details
Kubernetes Version: 1.24.10
Cloud Provider: Azure
Environment: Azure Kubernetes Service
How about x-posting to https://github.com/hashicorp/consul-k8s?
@maheshrajrp The UI is correct in this case since each replica is a service instance. Your frontend Deployment runs two replicas, each with its own IP address, so both are registered as service instances.
@david-yu In reality the pod isn't running, which the UI hasn't picked up; only one replica is running.
I see, apologies. We probably need more information on how to reproduce this to see if this is indeed an issue: Helm chart configs, the application deployment YAML, and logs from the consul-k8s components. It would probably be better to file this issue in consul-k8s, as was previously suggested.
Thanks for the suggestion. @david-yu @huikang Have posted this in consul-k8s. https://github.com/hashicorp/consul-k8s/issues/2085
I have this problem, too. I have a Service (S) that starts an executable Process (P), which registers itself with Consul and exposes a health check endpoint with a unique CheckID. Consul detects the registration, and all is green and good. I manually kill Process (P), and Consul detects the issue and marks Process (P) as unhealthy. Good. Service (S) detects the loss of the process. After a short interval elapses, Service (S) restarts another Process (P) with another unique CheckID. This new instance of Process (P) is the only instance running, yet Consul reports two instances of Process (P), both healthy. I don't know if this is a bug, a timing issue, or if I've not done something correctly.
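Not a confirmed fix, but one pattern that may help with this register/kill/re-register cycle is to let Consul reap instances whose checks stay critical (DeregisterCriticalServiceAfter on the check), or to have Service (S) explicitly deregister the old ServiceID before starting the replacement. A minimal sketch against the local agent HTTP API; the service name, ID, port, and health endpoint are hypothetical:

# Register Process (P) with a check Consul will reap once it stays critical.
# Service name, ID, port, and health URL below are placeholders.
import requests

CONSUL = "http://localhost:8500"  # assumed local Consul agent

registration = {
    "ID": "process-p-1",                  # unique per process instance
    "Name": "process-p",
    "Port": 9090,
    "Check": {
        "CheckID": "process-p-1-health",
        "HTTP": "http://localhost:9090/health",
        "Interval": "10s",
        # Once the check has been critical this long, Consul deregisters
        # the whole service instance automatically (minimum is 1m).
        "DeregisterCriticalServiceAfter": "1m",
    },
}
requests.put(f"{CONSUL}/v1/agent/service/register", json=registration).raise_for_status()

# Alternatively, have Service (S) clean up the old instance explicitly
# before it starts the replacement:
requests.put(f"{CONSUL}/v1/agent/service/deregister/process-p-1").raise_for_status()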
We have the same problem after restarting the Azure cluster. We have found the following workaround (a scripted version is sketched after the example body below):
- We list all nodes: GET to https://ip/v1/catalog/nodes
- We remove the worker nodes that are no longer active from the pool: PUT to https://ip/v1/catalog/deregister with the following body:
{
  "Datacenter": "azure1",
  "Node": "aks-stablepool-12466665-vmss00001p-virtual"
}
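
A minimal script of that workaround (a sketch under assumptions, not an official tool): list the nodes Consul knows about and deregister the ones that are no longer in the cluster. The Consul address and datacenter come from the steps above; the set of live node names is a placeholder you would fill from kubectl get nodes.

# Deregister Consul catalog nodes that no longer exist in the cluster.
import requests

CONSUL = "https://ip"      # Consul API address from the steps above (placeholder)
DATACENTER = "azure1"
# Placeholder: fill with the node names still active, e.g. from "kubectl get nodes".
live_nodes = {"aks-stablepool-00000000-vmss000000-virtual"}

for node in requests.get(f"{CONSUL}/v1/catalog/nodes").json():
    if node["Node"] not in live_nodes:
        body = {"Datacenter": DATACENTER, "Node": node["Node"]}
        requests.put(f"{CONSUL}/v1/catalog/deregister", json=body)
        print(f"deregistered stale node {node['Node']}")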