
Wrong Instance Count because of Cluster Restart

Open maheshrajrp opened this issue 2 years ago • 7 comments

Overview of the Issue

When an Azure cluster is restarted, the previous pods running inside the mesh are lost and new pods are created. After the restart, Consul doesn't pick up that the previous pods have been deleted and no longer exist, so it still tries to route to them, resulting in the following error when I try to consume the service via the UI/API.

[image: error screenshot]

Furthermore, these are the active pods: [image: active pods in AKS]

Whereas the Consul UI shows this: [image: Consul UI service instances]

Notice that only one frontend pod is running in AKS, whereas the Consul UI shows two instances of the service.
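As a way to compare the two views side by side, here is a minimal sketch (the service name frontend and the label selector are assumptions based on the screenshots, and <consul-server> is a placeholder):

  # Pods actually running in AKS (label selector is an assumption)
  kubectl get pods -l app=frontend

  # Service instances Consul has registered (catalog API)
  curl -s https://<consul-server>/v1/catalog/service/frontend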


Reproduction Steps

Steps to reproduce this issue, e.g.:

  1. Create a service mesh within Azure Kubernetes Service.
  2. Stop and start the AKS cluster (see the sketch after these steps).
  3. Notice that the previous pod info still exists in the Consul UI even though it no longer exists in AKS.
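For step 2, a rough sketch using the Azure CLI (resource group and cluster name are placeholders):

  # Stop the AKS cluster (control plane and node pools are deallocated)
  az aks stop --resource-group <resource-group> --name <cluster-name>

  # Start it again; nodes come back and pods are rescheduled with new IPs
  az aks start --resource-group <resource-group> --name <cluster-name>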


Consul info for both Client and Server

Client info

/ $ consul info
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 7c04b6a0
    version = 1.15.1
    version_metadata =
consul:
    acl = disabled
    bootstrap = true
    known_datacenters = 1
    leader = true
    leader_addr = 10.244.0.12:8300
    server = true
raft:
    applied_index = 2213
    commit_index = 2213
    fsm_pending = 0
    last_contact = 0
    last_log_index = 2213
    last_log_term = 4
    last_snapshot_index = 0
    last_snapshot_term = 0
    latest_configuration = [{Suffrage:Voter ID:b9744a41-cccd-861f-eca2-f3b18496e5b4 Address:10.244.0.12:8300}]
    latest_configuration_index = 0
    num_peers = 0
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 4
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 259
    max_procs = 4
    os = linux
    version = go1.20.1
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 1
    event_time = 4
    failed = 0
    health_score = 0
    intent_queue = 1
    left = 0
    member_time = 4
    members = 1
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1
    members = 1
    query_queue = 0
    query_time = 1

Operating system and Environment details

Kubernetes Version: 1.24.10
Cloud Provider: Azure
Environment: Azure Kubernetes Service

maheshrajrp avatar Apr 16 '23 02:04 maheshrajrp

How about x-posting to https://github.com/hashicorp/consul-k8s?

huikang avatar Apr 19 '23 23:04 huikang

@maheshrajrp The UI is correct in this case since each replica is a service instance. Your frontend Deployment has two replicas, each with its own IP address, so both are registered as service instances.

david-yu avatar Apr 24 '23 18:04 david-yu

@david-yu But in reality the second pod isn't running, which the UI hasn't picked up? Only one replica is running.

maheshrajrp avatar Apr 24 '23 18:04 maheshrajrp

I see, apologies. We probably need more information on how to repro this to see if this is indeed an issue: Helm chart configs, application deployment YAML, and logs from the consul-k8s components. It would probably be better to file this issue in consul-k8s as previously suggested.

david-yu avatar Apr 24 '23 18:04 david-yu

Thanks for the suggestion, @david-yu @huikang. I have posted this in consul-k8s: https://github.com/hashicorp/consul-k8s/issues/2085

maheshrajrp avatar Apr 24 '23 23:04 maheshrajrp

I have this problem, too. I have a Service (S) that starts an executable Process (P), which registers itself with Consul and exposes a health check endpoint with a unique CheckID. Consul detects the registration, and all is green and good. I manually kill the process (P), and Consul detects the issue and marks Process (P) as unhealthy. Good. The Service (S) detects the loss of the process. After a short interval elapses, Service (S) restarts another Process (P) with another unique CheckID. This new instance of Process (P) is the only instance running, yet Consul reports two instances of Process (P), both healthy. I don't know if this is a bug, a timing issue, or if I've not done something correctly.
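For context, a minimal sketch of the kind of registration described above, using Consul's agent HTTP API; the service name, ID, CheckID, and health endpoint are illustrative, not taken from the original setup:

  # Register Process (P) against the local agent with a unique CheckID
  curl -X PUT http://127.0.0.1:8500/v1/agent/service/register -d '{
    "Name": "process-p",
    "ID": "process-p-20231113-1",
    "Check": {
      "CheckID": "process-p-check-20231113-1",
      "HTTP": "http://127.0.0.1:9090/health",
      "Interval": "10s",
      "DeregisterCriticalServiceAfter": "1m"
    }
  }'

If the old instance is never explicitly deregistered, its registration lingers; setting DeregisterCriticalServiceAfter on the check, or calling /v1/agent/service/deregister/<service-id> when the process dies, is one way to have the stale instance cleaned up.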

CarlMCook avatar Nov 13 '23 21:11 CarlMCook

We have the same problem after restarting the Azure cluster. We have found the following workaround:

  1. We list all nodes: GET https://ip/v1/catalog/nodes
  2. We remove the worker nodes that are no longer active from the pool: PUT https://ip/v1/catalog/deregister with the following body (curl sketch below):
{
  "Datacenter": "azure1",
  "Node": "aks-stablepool-12466665-vmss00001p-virtual"
}
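For reference, the same workaround as curl calls (a sketch; the address, datacenter, and node name are the examples/placeholders above):

  # 1. List all nodes known to the catalog
  curl -s https://ip/v1/catalog/nodes

  # 2. Deregister a node that no longer exists in AKS
  curl -X PUT https://ip/v1/catalog/deregister -d '{
    "Datacenter": "azure1",
    "Node": "aks-stablepool-12466665-vmss00001p-virtual"
  }'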

helbling-gum avatar Feb 05 '24 10:02 helbling-gum