terraform-provider-rancher2
[BUG] Nodes are not added to the external load balancer backend pool after load balancer is active
Rancher Server Setup
- Rancher version: v2.6.7
- Installation option: Helm Chart
  - AKS
  - Kubernetes: v1.23.8
- Proxy/Cert Details:
  - ingress-nginx: v4.2.3
  - cert-manager: v1.9.1
Information about the Cluster
- Kubernetes version: v1.23.8-rancher1-1
- Cluster Type (Local/Downstream): Downstream
- Infrastructure Provider: Azure
User Information
- What is the role of the user logged in? Admin/Cluster
Describe the bug
When creating the downstream RKE cluster in Azure using node pools (1 master pool with the etcd and control plane roles and 3 worker pools), the master is created first, and the load balancer is then created by the user addon. The master gets registered, and the 1st (and sometimes the 2nd) worker also gets registered, most likely because the load balancer is not yet active in the virtual network at that point, so the worker still has a working gateway and is able to register.
Meanwhile, the load balancer eventually becomes active, and any new workers no longer get the load balancer as their gateway, which breaks their registration. Those worker nodes get stuck in the "Registering" state, and any worker node added through the Rancher UI scaling feature gets stuck in "IP Resolved" until it times out and is deleted.
Logically, the load balancer should be created first, and Rancher should wait for/verify that it is active in the virtual network before it starts adding nodes.
Since that is not happening, I believe there is a logic bug in Rancher itself.
To Reproduce
- Create the main AKS Rancher cluster
- Add peering between Rancher virtual network and RKE1 downstream virtual network
- Create the downstream RKE1 cluster with user addon job for creating the external load balancer
- Wait for the cluster to finish building successfully
Result
Only the initial master nodes and the first worker node are registered into the Kubernetes cluster. The other worker nodes get stuck in the "Registering" state, and no additional nodes can be added through the Rancher UI; they get stuck in "IP Resolved".
Expected Result
All nodes are registered and I can scale up nodes through the Rancher UI.
Screenshots
Additional context
The following Terraform code creates the downstream RKE1 cluster with 1 master node pool (control plane + etcd), 3 worker pools (system, kafka, and general), and a user addon that creates an external load balancer:
resource "azurerm_resource_group" "rke" {
name = "${var.resource_group}-${var.rke_name_prefix}-rg"
location = var.azure_region
}
resource "azurerm_virtual_network" "rke" {
name = "${var.rke_name_prefix}-vnet"
address_space = var.rke_address_space
location = var.azure_region
resource_group_name = azurerm_resource_group.rke.name
}
resource "azurerm_subnet" "rke" {
name = "${var.rke_name_prefix}-subnet"
resource_group_name = azurerm_resource_group.rke.name
virtual_network_name = azurerm_virtual_network.rke.name
address_prefixes = var.rke_address_prefixes
}
## Create Vnet Peering Between Rancher cluster and downstream RKE cluster
resource "azurerm_virtual_network_peering" "rancher" {
name = "rancher-vnet-peering"
resource_group_name = azurerm_resource_group.rke.name
virtual_network_name = azurerm_virtual_network.rke.name
remote_virtual_network_id = var.rancher_vnet_id
}
data "azurerm_virtual_network" "rke" {
name = azurerm_virtual_network.rke.name
resource_group_name = azurerm_virtual_network.rke.resource_group_name
}
resource "azurerm_virtual_network_peering" "rke" {
name = "rke-vnet-peering"
resource_group_name = var.rancher_rg_name
virtual_network_name = var.rancher_vnet_name
remote_virtual_network_id = data.azurerm_virtual_network.rke.id
depends_on = [data.azurerm_virtual_network.rke]
}
## Create Network Security Groups
resource "azurerm_network_security_group" "worker" {
name = "worker-nsg"
location = azurerm_resource_group.rke.location
resource_group_name = azurerm_resource_group.rke.name
security_rule {
name = "SSH_IN"
priority = 100
direction = "Inbound"
access = "Allow"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = 22
source_address_prefix = "*"
destination_address_prefix = "*"
}
security_rule {
name = "CanalOverlay_IN"
priority = 110
direction = "Inbound"
access = "Allow"
protocol = "Udp"
source_port_range = "*"
destination_port_range = 8472
source_address_prefix = "*"
destination_address_prefix = "*"
}
security_rule {
name = "CanalProbe_IN"
priority = 120
direction = "Inbound"
access = "Allow"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = 9099
source_address_prefix = "*"
destination_address_prefix = "*"
}
security_rule {
name = "IngressProbe_IN"
priority = 130
direction = "Inbound"
access = "Allow"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = 10254
source_address_prefix = "*"
destination_address_prefix = "*"
}
security_rule {
name = "NodePort_UDP_IN"
priority = 140
direction = "Inbound"
access = "Allow"
protocol = "Udp"
source_port_range = "*"
destination_port_range = "30000-32767"
source_address_prefix = "*"
destination_address_prefix = "*"
}
security_rule {
name = "NodePort_TCP_IN"
priority = 150
direction = "Inbound"
access = "Allow"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = "30000-32767"
source_address_prefix = "*"
destination_address_prefix = "*"
}
security_rule {
name = "HttpsIngress_IN"
priority = 160
direction = "Inbound"
access = "Allow"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = 443
source_address_prefix = "*"
destination_address_prefix = "*"
}
security_rule {
name = "HttpIngress_IN"
priority = 170
direction = "Inbound"
access = "Allow"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = 80
source_address_prefix = "*"
destination_address_prefix = "*"
}
security_rule {
name = "DockerDaemon_IN"
priority = 180
direction = "Inbound"
access = "Allow"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = 2376
source_address_prefix = "*"
destination_address_prefix = "*"
}
security_rule {
name = "Metrics_IN"
priority = 190
direction = "Inbound"
access = "Allow"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = 10250
source_address_prefix = "*"
destination_address_prefix = "*"
}
security_rule {
name = "KubeAPI_IN"
priority = 200
direction = "Inbound"
access = "Allow"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = 6443
source_address_prefix = "*"
destination_address_prefix = "*"
}
}
resource "azurerm_network_security_group" "control" {
name = "control-nsg"
location = azurerm_resource_group.rke.location
resource_group_name = azurerm_resource_group.rke.name
security_rule {
name = "SSH_IN"
priority = 100
direction = "Inbound"
access = "Allow"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = 22
source_address_prefix = "*"
destination_address_prefix = "*"
}
security_rule {
name = "CanalOverlay_IN"
priority = 110
direction = "Inbound"
access = "Allow"
protocol = "Udp"
source_port_range = "*"
destination_port_range = 8472
source_address_prefix = "*"
destination_address_prefix = "*"
}
security_rule {
name = "CanalProbe_IN"
priority = 120
direction = "Inbound"
access = "Allow"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = 9099
source_address_prefix = "*"
destination_address_prefix = "*"
}
security_rule {
name = "IngressProbe_IN"
priority = 130
direction = "Inbound"
access = "Allow"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = 10254
source_address_prefix = "*"
destination_address_prefix = "*"
}
security_rule {
name = "Etcd_IN"
priority = 140
direction = "Inbound"
access = "Allow"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = "2379-2380"
source_address_prefix = "*"
destination_address_prefix = "*"
}
security_rule {
name = "DockerDaemon_IN"
priority = 170
direction = "Inbound"
access = "Allow"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = 2376
source_address_prefix = "*"
destination_address_prefix = "*"
}
security_rule {
name = "Metrics_IN"
priority = 180
direction = "Inbound"
access = "Allow"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = 10250
source_address_prefix = "*"
destination_address_prefix = "*"
}
security_rule {
name = "HttpsIngress_IN"
priority = 190
direction = "Inbound"
access = "Allow"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = 443
source_address_prefix = "*"
destination_address_prefix = "*"
}
security_rule {
name = "HttpIngress_IN"
priority = 200
direction = "Inbound"
access = "Allow"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = 80
source_address_prefix = "*"
destination_address_prefix = "*"
}
security_rule {
name = "KubeAPI_IN"
priority = 210
direction = "Inbound"
access = "Allow"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = 6443
source_address_prefix = "*"
destination_address_prefix = "*"
}
security_rule {
name = "NodePort_UDP_IN"
priority = 220
direction = "Inbound"
access = "Allow"
protocol = "Udp"
source_port_range = "*"
destination_port_range = "30000-32767"
source_address_prefix = "*"
destination_address_prefix = "*"
}
security_rule {
name = "NodePort_TCP_IN"
priority = 230
direction = "Inbound"
access = "Allow"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = "30000-32767"
source_address_prefix = "*"
destination_address_prefix = "*"
}
}
## Create Availability Sets
resource "azurerm_availability_set" "control" {
name = "control-availset"
location = azurerm_resource_group.rke.location
resource_group_name = azurerm_resource_group.rke.name
}
resource "azurerm_availability_set" "system" {
name = "system-availset"
location = azurerm_resource_group.rke.location
resource_group_name = azurerm_resource_group.rke.name
}
resource "azurerm_availability_set" "general" {
name = "general-availset"
location = azurerm_resource_group.rke.location
resource_group_name = azurerm_resource_group.rke.name
}
resource "azurerm_availability_set" "kafka" {
name = "kafka-availset"
location = azurerm_resource_group.rke.location
resource_group_name = azurerm_resource_group.rke.name
}
## Create a new rancher2 RKE Cluster
resource "rancher2_cluster" "rke" {
name = "${var.rke_name_prefix}-cluster"
description = "Downstream RKE Cluster"
cluster_auth_endpoint {
enabled = true
}
rke_config {
ignore_docker_version = false
kubernetes_version = "v${var.kubernetes_version}-rancher1-1"
authentication {
strategy = "x509|webhook"
}
network {
plugin = "canal"
}
ingress {
provider = "nginx"
network_mode = "none"
http_port = 8080
https_port = 8443
default_backend = false
node_selector = var.system_template.labels
}
services {
etcd {
backup_config {
enabled = true
interval_hours = 12
retention = 6
}
creation = "12h"
retention = "72h"
snapshot = false
}
kube_api {
pod_security_policy = false
service_node_port_range = "30000-32767"
}
}
addons = file("${path.module}/addons/loadbalancer.yaml")
cloud_provider {
name = "azure"
azure_cloud_provider {
aad_client_id = azuread_application.app.application_id
aad_client_secret = azuread_service_principal_password.auth.value
subscription_id = data.azurerm_subscription.subscription.subscription_id
tenant_id = data.azurerm_subscription.subscription.tenant_id
load_balancer_sku = "standard"
subnet_name = azurerm_subnet.rke.name
vnet_name = azurerm_virtual_network.rke.name
resource_group = azurerm_resource_group.rke.name
use_instance_metadata = true
vm_type = "standard"
primary_availability_set_name = azurerm_availability_set.system.name
use_managed_identity_extension = false
}
}
}
provider = rancher2.admin
}
## Create Node Templates
resource "rancher2_node_template" "control" {
name = "control-template"
description = "Node Template for RKE Cluster on Azure"
cloud_credential_id = rancher2_cloud_credential.cloud_credential.id
engine_install_url = "https://releases.rancher.com/install-docker/20.10.sh"
labels = var.control_template.labels
azure_config {
managed_disks = var.control_template.managed_disks
location = azurerm_resource_group.rke.location
image = var.control_template.image
size = var.control_template.size
storage_type = var.control_template.storage_type
resource_group = azurerm_resource_group.rke.name
no_public_ip = var.control_template.no_public_ip
subnet = azurerm_subnet.rke.name
vnet = azurerm_virtual_network.rke.name
nsg = azurerm_network_security_group.control.name
availability_set = azurerm_availability_set.control.name
ssh_user = var.admin_username
}
provider = rancher2.admin
}
resource "rancher2_node_template" "system" {
name = "system-template"
description = "Node Template for RKE Cluster on Azure"
cloud_credential_id = rancher2_cloud_credential.cloud_credential.id
engine_install_url = "https://releases.rancher.com/install-docker/20.10.sh"
labels = var.system_template.labels
azure_config {
managed_disks = var.system_template.managed_disks
location = azurerm_resource_group.rke.location
image = var.system_template.image
size = var.system_template.size
storage_type = var.system_template.storage_type
resource_group = azurerm_resource_group.rke.name
no_public_ip = var.system_template.no_public_ip
subnet = azurerm_subnet.rke.name
vnet = azurerm_virtual_network.rke.name
nsg = azurerm_network_security_group.worker.name
availability_set = azurerm_availability_set.system.name
ssh_user = var.admin_username
}
provider = rancher2.admin
}
resource "rancher2_node_template" "kafka" {
name = "kafka-template"
description = "Node Template for RKE Cluster on Azure"
cloud_credential_id = rancher2_cloud_credential.cloud_credential.id
engine_install_url = "https://releases.rancher.com/install-docker/20.10.sh"
labels = var.kafka_template.labels
azure_config {
managed_disks = var.kafka_template.managed_disks
location = azurerm_resource_group.rke.location
image = var.kafka_template.image
size = var.kafka_template.size
storage_type = var.kafka_template.storage_type
resource_group = azurerm_resource_group.rke.name
no_public_ip = var.kafka_template.no_public_ip
subnet = azurerm_subnet.rke.name
vnet = azurerm_virtual_network.rke.name
nsg = azurerm_network_security_group.worker.name
availability_set = azurerm_availability_set.kafka.name
ssh_user = var.admin_username
}
provider = rancher2.admin
}
resource "rancher2_node_template" "general" {
name = "general-template"
description = "Node Template for RKE Cluster on Azure"
cloud_credential_id = rancher2_cloud_credential.cloud_credential.id
engine_install_url = "https://releases.rancher.com/install-docker/20.10.sh"
labels = var.general_template.labels
azure_config {
managed_disks = var.general_template.managed_disks
location = azurerm_resource_group.rke.location
image = var.general_template.image
size = var.general_template.size
storage_type = var.general_template.storage_type
resource_group = azurerm_resource_group.rke.name
no_public_ip = var.general_template.no_public_ip
subnet = azurerm_subnet.rke.name
vnet = azurerm_virtual_network.rke.name
nsg = azurerm_network_security_group.worker.name
availability_set = azurerm_availability_set.general.name
ssh_user = var.admin_username
}
provider = rancher2.admin
}
## Create Node Pools
resource "rancher2_node_pool" "control" {
cluster_id = rancher2_cluster.rke.id
name = "control-node-pool"
hostname_prefix = "control"
node_template_id = rancher2_node_template.control.id
quantity = var.control_pool.quantity
control_plane = true
etcd = true
worker = false
labels = var.control_pool.labels
provider = rancher2.admin
}
resource "rancher2_node_pool" "system" {
cluster_id = rancher2_cluster.rke.id
name = "system-node-pool"
hostname_prefix = "system"
node_template_id = rancher2_node_template.system.id
quantity = var.system_pool.quantity
control_plane = false
etcd = false
worker = true
labels = var.system_pool.labels
provider = rancher2.admin
}
resource "rancher2_node_pool" "kafka" {
cluster_id = rancher2_cluster.rke.id
name = "kafka-node-pool"
hostname_prefix = "kafka"
node_template_id = rancher2_node_template.kafka.id
quantity = var.kafka_pool.quantity
control_plane = false
etcd = false
worker = true
labels = var.kafka_pool.labels
provider = rancher2.admin
}
resource "rancher2_node_pool" "general" {
cluster_id = rancher2_cluster.rke.id
name = "general-pool"
hostname_prefix = "general"
node_template_id = rancher2_node_template.general.id
quantity = var.general_pool.quantity
control_plane = false
etcd = false
worker = true
labels = var.general_pool.labels
provider = rancher2.admin
}
## Create a new rancher2 Cluster Sync
resource "rancher2_cluster_sync" "rke" {
cluster_id = rancher2_cluster.rke.id
state_confirm = 90
node_pool_ids = [rancher2_node_pool.control.id, rancher2_node_pool.system.id, rancher2_node_pool.kafka.id, rancher2_node_pool.general.id]
provider = rancher2.admin
}
Addon used to expose the ingress controller using a cloud load balancer:
# external load balancer
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: ingress-nginx
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/part-of: ingress-nginx
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  externalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: http
  - name: https
    port: 443
    protocol: TCP
    targetPort: https
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: ingress-nginx
    app.kubernetes.io/name: ingress-nginx
  type: LoadBalancer
Information about nodes, pods, and services with the Rancher CLI
diclonius@pop-os:~/rancher-project$ rancher kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
control-plane2 Ready controlplane,etcd 36m v1.23.8 10.100.0.8 <none> Ubuntu 20.04.4 LTS 5.15.0-1017-azure docker://20.10.12
general1 Ready worker 32m v1.23.8 10.100.0.5 <none> Ubuntu 20.04.4 LTS 5.15.0-1017-azure docker://20.10.12
system1 Ready worker 32m v1.23.8 10.100.0.6 <none> Ubuntu 20.04.4 LTS 5.15.0-1017-azure docker://20.10.12
diclonius@pop-os:~/rancher-project$ rancher nodes
ID NAME STATE POOL DESCRIPTION
c-5kqcl:m-bw8jv control-plane2 active control-plane
c-5kqcl:m-chwdg kafka2 registering kafka
c-5kqcl:m-lxff8 system1 active system
c-5kqcl:m-mmbk5 general1 active general
diclonius@pop-os:~/rancher-project$ rancher kubectl get nodes
NAME STATUS ROLES AGE VERSION
control-plane1 Ready controlplane,etcd 28m v1.23.8
kafka1 Ready worker 25m v1.23.8
diclonius@pop-os:~/rancher-project$ rancher kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
control-plane1 Ready controlplane,etcd 28m v1.23.8 10.100.0.4 <none> Ubuntu 20.04.4 LTS 5.15.0-1017-azure docker://20.10.12
kafka1 Ready worker 25m v1.23.8 10.100.0.5 <none> Ubuntu 20.04.4 LTS 5.15.0-1017-azure docker://20.10.12
diclonius@pop-os:~/rancher-project$ rancher kubectl get pod --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cattle-fleet-system fleet-agent-7f8ddd996f-pkkqm 1/1 Running 0 30m 10.42.1.9 general1 <none> <none>
cattle-system cattle-cluster-agent-75d4dbdf69-kfzjl 1/1 Running 6 (32m ago) 35m 10.42.0.5 control-plane2 <none> <none>
cattle-system cattle-cluster-agent-75d4dbdf69-xxbrv 1/1 Running 0 31m 10.42.2.4 system1 <none> <none>
cattle-system cattle-node-agent-jpvm5 1/1 Running 0 32m 10.100.0.5 general1 <none> <none>
cattle-system cattle-node-agent-nxldl 1/1 Running 0 32m 10.100.0.6 system1 <none> <none>
cattle-system cattle-node-agent-vntxh 1/1 Running 0 35m 10.100.0.8 control-plane2 <none> <none>
cattle-system kube-api-auth-hn6m4 1/1 Running 0 35m 10.100.0.8 control-plane2 <none> <none>
ingress-nginx ingress-nginx-admission-create-kzfww 0/1 Completed 0 35m 10.42.0.3 control-plane2 <none> <none>
ingress-nginx ingress-nginx-admission-patch-dvb2d 0/1 Completed 0 35m 10.42.0.4 control-plane2 <none> <none>
ingress-nginx nginx-ingress-controller-rxrp4 1/1 Running 0 32m 10.42.2.2 system1 <none> <none>
ingress-nginx nginx-ingress-controller-vh46n 1/1 Running 0 32m 10.42.1.5 general1 <none> <none>
kube-system calico-kube-controllers-fc7fcb565-ptdpb 1/1 Running 0 36m 10.42.0.2 control-plane2 <none> <none>
kube-system canal-j7jg8 2/2 Running 0 36m 10.100.0.8 control-plane2 <none> <none>
kube-system canal-vmgcp 2/2 Running 0 32m 10.100.0.5 general1 <none> <none>
kube-system canal-vrtrx 2/2 Running 0 32m 10.100.0.6 system1 <none> <none>
kube-system coredns-548ff45b67-cjksg 1/1 Running 0 36m 10.42.1.4 general1 <none> <none>
kube-system coredns-548ff45b67-jsv6l 1/1 Running 0 31m 10.42.2.3 system1 <none> <none>
kube-system coredns-autoscaler-d5944f655-gz9gc 1/1 Running 0 36m 10.42.1.3 general1 <none> <none>
kube-system metrics-server-5c4895ffbd-5phcq 1/1 Running 0 35m 10.42.1.2 general1 <none> <none>
kube-system rke-coredns-addon-deploy-job-nmwvx 0/1 Completed 0 36m 10.100.0.8 control-plane2 <none> <none>
kube-system rke-ingress-controller-deploy-job-jzgjk 0/1 Completed 0 35m 10.100.0.8 control-plane2 <none> <none>
kube-system rke-metrics-addon-deploy-job-7rs45 0/1 Completed 0 36m 10.100.0.8 control-plane2 <none> <none>
kube-system rke-network-plugin-deploy-job-hm4jz 0/1 Completed 0 36m 10.100.0.8 control-plane2 <none> <none>
kube-system rke-user-addon-deploy-job-q9bt5 0/1 Completed 0 35m 10.100.0.8 control-plane2 <none> <none>
diclonius@pop-os:~/rancher-project$ rancher kubectl get svc --all-namespaces -o wide
INFO[0000] Saving config to /home/diclonius/.rancher/cli2.json
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
cattle-system cattle-cluster-agent ClusterIP 10.43.97.201 <none> 80/TCP,443/TCP 26m app=cattle-cluster-agent
default kubernetes ClusterIP 10.43.0.1 <none> 443/TCP 28m <none>
ingress-nginx ingress-nginx-controller LoadBalancer 10.43.88.81 20.81.13.20 80:31614/TCP,443:31288/TCP 26m app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
ingress-nginx ingress-nginx-controller-admission ClusterIP 10.43.138.192 <none> 443/TCP 26m app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
kube-system kube-dns ClusterIP 10.43.0.10 <none> 53/UDP,53/TCP,9153/TCP 27m k8s-app=kube-dns
kube-system metrics-server ClusterIP 10.43.56.2 <none> 443/TCP 27m k8s-app=metrics-server
Provisioning Log for the cluster
DNS configuration
All nodes have the same DNS config.
root@kafka2:~# cat /etc/resolv.conf
# This file is managed by man:systemd-resolved(8). Do not edit.
#
# This is a dynamic resolv.conf file for connecting local clients to the
# internal DNS stub resolver of systemd-resolved. This file lists all
# configured search domains.
#
# Run "resolvectl status" to see details about the uplink DNS servers
# currently in use.
#
# Third party programs must not access this file directly, but only through the
# symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a different way,
# replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.
nameserver 127.0.0.53
options edns0 trust-ad
search u1hmh22cgynu5pyvgyrmdt5hig.ax.internal.cloudapp.net
Rancher agent container logs
Rancher agent logs of a node stuck in the "Registering" state.
root@kafka2:~# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c1d9959546bb rancher/rke-tools:v0.1.87 "nginx-proxy CP_HOST…" 41 minutes ago Up 41 minutes nginx-proxy
42fd0a67acd3 rancher/hyperkube:v1.23.8-rancher1 "/opt/rke-tools/entr…" 41 minutes ago Up 41 minutes kubelet
ed808063f60d rancher/hyperkube:v1.23.8-rancher1 "/opt/rke-tools/entr…" 41 minutes ago Up 41 minutes kube-proxy
fbe82d7c2e8e rancher/rancher-agent:v2.6.7 "run.sh --server htt…" 45 minutes ago Up 45 minutes exciting_pascal
root@kafka2:~# docker logs fbe82d7c2e8e
time="2022-09-06T08:49:46Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp 20.42.192.76:443: i/o timeout"
time="2022-09-06T08:49:46Z" level=error msg="Remotedialer proxy error" error="dial tcp 20.42.192.76:443: i/o timeout"
time="2022-09-06T08:49:56Z" level=info msg="Connecting to wss://rancher.sauron.mordor.net/v3/connect with token starting with pt9mgr2wgkvq4hxvxlpsf44jl67"
time="2022-09-06T08:49:56Z" level=info msg="Connecting to proxy" url="wss://rancher.sauron.mordor.net/v3/connect"
time="2022-09-06T08:50:06Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp 20.42.192.76:443: i/o timeout"
time="2022-09-06T08:50:06Z" level=error msg="Remotedialer proxy error" error="dial tcp 20.42.192.76:443: i/o timeout"
time="2022-09-06T08:50:16Z" level=info msg="Connecting to wss://rancher.sauron.mordor.net/v3/connect with token starting with pt9mgr2wgkvq4hxvxlpsf44jl67"
time="2022-09-06T08:50:16Z" level=info msg="Connecting to proxy" url="wss://rancher.sauron.mordor.net/v3/connect"
time="2022-09-06T08:50:20Z" level=warning msg="Error while getting agent config: Get \"https://rancher.sauron.mordor.net/v3/connect/config\": dial tcp 20.42.192.76:443: i/o timeout"
time="2022-09-06T08:50:26Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp 20.42.192.76:443: i/o timeout"
time="2022-09-06T08:50:26Z" level=error msg="Remotedialer proxy error" error="dial tcp 20.42.192.76:443: i/o timeout"
EDIT
So the issue is that when a VM is part of an Availability Set and another VM in the same Availability Set is in a public Load Balancer backend pool, all VMs in that Availability Set use the LB public IP for outbound connections. The problem is that a VM which is not part of the LB backend pool then cannot reach the internet, because it cannot use the LB gateway.
So we have to attach a NAT Gateway to the subnet. That way, all VMs that are not part of the LB backend pool but are part of the Availability Set used by the LB still have internet access through the NAT Gateway.
Normally, Rancher should verify that the LB is active once it has been created (at the end of cluster creation) and only then add the nodes to it. But part of the problem here is the ordering: all nodes are created before the cluster itself, and at that moment no LB exists. First all nodes (masters and workers) are created, then Rancher starts installing components on the masters, and only at the end does it create the LB. So when the LB is created, the nodes already exist and have already started registering, installing components, etc., and the first registered nodes end up in the backend pool.
What happens is that the initial nodes manage to register because they use the default Azure gateway to access the internet (all private nodes on Azure have default outbound internet access as long as no node in the same Availability Set is inside a public LB backend pool).
But as soon as the LB is active, nodes start being added to its backend pool, and as soon as the first node from that Availability Set is added, all other nodes use the LB public IP without being added to the backend pool. Rancher only adds registered nodes to the backend pool, and to register they need internet access to get their config, install components, and so on.
This means Rancher should actually wait for the LB to be active, add the nodes to the LB backend pool FIRST, and only then start installing Docker, fetching their config from the Rancher server, etc. Otherwise, the VMs try to reach the internet using the LB public IP but cannot use the LB gateway without being in the backend pool.
So if we add a NAT Gateway on that subnet, the nodes can use the NAT Gateway until they have been added to the LB backend pool, and at that moment they will use the LB gateway instead of the NAT Gateway.
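For reference, here is a minimal Terraform sketch of that NAT Gateway workaround, assuming the azurerm resource group, virtual network, subnet, and var.rke_name_prefix defined above; the resource names and Standard SKUs are illustrative, not a verified fix from the provider docs:
## Workaround sketch: NAT Gateway so VMs keep outbound access before they join the LB backend pool
resource "azurerm_public_ip" "nat" {
  name                = "${var.rke_name_prefix}-nat-pip" # illustrative name
  location            = azurerm_resource_group.rke.location
  resource_group_name = azurerm_resource_group.rke.name
  allocation_method   = "Static"
  sku                 = "Standard" # NAT Gateway requires a Standard-SKU public IP
}
resource "azurerm_nat_gateway" "rke" {
  name                = "${var.rke_name_prefix}-natgw" # illustrative name
  location            = azurerm_resource_group.rke.location
  resource_group_name = azurerm_resource_group.rke.name
  sku_name            = "Standard"
}
# Give the NAT Gateway its outbound public IP
resource "azurerm_nat_gateway_public_ip_association" "rke" {
  nat_gateway_id       = azurerm_nat_gateway.rke.id
  public_ip_address_id = azurerm_public_ip.nat.id
}
# Attach the NAT Gateway to the RKE subnet so every VM in it gets outbound access
resource "azurerm_subnet_nat_gateway_association" "rke" {
  subnet_id      = azurerm_subnet.rke.id
  nat_gateway_id = azurerm_nat_gateway.rke.id
}
With the subnet association in place, worker VMs that are not yet in the LB backend pool should still be able to reach the Rancher server and register, after which the cloud provider adds them to the backend pool.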