hcloud-cloud-controller-manager

New node was not registered in LB targets

Open · nmuntyanov opened this issue 9 months ago · 6 comments

Hello guys! I have an issue with the hcloud cloud controller manager: a newly added node was not added to the LB targets.

I0925 08:40:41.339617       1 node_controller.go:431] Initializing node pool1-44283f2f892250c3 with cloud provider
I0925 08:40:41.339753       1 load_balancers.go:283] "update Load Balancer" op="hcloud/loadBalancers.UpdateLoadBalancer" service="ingress-nginx-controller" nodes=["pool1-1bebb963a31e0dcc","pool1-224cc2e7d3d7112","pool1-70ba3df099972c15","pool1-44283f2f892250c3"]
I0925 08:40:41.506786       1 node_controller.go:502] Successfully initialized node pool1-44283f2f892250c3 with cloud provider
I0925 08:40:41.506851       1 node_controller.go:431] Initializing node pool1-44283f2f892250c3 with cloud provider
I0925 08:40:41.509142       1 event.go:307] "Event occurred" object="pool1-44283f2f892250c3" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="Synced" message="Node synced successfully"
I0925 08:40:41.647178       1 node_controller.go:502] Successfully initialized node pool1-44283f2f892250c3 with cloud provider
I0925 08:40:41.647316       1 event.go:307] "Event occurred" object="pool1-44283f2f892250c3" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="Synced" message="Node synced successfully"
I0925 08:40:50.603418       1 route_controller.go:216] action for Node "fluix-sand-control-plane" with CIDR "10.244.0.0/24": "keep"
I0925 08:40:50.603468       1 route_controller.go:216] action for Node "pool1-1bebb963a31e0dcc" with CIDR "10.244.12.0/24": "keep"
I0925 08:40:50.603491       1 route_controller.go:216] action for Node "pool1-224cc2e7d3d7112" with CIDR "10.244.18.0/24": "keep"
I0925 08:40:50.603509       1 route_controller.go:216] action for Node "pool1-70ba3df099972c15" with CIDR "10.244.7.0/24": "keep"
I0925 08:40:50.603527       1 route_controller.go:216] action for Node "pool1-44283f2f892250c3" with CIDR "10.244.19.0/24": "add"
I0925 08:40:50.603552       1 route_controller.go:290] route spec to be created: &{ pool1-44283f2f892250c3 false [{InternalIP 10.0.4.3} {Hostname pool1-44283f2f892250c3}] 10.244.19.0/24 false}
I0925 08:40:50.603642       1 route_controller.go:304] Creating route for node pool1-44283f2f892250c3 10.244.19.0/24 with hint 0f76b027-4be6-4144-af9d-67566c6a5e3c, throttled 20µs
I0925 08:40:52.468300       1 route_controller.go:325] Created route for node pool1-44283f2f892250c3 10.244.19.0/24 with hint 0f76b027-4be6-4144-af9d-67566c6a5e3c after 1.864673235s
I0925 08:40:52.468348       1 route_controller.go:387] Patching node status pool1-44283f2f892250c3 with true previous condition was:nil

The new node (pool1-44283f2f892250c3) was successfully initialized, but the LB was not updated.

Once I restarted the controller's pod, it updated the LB and added the new node.

Controller version: v1.18.0

Controller arguments:

--allow-untagged-cloud
--cloud-provider=hcloud
--route-reconciliation-period=30s
--webhook-secure-port=0
--allocate-node-cidrs=true
--cluster-cidr=10.244.0.0/16
--leader-elect=false
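
For reference, these flags are typically set as container args on the hcloud-cloud-controller-manager Deployment. A minimal sketch under that assumption (the Deployment/namespace/image/entrypoint names below are assumptions, adjust to your actual manifest):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hcloud-cloud-controller-manager    # assumed name
  namespace: kube-system                   # assumed namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hcloud-cloud-controller-manager
  template:
    metadata:
      labels:
        app: hcloud-cloud-controller-manager
    spec:
      containers:
        - name: hcloud-cloud-controller-manager
          image: hetznercloud/hcloud-cloud-controller-manager:v1.18.0   # assumed image tag
          command:
            - /bin/hcloud-cloud-controller-manager                      # assumed entrypoint path
            - --allow-untagged-cloud
            - --cloud-provider=hcloud
            - --route-reconciliation-period=30s
            - --webhook-secure-port=0
            - --allocate-node-cidrs=true
            - --cluster-cidr=10.244.0.0/16
            - --leader-elect=false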

Nginx service annotations:

load-balancer.hetzner.cloud/hostname: {{ LB DNS }}
load-balancer.hetzner.cloud/ipv6-disabled: true 
load-balancer.hetzner.cloud/name: {{LB NAME }}
load-balancer.hetzner.cloud/protocol: tcp 
load-balancer.hetzner.cloud/use-private-ip: true
load-balancer.hetzner.cloud/uses-proxyprotocol: true
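
For context, these annotations sit on the ingress-nginx Service (ingress-nginx/ingress-nginx-controller in the logs). A minimal sketch of how the rendered manifest might look, assuming the templated {{ LB DNS }} / {{ LB NAME }} placeholders resolve to concrete values and remembering that annotation values must be strings, so booleans are quoted (selector and ports below are assumptions):

apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
  annotations:
    load-balancer.hetzner.cloud/hostname: "lb.example.com"   # placeholder for {{ LB DNS }}
    load-balancer.hetzner.cloud/ipv6-disabled: "true"
    load-balancer.hetzner.cloud/name: "my-lb"                # placeholder for {{ LB NAME }}
    load-balancer.hetzner.cloud/protocol: "tcp"
    load-balancer.hetzner.cloud/use-private-ip: "true"
    load-balancer.hetzner.cloud/uses-proxyprotocol: "true"
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx                    # assumed selector
  ports:
    - name: http
      port: 80
      targetPort: 80
    - name: https
      port: 443
      targetPort: 443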

nmuntyanov · Sep 25 '23 09:09

Version v1.17.2 works well.

nmuntyanov · Sep 25 '23 10:09

Hey @nmuntyanov,

Adding Nodes to the Load Balancer happens asynchronously from the initialization. By default it adds any Nodes that are initialized and marked as Ready. Maybe your Node was not marked as Ready yet, for example because the CNI was not initialized?

The logs do not contain anything about the service controller, so the cause is hard to figure out. Can you reproduce the issue? If so, it would be great if you could run the CCM with the flag --v=4 to get some more logs from the controller.
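
One way to add that flag, sketched as a strategic merge patch against the assumed Deployment layout from above (file name and container name are assumptions; the command list is replaced wholesale, so all existing flags are repeated):

# ccm-verbose-patch.yaml (hypothetical file name)
spec:
  template:
    spec:
      containers:
        - name: hcloud-cloud-controller-manager   # must match the existing container name
          command:
            - /bin/hcloud-cloud-controller-manager
            - --allow-untagged-cloud
            - --cloud-provider=hcloud
            - --route-reconciliation-period=30s
            - --webhook-secure-port=0
            - --allocate-node-cidrs=true
            - --cluster-cidr=10.244.0.0/16
            - --leader-elect=false
            - --v=4                                # extra verbosity requested above

Applied with something like: kubectl -n kube-system patch deployment hcloud-cloud-controller-manager --patch-file ccm-verbose-patch.yaml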

apricote · Oct 04 '23 07:10

@apricote

I1017 16:46:26.870848       1 reflector.go:378] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: forcing resync
I1017 16:46:27.151016       1 controller.go:716] Syncing backends for all LB services.
I1017 16:46:27.151047       1 node_controller.go:431] Initializing node pool1-7e9ea74aec95871f with cloud provider
I1017 16:46:27.151055       1 controller.go:755] Running updateLoadBalancerHosts(len(services)==1, workers==1)
I1017 16:46:27.151141       1 controller.go:741] nodeSyncService started for service ingress-nginx/ingress-nginx-controller
I1017 16:46:27.151155       1 controller.go:795] Updating backends for load balancer ingress-nginx/ingress-nginx-controller with node set: map[pool1-187c0d729458c3b4:{} pool1-1b3b8f8fc14e8cc:{} pool1-2959c973ec8cab12:{} pool1-545a7f4c14e9eb04:{} pool1-5a3db78e45a2cbcf:{} pool1-76770d9e480a80cb:{} pool1-7a6cd031d6849d91:{} pool1-7e9ea74aec95871f:{}]
I1017 16:46:27.151254       1 load_balancers.go:283] "update Load Balancer" op="hcloud/loadBalancers.UpdateLoadBalancer" service="ingress-nginx-controller" nodes=["pool1-1b3b8f8fc14e8cc","pool1-545a7f4c14e9eb04","pool1-7a6cd031d6849d91","pool1-187c0d729458c3b4","pool1-2959c973ec8cab12","pool1-5a3db78e45a2cbcf","pool1-76770d9e480a80cb","pool1-7e9ea74aec95871f"]
I1017 16:46:27.614234       1 route_controller.go:216] action for Node "pool1-187c0d729458c3b4" with CIDR "10.244.41.0/24": "keep"
I1017 16:46:27.614277       1 route_controller.go:216] action for Node "pool1-2959c973ec8cab12" with CIDR "10.244.39.0/24": "keep"
I1017 16:46:27.614293       1 route_controller.go:216] action for Node "pool1-5a3db78e45a2cbcf" with CIDR "10.244.36.0/24": "keep"
I1017 16:46:27.614306       1 route_controller.go:216] action for Node "pool1-76770d9e480a80cb" with CIDR "10.244.38.0/24": "keep"
I1017 16:46:27.614319       1 route_controller.go:216] action for Node "pool1-7e9ea74aec95871f" with CIDR "10.244.43.0/24": "add"
I1017 16:46:27.614331       1 route_controller.go:216] action for Node "control-plane" with CIDR "10.244.0.0/24": "keep"
I1017 16:46:27.614349       1 route_controller.go:216] action for Node "pool1-1b3b8f8fc14e8cc" with CIDR "10.244.40.0/24": "keep"
I1017 16:46:27.614363       1 route_controller.go:216] action for Node "pool1-545a7f4c14e9eb04" with CIDR "10.244.37.0/24": "keep"
I1017 16:46:27.614382       1 route_controller.go:216] action for Node "pool1-7a6cd031d6849d91" with CIDR "10.244.42.0/24": "keep"
I1017 16:46:27.614404       1 route_controller.go:290] route spec to be created: &{ pool1-7e9ea74aec95871f false [{InternalIP 10.0.8.11} {Hostname pool1-7e9ea74aec95871f}] 10.244.43.0/24 false}
I1017 16:46:27.614488       1 route_controller.go:304] Creating route for node pool1-7e9ea74aec95871f 10.244.43.0/24 with hint fad9a939-eb7c-444c-87be-19b6b2c4db4b, throttled 15.8µs
I1017 16:46:27.954789       1 node_controller.go:543] Adding node label from cloud provider: beta.kubernetes.io/instance-type=cax31
I1017 16:46:27.954817       1 node_controller.go:544] Adding node label from cloud provider: node.kubernetes.io/instance-type=cax31
I1017 16:46:27.954825       1 node_controller.go:555] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/zone=fsn1-dc14
I1017 16:46:27.954832       1 node_controller.go:556] Adding node label from cloud provider: topology.kubernetes.io/zone=fsn1-dc14
I1017 16:46:27.954839       1 node_controller.go:566] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/region=fsn1
I1017 16:46:27.954846       1 node_controller.go:567] Adding node label from cloud provider: topology.kubernetes.io/region=fsn1
I1017 16:46:27.964645       1 node_controller.go:502] Successfully initialized node pool1-7e9ea74aec95871f with cloud provider
I1017 16:46:27.964711       1 node_controller.go:431] Initializing node pool1-7e9ea74aec95871f with cloud provider
I1017 16:46:27.964889       1 event.go:307] "Event occurred" object="pool1-7e9ea74aec95871f" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="Synced" message="Node synced successfully"
I1017 16:46:28.140425       1 node_controller.go:543] Adding node label from cloud provider: beta.kubernetes.io/instance-type=cax31
I1017 16:46:28.140452       1 node_controller.go:544] Adding node label from cloud provider: node.kubernetes.io/instance-type=cax31
I1017 16:46:28.140458       1 node_controller.go:555] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/zone=fsn1-dc14
I1017 16:46:28.140463       1 node_controller.go:556] Adding node label from cloud provider: topology.kubernetes.io/zone=fsn1-dc14
I1017 16:46:28.140469       1 node_controller.go:566] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/region=fsn1
I1017 16:46:28.140475       1 node_controller.go:567] Adding node label from cloud provider: topology.kubernetes.io/region=fsn1
I1017 16:46:28.147629       1 node_controller.go:502] Successfully initialized node pool1-7e9ea74aec95871f with cloud provider
I1017 16:46:28.147768       1 event.go:307] "Event occurred" object="pool1-7e9ea74aec95871f" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="Synced" message="Node synced successfully"
I1017 16:46:28.970406       1 route_controller.go:325] Created route for node pool1-7e9ea74aec95871f 10.244.43.0/24 with hint fad9a939-eb7c-444c-87be-19b6b2c4db4b after 1.355916554s
I1017 16:46:28.970490       1 route_controller.go:378] set node pool1-7a6cd031d6849d91 with NodeNetworkUnavailable=false was canceled because it is already set
I1017 16:46:28.970501       1 route_controller.go:378] set node pool1-187c0d729458c3b4 with NodeNetworkUnavailable=false was canceled because it is already set
I1017 16:46:28.970521       1 route_controller.go:378] set node pool1-5a3db78e45a2cbcf with NodeNetworkUnavailable=false was canceled because it is already set
I1017 16:46:28.970507       1 route_controller.go:378] set node pool1-2959c973ec8cab12 with NodeNetworkUnavailable=false was canceled because it is already set
I1017 16:46:28.970537       1 route_controller.go:378] set node pool1-76770d9e480a80cb with NodeNetworkUnavailable=false was canceled because it is already set
I1017 16:46:28.970547       1 route_controller.go:378] set node control-plane with NodeNetworkUnavailable=false was canceled because it is already set
I1017 16:46:28.970550       1 route_controller.go:378] set node pool1-1b3b8f8fc14e8cc with NodeNetworkUnavailable=false was canceled because it is already set
I1017 16:46:28.970564       1 route_controller.go:378] set node pool1-545a7f4c14e9eb04 with NodeNetworkUnavailable=false was canceled because it is already set
I1017 16:46:28.970573       1 route_controller.go:387] Patching node status pool1-7e9ea74aec95871f with true previous condition was:nil
I1017 16:46:30.195644       1 controller.go:791] It took 3.044484004 seconds to update load balancer hosts for service ingress-nginx/ingress-nginx-controller
I1017 16:46:30.195671       1 controller.go:747] nodeSyncService finished successfully for service ingress-nginx/ingress-nginx-controller
I1017 16:46:30.195686       1 controller.go:780] Finished updateLoadBalancerHosts
I1017 16:46:30.195692       1 controller.go:720] Successfully updated 1 out of 1 load balancers to direct traffic to the updated set of nodes
I1017 16:46:30.195696       1 controller.go:712] It took 3.044698765 seconds to finish syncNodes
I1017 16:46:30.195709       1 controller.go:716] Syncing backends for all LB services.
I1017 16:46:30.195713       1 controller.go:755] Running updateLoadBalancerHosts(len(services)==1, workers==1)
I1017 16:46:30.195759       1 controller.go:780] Finished updateLoadBalancerHosts
I1017 16:46:30.195765       1 controller.go:720] Successfully updated 1 out of 1 load balancers to direct traffic to the updated set of nodes
I1017 16:46:30.195767       1 controller.go:712] It took 5.852e-05 seconds to finish syncNodes

It says "Successfully updated 1 out of 1 load balancers..." but actually is not

nmuntyanov · Oct 17 '23 16:10

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.

github-actions[bot] · Jan 16 '24 12:01

You did not mention which Node you expected to be added in the latest logs, so I will assume pool1-7e9ea74aec95871f, as that is the one being initialized there.

As we can see in the following line, the Node is passed to our method and has passed all of our filters. After that, nothing should stop the Node from being added as a target to the Load Balancer, except if it is already assigned, but in that case we should see an error.

I1017 16:46:27.151254       1 load_balancers.go:283] "update Load Balancer" op="hcloud/loadBalancers.UpdateLoadBalancer" service="ingress-nginx-controller" nodes=["pool1-1b3b8f8fc14e8cc","pool1-545a7f4c14e9eb04","pool1-7a6cd031d6849d91","pool1-187c0d729458c3b4","pool1-2959c973ec8cab12","pool1-5a3db78e45a2cbcf","pool1-76770d9e480a80cb","pool1-7e9ea74aec95871f"]

If this is still happening, could you add the environment variable HCLOUD_DEBUG=true to the HCCM container and send the resulting logs to julian.toelle <at> hetzner-cloud.de? This will include any calls made to the Hetzner Cloud API (with your token redacted).
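
A minimal sketch of setting that variable, again as a strategic merge patch against the assumed Deployment layout from the earlier sketch (file and container names are assumptions; env entries merge by name, so existing variables such as the token env stay in place):

# ccm-debug-patch.yaml (hypothetical file name)
spec:
  template:
    spec:
      containers:
        - name: hcloud-cloud-controller-manager   # must match the existing container name
          env:
            - name: HCLOUD_DEBUG
              value: "true"                       # logs Hetzner Cloud API calls, token redacted

Applied with something like: kubectl -n kube-system patch deployment hcloud-cloud-controller-manager --patch-file ccm-debug-patch.yaml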

apricote · Jan 25 '24 10:01

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.

github-actions[bot] · Apr 24 '24 12:04