aws-node-termination-handler All retries failed, unable to complete the uncordon after reboot workflow error

Open sushantsoni5392 opened this issue 3 years ago • 0 comments

Describe the bug

Hi,

Right after NTH starts, we frequently see errors like the following in the logs:

2022/09/08 08:18:46 ERR Error when trying to list Nodes w/ label, falling back to direct Get lookup of node error="Get \"https://172.20.0.1:443/api/v1/nodes?labelSelector=kubernetes.io%2Fhostname%3D%3Dip-10-45-5-107.eu-central-1.compute.internal\": dial tcp 172.20.0.1:443: i/o timeout"
2022/09/08 08:18:46 WRN All retries failed, unable to complete the uncordon after reboot workflow error="timed out waiting for the condition"

I wanted to understand if this error affects anything.

Steps to reproduce

Expected outcome

No errors

Application Logs

2022/09/08 08:18:14 INF aws-node-termination-handler arguments:
	dry-run: false,
	node-name: ip-10-45-5-137.eu-central-1.compute.internal,
	pod-name: aws-node-termination-handler-866sr,
	metadata-url: http://169.254.169.254,
	kubernetes-service-host: 172.20.0.1,
	kubernetes-service-port: 443,
	delete-local-data: true,
	ignore-daemon-sets: true,
	pod-termination-grace-period: -1,
	node-termination-grace-period: 120,
	enable-scheduled-event-draining: true,
	enable-spot-interruption-draining: true,
	enable-sqs-termination-draining: false,
	enable-rebalance-monitoring: true,
	enable-rebalance-draining: false,
	metadata-tries: 3,
	cordon-only: false,
	taint-node: true,
	taint-effect: NoSchedule,
	exclude-from-load-balancers: false,
	json-logging: false,
	log-level: info,
	webhook-proxy: ,
	webhook-headers: <not-displayed>,
	webhook-url: ,
	webhook-template: <not-displayed>,
	uptime-from-file: /proc/uptime,
	enable-prometheus-server: false,
	prometheus-server-port: 9092,
	emit-kubernetes-events: false,
	kubernetes-events-extra-annotations: ,
	aws-region: eu-central-1,
	queue-url: ,
	check-asg-tag-before-draining: true,
	managed-asg-tag: aws-node-termination-handler/managed,
	assume-asg-tag-propagation: false,
	aws-endpoint: ,

2022/09/08 08:18:44 ERR Error when trying to list Nodes w/ label, falling back to direct Get lookup of node error="Get \"https://172.20.0.1:443/api/v1/nodes?labelSelector=kubernetes.io%2Fhostname%3D%3Dip-10-45-5-137.eu-central-1.compute.internal\": dial tcp 172.20.0.1:443: i/o timeout"
2022/09/08 08:18:44 WRN All retries failed, unable to complete the uncordon after reboot workflow error="timed out waiting for the condition"
2022/09/08 08:18:44 INF Started watching for interruption events
2022/09/08 08:18:44 INF Kubernetes AWS Node Termination Handler has started successfully!
2022/09/08 08:18:44 INF Started watching for event cancellations
2022/09/08 08:18:44 INF Started monitoring for events event_type=SCHEDULED_EVENT
2022/09/08 08:18:44 INF Started monitoring for events event_type=SPOT_ITN
2022/09/08 08:18:44 INF Started monitoring for events event_type=REBALANCE_RECOMMENDATION
2022/09/08 08:48:44 INF event store statistics drainable-events=0 size=0
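The error in the log above is a TCP dial timeout to the Kubernetes service address (172.20.0.1:443), logged 30 seconds after the arguments were printed. One way to check whether that address is reachable from a pod on the same node is a small retrying TCP probe. The following is only a minimal diagnostic sketch, not NTH's own code: the function name `probe_tcp` and its parameters are hypothetical, and it assumes the pod has Python available.

```python
import socket
import time


def probe_tcp(host, port, attempts=3, timeout=5.0, delay=1.0,
              connect=socket.create_connection, sleep=time.sleep):
    """Try to open a TCP connection up to `attempts` times.

    Returns the 1-based number of the attempt that connected,
    or None if every attempt failed. `connect` and `sleep` are
    injectable so the function can be exercised without a network.
    """
    for attempt in range(1, attempts + 1):
        try:
            conn = connect((host, port), timeout=timeout)
            conn.close()
            return attempt
        except OSError:
            # Covers timeouts and refused/unreachable errors alike.
            if attempt < attempts:
                sleep(delay)
    return None


# Example call (host and port taken from the log output above):
#   probe_tcp("172.20.0.1", 443)
```

If the probe only succeeds on a later attempt, or only some time after the pod has started, that would suggest the timeout is tied to pod networking not being ready yet. That is an assumption to verify, not something the logs above prove.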

Environment

NTH App Version: 1.16.0
NTH Mode (IMDS/Queue processor): IMDS
OS/Arch: Linux
Kubernetes version: 1.21
Installation method: helm

Sep 08 '22 09:09 sushantsoni5392
