aws-node-termination-handler
All retries failed, unable to complete the uncordon after reboot workflow error
Describe the bug
Hi,
Right after NTH starts, we frequently see errors like the following in the logs:
2022/09/08 08:18:46 ERR Error when trying to list Nodes w/ label, falling back to direct Get lookup of node error="Get \"https://172.20.0.1:443/api/v1/nodes?labelSelector=kubernetes.io%2Fhostname%3D%3Dip-10-45-5-107.eu-central-1.compute.internal\": dial tcp 172.20.0.1:443: i/o timeout"
2022/09/08 08:18:46 WRN All retries failed, unable to complete the uncordon after reboot workflow error="timed out waiting for the condition"
I would like to understand whether this error affects anything.
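The `dial tcp 172.20.0.1:443: i/o timeout` part of the error means the TCP connection to the Kubernetes service address timed out. To check whether that address is reachable from a pod shortly after it starts, a minimal sketch along these lines might help. The function name `probe_api_server` and its parameters are made up for illustration and are not part of NTH; it only tests TCP reachability and does not authenticate or talk to the API.

```python
import socket
import time


def probe_api_server(host, port, attempts=3, timeout=5.0, delay=1.0):
    """Try a plain TCP connect to host:port.

    Returns the number of the first attempt that succeeded, or None if
    every attempt failed. Hypothetical helper, not part of NTH.
    """
    for attempt in range(1, attempts + 1):
        try:
            # create_connection raises OSError (including timeouts) on failure.
            with socket.create_connection((host, port), timeout=timeout):
                return attempt
        except OSError as exc:
            print(f"attempt {attempt}/{attempts} failed: {exc}")
            if attempt < attempts:
                time.sleep(delay)
    return None
```

Calling `probe_api_server("172.20.0.1", 443)` uses the `kubernetes-service-host` and `kubernetes-service-port` values from the log. If the first attempt fails and a later one succeeds, the network path was probably just not ready yet when the pod started.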
Steps to reproduce
Expected outcome
No errors
Application Logs
2022/09/08 08:18:14 INF aws-node-termination-handler arguments:
dry-run: false,
node-name: ip-10-45-5-137.eu-central-1.compute.internal,
pod-name: aws-node-termination-handler-866sr,
metadata-url: http://169.254.169.254,
kubernetes-service-host: 172.20.0.1,
kubernetes-service-port: 443,
delete-local-data: true,
ignore-daemon-sets: true,
pod-termination-grace-period: -1,
node-termination-grace-period: 120,
enable-scheduled-event-draining: true,
enable-spot-interruption-draining: true,
enable-sqs-termination-draining: false,
enable-rebalance-monitoring: true,
enable-rebalance-draining: false,
metadata-tries: 3,
cordon-only: false,
taint-node: true,
taint-effect: NoSchedule,
exclude-from-load-balancers: false,
json-logging: false,
log-level: info,
webhook-proxy: ,
webhook-headers: <not-displayed>,
webhook-url: ,
webhook-template: <not-displayed>,
uptime-from-file: /proc/uptime,
enable-prometheus-server: false,
prometheus-server-port: 9092,
emit-kubernetes-events: false,
kubernetes-events-extra-annotations: ,
aws-region: eu-central-1,
queue-url: ,
check-asg-tag-before-draining: true,
managed-asg-tag: aws-node-termination-handler/managed,
assume-asg-tag-propagation: false,
aws-endpoint: ,
2022/09/08 08:18:44 ERR Error when trying to list Nodes w/ label, falling back to direct Get lookup of node error="Get \"https://172.20.0.1:443/api/v1/nodes?labelSelector=kubernetes.io%2Fhostname%3D%3Dip-10-45-5-137.eu-central-1.compute.internal\": dial tcp 172.20.0.1:443: i/o timeout"
2022/09/08 08:18:44 WRN All retries failed, unable to complete the uncordon after reboot workflow error="timed out waiting for the condition"
2022/09/08 08:18:44 INF Started watching for interruption events
2022/09/08 08:18:44 INF Kubernetes AWS Node Termination Handler has started successfully!
2022/09/08 08:18:44 INF Started watching for event cancellations
2022/09/08 08:18:44 INF Started monitoring for events event_type=SCHEDULED_EVENT
2022/09/08 08:18:44 INF Started monitoring for events event_type=SPOT_ITN
2022/09/08 08:18:44 INF Started monitoring for events event_type=REBALANCE_RECOMMENDATION
2022/09/08 08:48:44 INF event store statistics drainable-events=0 size=0
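In the log above, the arguments line is stamped 08:18:14 and the first ERR line 08:18:44, so the handler spent 30 seconds before reporting the timeout and then started normally. To measure the same gap across several pods, a small sketch like the one below could be run over saved logs. It assumes the line format shown above (`YYYY/MM/DD HH:MM:SS LEVEL message`) and only matches the three levels that appear in this log. The names `parse_line` and `seconds_to_first_problem` are made up for illustration.

```python
import re
from datetime import datetime

# Matches lines such as "2022/09/08 08:18:44 ERR Error when trying ..."
LINE_RE = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (INF|WRN|ERR) (.*)$")


def parse_line(line):
    """Return (timestamp, level, message), or None for lines without a timestamp."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
    return stamp, match.group(2), match.group(3)


def seconds_to_first_problem(lines):
    """Seconds between the first timestamped line and the first WRN or ERR line.

    Returns None if the log has no WRN or ERR line.
    """
    start = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # e.g. the "dry-run: false," argument lines
        stamp, level, _message = parsed
        if start is None:
            start = stamp
        if level in ("WRN", "ERR"):
            return (stamp - start).total_seconds()
    return None
```

For example, `seconds_to_first_problem(open("nth.log"))` would process a log saved to a file named `nth.log` (a placeholder name).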
Environment
- NTH App Version: 1.16.0
- NTH Mode (IMDS/Queue processor): IMDS
- OS/Arch: Linux
- Kubernetes version: 1.21
- Installation method: helm