aws-node-termination-handler
Node on "degraded hardware" not evacuated & terminated ASAP
I took advantage of the rare occurrence of a scheduled event on a node in our test EKS cluster to update to the latest NTH (1.12.0 via the latest helm chart), still running in IMDS mode. Unfortunately it didn't start draining, despite the notice below, which I was able to capture manually:
$ curl http://169.254.169.254/latest/meta-data/events/maintenance/scheduled
[ {
"NotBefore" : "25 Feb 2021 02:00:00 GMT",
"Code" : "instance-stop",
"Description" : "The instance is running on degraded hardware",
"EventId" : "instance-event-087e363178449a61e",
"State" : "active"
} ]
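For reference, the daemonset logs below show NTH getting a token from IMDSv2; the same document can also be captured that way. A minimal Go sketch of the token + request flow (my own illustration, not NTH's client code):

package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// IMDSv2: first PUT for a session token...
	tokenReq, err := http.NewRequest(http.MethodPut,
		"http://169.254.169.254/latest/api/token", nil)
	if err != nil {
		panic(err)
	}
	tokenReq.Header.Set("X-aws-ec2-metadata-token-ttl-seconds", "21600")
	tokenResp, err := http.DefaultClient.Do(tokenReq)
	if err != nil {
		panic(err)
	}
	token, _ := io.ReadAll(tokenResp.Body)
	tokenResp.Body.Close()

	// ...then GET the scheduled events document with the token attached.
	eventsReq, err := http.NewRequest(http.MethodGet,
		"http://169.254.169.254/latest/meta-data/events/maintenance/scheduled", nil)
	if err != nil {
		panic(err)
	}
	eventsReq.Header.Set("X-aws-ec2-metadata-token", string(token))
	eventsResp, err := http.DefaultClient.Do(eventsReq)
	if err != nil {
		panic(err)
	}
	defer eventsResp.Body.Close()

	body, _ := io.ReadAll(eventsResp.Body)
	fmt.Println(string(body))
}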
helm settings (set via helmfile, but I suspect you get the idea):
repositories:
  - name: aws-eks-charts
    url: https://aws.github.io/eks-charts

releases:
  - name: aws-node-termination-handler
    namespace: kube-system
    chart: aws-eks-charts/aws-node-termination-handler
    version: 0.13.3
    values:
      - enableScheduledEventDraining: true
        enableSpotInterruptionDraining: true
        enableRebalanceMonitoring: false
        taintNode: true
        webhookURL: <redacted>
        enablePrometheusServer: true
        nodeSelector:
          lifecycle: spot
        podMonitor:
          create: true
        jsonLogging: true
        logLevel: DEBUG
daemonset pod logs:
{"time":"2021-02-11T09:35:48Z","message":"Trying to get token from IMDSv2"}
{"level":"info","time":"2021-02-11T09:35:48Z","message":"Starting to serve handler /metrics, port 9092"}
{"time":"2021-02-11T09:35:48Z","message":"Got token from IMDSv2"}
{"metadata":{"accountId":"XXXXXXXX","instanceId":"i-0415e8c43a18dccc7","instanceType":"t2.medium","publicHostname":"ec2-54-229-104-67.eu-west-1.compute.amazonaws.com","publicIp":"54.229.104.67","localHostname":"ip-172-31-13-215.empire.lyst","privateIp":"172.31.13.215","availabilityZone":"eu-west-1b","region":"eu-west-1"},"time":"2021-02-11T09:35:48Z","message":"Startup Metadata Retrieved"}
{"dry_run":false,"node_name":"ip-172-31-13-215.eu-west-1.compute.internal","metadata_url":"http://169.254.169.254","kubernetes_service_host":"10.100.0.1","kubernetes_service_port":"443","delete_local_data":true,"ignore_daemon_sets":true,"pod_termination_grace_period":-1,"node_termination_grace_period":120,"enable_scheduled_event_draining":true,"enable_spot_interruption_draining":true,"enable_sqs_termination_draining":false,"enable_rebalance_monitoring":false,"metadata_tries":3,"cordon_only":false,"taint_node":true,"json_logging":true,"log_level":"DEBUG","webhook_proxy":"","uptime_from_file":"","enable_prometheus_server":true,"prometheus_server_port":9092,"aws_region":"eu-west-1","aws_endpoint":"","queue_url":"","check_asg_tag_before_draining":true,"ManagedAsgTag":"aws-node-termination-handler/managed","time":"2021-02-11T09:35:48Z","message":"aws-node-termination-handler arguments"}
{"level":"warn","time":"2021-02-11T09:35:48Z","message":"Error when trying to list Nodes w/ label, falling back to direct Get lookup of node"}
{"level":"warn","time":"2021-02-11T09:35:50Z","message":"Error when trying to list Nodes w/ label, falling back to direct Get lookup of node"}
{"level":"warn","time":"2021-02-11T09:35:52Z","message":"Error when trying to list Nodes w/ label, falling back to direct Get lookup of node"}
{"level":"warn","time":"2021-02-11T09:35:54Z","message":"Error when trying to list Nodes w/ label, falling back to direct Get lookup of node"}
{"time":"2021-02-11T09:35:56Z","message":"Started watching for interruption events"}
{"time":"2021-02-11T09:35:56Z","message":"Kubernetes AWS Node Termination Handler has started successfully!"}
{"time":"2021-02-11T09:35:56Z","message":"Started watching for event cancellations"}
{"event_type":"SCHEDULED_EVENT","time":"2021-02-11T09:35:56Z","message":"Started monitoring for events"}
{"event_type":"SPOT_ITN","time":"2021-02-11T09:35:56Z","message":"Started monitoring for events"}
{"error":"parsing time \"\" as \"2 Jan 2006 15:04:05 GMT\": cannot parse \"\" as \"2\"","time":"2021-02-11T09:35:58Z","message":"Unable to parse scheduled event end time, continuing"}
{"event":{"EventID":"instance-event-087e363178449a61e","Kind":"SCHEDULED_EVENT","Description":"instance-stop will occur between 25 Feb 2021 02:00:00 GMT and because The instance is running on degraded hardware\n","State":"active","NodeName":"ip-172-31-13-215.eu-west-1.compute.internal","InstanceID":"","StartTime":"2021-02-25T02:00:00Z","EndTime":"2021-02-25T02:00:00Z","Drained":false,"InProgress":false},"time":"2021-02-11T09:35:58Z","message":"Adding new event to the event store"}
{"error":"parsing time \"\" as \"2 Jan 2006 15:04:05 GMT\": cannot parse \"\" as \"2\"","time":"2021-02-11T09:36:00Z","message":"Unable to parse scheduled event end time, continuing"}
{"error":"parsing time \"\" as \"2 Jan 2006 15:04:05 GMT\": cannot parse \"\" as \"2\"","time":"2021-02-11T09:36:02Z","message":"Unable to parse scheduled event end time, continuing"}
{"error":"parsing time \"\" as \"2 Jan 2006 15:04:05 GMT\": cannot parse \"\" as \"2\"","time":"2021-02-11T09:36:04Z","message":"Unable to parse scheduled event end time, continuing"}
... <repeats forever>
I haven't read through the code enough yet, but I'm logging the above before I lose it, as I'm confused so far: the repeated parsing error seems to come only from decoding the raw JSON response, which looks correctly formatted as far as I can see. Plus that one EventID line appears to have parsed it correctly, so what is the parsing routine still complaining about, and why has it not done anything yet?
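For reference, that exact error text looks like what Go's time.Parse returns when handed an empty string against that layout; a minimal reproduction (standalone, not NTH code):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Parsing an empty value against the layout quoted in the logs fails on
	// the first layout element ("2", the day of the month).
	_, err := time.Parse("2 Jan 2006 15:04:05 GMT", "")
	fmt.Println(err)
	// Output: parsing time "" as "2 Jan 2006 15:04:05 GMT": cannot parse "" as "2"
}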
Also "DEBUG" loglevel doesn't seem to have provide any extra information, which is a little unfortunate. I had originally deployed it with default "info" but from memory it looked similar to the above...
Having read through the code a bit more, is it actually the intended behaviour to do nothing until the "NotBefore" time? Our SOP is to drain such nodes at the earliest convenience and replace them - it's a spot node, we expect interruptions. The AWS email warning seems to suggest this is the correct action:
Due to this degradation your instance could already be unreachable. We will stop your instance after 2021-02-25 02:00:00 UTC. Please take appropriate action before this time.
At the very least, the repeated parsing error messages (however they are being generated) are a distraction that should be fixed. FWIW I've turned off taintNode and the webhookURL without any effect, just in case they were silently erroring.
Yes, this is the intended behavior - we wait until the start time minus the grace period to actually take action, so typically we'll start the drain 2 minutes before the start time. I still think this makes sense for most scheduled events, but for degraded hardware it certainly seems better to drain immediately.
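In rough terms the check amounts to the following (a simplified sketch, not the actual monitor code; the names and the 2-minute constant are illustrative):

package sketch

import "time"

// drainBeforeStart approximates the ~2 minute lead mentioned above; the name
// and value are illustrative, not NTH's actual constant.
const drainBeforeStart = 2 * time.Minute

// shouldStartDrain reports whether a drain should begin now for an event
// whose window opens at eventStart.
func shouldStartDrain(eventStart, now time.Time) bool {
	return !now.Before(eventStart.Add(-drainBeforeStart))
}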
I'm surprised by the repeated logging for the end time error, I thought we were deduping events. I'll take another look at that.
I'm also wondering if we even need to parse the NotAfter time? It's not used for any functionality AFAIK.
Thanks, I've updated the title to better match the actual issue, as I would certainly like an option (if not the default/only behaviour) to begin draining and terminating as soon as the notice is processed. The current scheduled event functionality doesn't match this use case - TBH I'm not really sure of any other event that I would want to wait for, given the option exists to move workloads to another node gracefully.
As to the noisy error: given the JSON document at the top of this post, there is no NotAfter element to process at https://github.com/aws/aws-node-termination-handler/blob/23b97f40987f52a28d533bd7bba7a391d46e7e3b/pkg/monitor/scheduledevent/scheduled-event-monitor.go#L94-L98 so I guess it will always error. I don't know how your event deduping works, but I guess it requires an event to be parsed before it can be deduped against known events.
If you want, I could open a tiny PR to set notAfter = notBefore without logging an error when it is blank (i.e. no change in functionality, just quieter), but that's a minor point compared to waiting two weeks before evacuating dodgy hardware. Unfortunately I don't think I can re-engineer your whole processing of scheduled events.
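For illustration only, the change I have in mind looks roughly like this (the layout string is copied from the log; the helper name and shape are my own, not the real monitor code):

package sketch

import "time"

// Layout quoted in the daemonset logs above.
const scheduledEventDateFormat = "2 Jan 2006 15:04:05 GMT"

// eventWindow is a hypothetical helper: always parse NotBefore, but only
// attempt NotAfter when IMDS actually returned one; otherwise fall back to
// NotBefore (the current effective behaviour) without logging a parse error
// on every poll.
func eventWindow(notBeforeRaw, notAfterRaw string) (notBefore, notAfter time.Time, err error) {
	notBefore, err = time.Parse(scheduledEventDateFormat, notBeforeRaw)
	if err != nil {
		return time.Time{}, time.Time{}, err
	}
	notAfter = notBefore
	if notAfterRaw != "" {
		if parsed, perr := time.Parse(scheduledEventDateFormat, notAfterRaw); perr == nil {
			notAfter = parsed
		}
	}
	return notBefore, notAfter, nil
}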
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.
IMDS mode currently waits until the start of the scheduled event window before draining the node:
https://github.com/aws/aws-node-termination-handler/blob/814bccc507967f582867e065d4be916ab74fa5c0/pkg/monitor/scheduledevent/scheduled-event-monitor.go#L108
That is arguably not the correct behavior for degraded hardware. Instead, we should drain the node as soon as we receive notification that the hardware will be reclaimed.
Queue Processor mode already works this way:
https://github.com/aws/aws-node-termination-handler/blob/814bccc507967f582867e065d4be916ab74fa5c0/pkg/monitor/sqsevent/scheduled-change-event.go#L100
We should change IMDS mode to function the same way as QP mode and start the drain immediately, which will solve this issue.
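Roughly speaking, the IMDS monitor would construct its event the way QP mode does, with the drain trigger set to the time the notice is observed rather than the NotBefore window. A sketch of the intent, using a stand-in struct whose fields mirror the "Adding new event to the event store" log line above (this is not the real monitor code):

package sketch

import "time"

// interruptionEvent stands in for NTH's real event type; the fields mirror
// those visible in the daemonset log line earlier in this issue.
type interruptionEvent struct {
	EventID     string
	Kind        string
	Description string
	State       string
	StartTime   time.Time
	EndTime     time.Time
}

// buildScheduledEvent shows the proposed IMDS-mode behaviour: start the drain
// as soon as the notice is seen (as Queue Processor mode already does),
// instead of waiting for the event's NotBefore time.
func buildScheduledEvent(eventID, description string, notAfter time.Time) interruptionEvent {
	return interruptionEvent{
		EventID:     eventID,
		Kind:        "SCHEDULED_EVENT",
		Description: description,
		State:       "active",
		StartTime:   time.Now(), // drain immediately, not at NotBefore
		EndTime:     notAfter,
	}
}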