Mistake in validation of Node Termination Handler
/kind bug
1. What kops version are you running? The command kops version will display this information.
1.28
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.
1.28
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
kops replace --force -f /path/to/kops.yaml (run with kops 1.28.4)
5. What happened after the commands executed?
Error: error replacing cluster: spec.cloudProvider.aws.nodeTerminationHandler.enableScheduledEventDraining: Forbidden: scheduled event draining cannot be disabled in Queue Processor mode
6. What did you expect to happen?
I would expect to be able to have enableScheduledEventDraining disabled in the config while in SQS mode. The kops validation runs the following code, which is problematic:
func validateNodeTerminationHandler(cluster *kops.Cluster, spec *kops.NodeTerminationHandlerSpec, fldPath *field.Path) (allErrs field.ErrorList) {
    if spec.IsQueueMode() {
        if spec.EnableSpotInterruptionDraining != nil && !*spec.EnableSpotInterruptionDraining {
            allErrs = append(allErrs, field.Forbidden(fldPath.Child("enableSpotInterruptionDraining"), "spot interruption draining cannot be disabled in Queue Processor mode"))
        }
        if spec.EnableScheduledEventDraining != nil && !*spec.EnableScheduledEventDraining {
            allErrs = append(allErrs, field.Forbidden(fldPath.Child("enableScheduledEventDraining"), "scheduled event draining cannot be disabled in Queue Processor mode"))
        }
        if !fi.ValueOf(spec.EnableRebalanceDraining) && fi.ValueOf(spec.EnableRebalanceMonitoring) {
            allErrs = append(allErrs, field.Forbidden(fldPath.Child("enableRebalanceMonitoring"), "rebalance events can only drain in Queue Processor mode"))
        }
    }
    return allErrs
}
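To make this concrete, here is a small self-contained sketch (the spec type and helpers below are simplified stand-ins, not the real kops types or validation path). It reproduces the check quoted above and shows the relaxation being asked for: in Queue Processor mode, an explicit false for this IMDS-only flag would simply be ignored.

package main

import "fmt"

// Simplified stand-in for the relevant fields of kops' NodeTerminationHandlerSpec.
type nthSpec struct {
    EnableSQSTerminationDraining *bool
    EnableScheduledEventDraining *bool
}

func (s nthSpec) isQueueMode() bool {
    return s.EnableSQSTerminationDraining != nil && *s.EnableSQSTerminationDraining
}

func ptr(b bool) *bool { return &b }

// currentCheck mirrors the validation quoted above: an explicit "false" for
// scheduled event draining is rejected whenever Queue Processor mode is on.
func currentCheck(s nthSpec) error {
    if s.isQueueMode() && s.EnableScheduledEventDraining != nil && !*s.EnableScheduledEventDraining {
        return fmt.Errorf("scheduled event draining cannot be disabled in Queue Processor mode")
    }
    return nil
}

// relaxedCheck sketches the behaviour argued for in this issue: the flag only
// matters in IMDS mode, so it is ignored entirely when queue mode is enabled.
func relaxedCheck(s nthSpec) error {
    if s.isQueueMode() {
        return nil // IMDS-only flags are irrelevant in Queue Processor mode
    }
    // (IMDS-mode validation of the flag would go here.)
    return nil
}

func main() {
    // The same combination as in the manifest in section 7 below.
    spec := nthSpec{
        EnableSQSTerminationDraining: ptr(true),
        EnableScheduledEventDraining: ptr(false),
    }
    fmt.Println("current validation:", currentCheck(spec)) // rejected with a Forbidden-style error
    fmt.Println("relaxed validation:", relaxedCheck(spec)) // accepted
}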
Based on the AWS Node Termination Handler documentation, enableScheduledEventDraining is only applicable in IMDS mode. While performing kops and Kubernetes upgrades of our cluster, we ran into the error above.
Looking at the AWS Node Termination Handler source code, we can see that scheduled event draining is only used when !imdsDisabled (i.e., when IMDS is enabled):
if !imdsDisabled && nthConfig.EnableScheduledEventDraining {
    //will retry 4 times with an interval of 2 seconds.
    pollCtx, cancelPollCtx := context.WithTimeout(context.Background(), 8*time.Second)
    err = wait.PollUntilContextCancel(pollCtx, 2*time.Second, true, func(context.Context) (done bool, err error) {
        err = handleRebootUncordon(nthConfig.NodeName, interruptionEventStore, *node)
        if err != nil {
            log.Warn().Err(err).Msgf("Unable to complete the uncordon after reboot workflow on startup, retrying")
        }
        return false, nil
    })
    if err != nil {
        log.Warn().Err(err).Msgf("All retries failed, unable to complete the uncordon after reboot workflow")
    }
    cancelPollCtx()
}
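As a quick standalone illustration (assumed, simplified values: imdsDisabled stands in for IMDS monitoring being turned off, which is the case in a Queue Processor / SQS deployment), the guard above never fires no matter what the flag is set to:

package main

import "fmt"

func main() {
    // Assumed values for a Queue Processor (SQS) deployment: IMDS monitoring
    // is disabled and the operator has set enableScheduledEventDraining: false.
    imdsDisabled := true
    enableScheduledEventDraining := false

    // Mirrors the guard in the NTH snippet above: the reboot-uncordon startup
    // workflow only runs when IMDS is enabled AND the flag is true.
    if !imdsDisabled && enableScheduledEventDraining {
        fmt.Println("would run the uncordon-after-reboot workflow")
    } else {
        fmt.Println("flag has no effect in this mode")
    }
}

Either way the flag changes nothing once IMDS is disabled, which is why forbidding enableScheduledEventDraining: false in SQS mode seems overly strict.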
We should be able to disable scheduled event draining while in SQS mode, since it has no effect there. @johngmyers, maybe I'm missing something here?
7. Please provide your cluster manifest. This is the relevant part:
nodeTerminationHandler:
  enabled: true
  enableSQSTerminationDraining: true
  managedASGTag: "aws-node-termination-handler/managed"
  cpuRequest: 200m
  prometheusEnable: true
  enableRebalanceMonitoring: false
  enableRebalanceDraining: false
  enableSpotInterruptionDraining: true
  enableScheduledEventDraining: false
8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.
9. Anything else do we need to know?