
Upgrade manager fails to spin up new nodes due to insufficient IAM permissions, but drains existing nodes anyway, causing pods to go into Pending state

Open rajakshay opened this issue 4 years ago • 0 comments

Is this a BUG REPORT or FEATURE REQUEST?: Bug Report

What happened: upgrade-manager was running with strategy.mode=eager and was unable to spin up new nodes due to an incorrect IAM permission, but it went on to drain the existing nodes in the ASG, causing pods to go into the Pending state.

What you expected to happen: When using strategy.mode=eager, if spinning up a new node fails, exit with an error. Do not drain the existing nodes.
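The expected control flow could be sketched like this (illustrative Python with hypothetical helper names; the actual controller is Go code in rollingupgrade_controller.go, this is only my reading of what eager mode should do):

```python
class NodeLaunchError(Exception):
    """Replacement capacity could not be brought up (e.g. IAM AccessDenied)."""

def update_instance_eager(launch_replacement, drain_node, terminate_node):
    """Expected eager-mode ordering: bring up the replacement FIRST; if that
    fails, propagate the error WITHOUT draining the existing node."""
    try:
        launch_replacement()   # in this report: EnterStandby returned a 403
    except NodeLaunchError:
        raise                  # abort here; existing pods keep running
    drain_node()               # only reached once the replacement is up
    terminate_node()
```

In the run reported here the controller instead continued to the drain step after the EnterStandby AccessDenied, which is the bug.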

How to reproduce it (as minimally and precisely as possible):

  • Forcing an upgrade by updating the worker node AMI:
image_id                         = "ami-08509bccec3a27874" -> "ami-02b73604aad3f32f1" # forces replacement
  • IAM policy used: in the Condition block, the expected value for the ASG tag "Name" was the cluster name (cluster_name), but the ASG's actual "Name" tag held a garbage value (different from the expected one).
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeInstances",
                "autoscaling:DescribeAutoScalingGroups"
            ],
            "Resource": "*"
        },
        {
            "Sid": "",
            "Effect": "Allow",
            "Action": "ec2:CreateTags",
            "Resource": "*",
            "Condition": {
                "StringEqualsIfExists": {
                    "aws:ResourceTag/Name": "<cluster_name>"
                }
            }
        },
        {
            "Sid": "",
            "Effect": "Allow",
            "Action": [
                "autoscaling:TerminateInstanceInAutoScalingGroup",
                "autoscaling:EnterStandby"
            ],
            "Resource": "*",
            "Condition": {
                "StringEqualsIfExists": {
                    "aws:ResourceTag/Name": "<cluster_name>"
                }
            }
        }
    ]
}
  • ASG Tag: Name=garbage
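To see why the policy above denies the EnterStandby and TerminateInstanceInAutoScalingGroup calls: StringEqualsIfExists compares aws:ResourceTag/Name against the literal cluster name, so an ASG whose Name tag exists but differs fails the condition. A minimal check (illustrative Python; this is a simplification, not AWS's real policy evaluation engine):

```python
def condition_allows(statement, resource_tags):
    """Simplified StringEqualsIfExists evaluation for resource-tag keys.
    Missing tag -> condition passes (the 'IfExists' part); present but
    mismatched tag -> condition fails, so the Allow does not apply."""
    cond = statement.get("Condition", {}).get("StringEqualsIfExists", {})
    for key, expected in cond.items():
        tag_key = key.split("/", 1)[1]  # "aws:ResourceTag/Name" -> "Name"
        actual = resource_tags.get(tag_key)
        if actual is not None and actual != expected:
            return False
    return True
```

With the ASG tagged Name=garbage, the condition fails and both autoscaling actions come back as AccessDenied; fixing either the tag or the policy's expected value resolves the 403s.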

Anything else we need to know?:

Environment:

  • rolling-upgrade-controller version
keikoproj/rolling-upgrade-controller:0.13
  • Kubernetes version:
kubectl version -o yaml
clientVersion:
  buildDate: "2020-01-23T14:21:36Z"
  compiler: gc
  gitCommit: 59603c6e503c87169aea6106f57b9f242f64df89
  gitTreeState: clean
  gitVersion: v1.17.2
  goVersion: go1.13.6
  major: "1"
  minor: "17"
  platform: darwin/amd64
serverVersion:
  buildDate: "2020-07-17T18:48:53Z"
  compiler: gc
  gitCommit: 2ba888155c7f8093a1bc06e3336333fbdb27b3da
  gitTreeState: clean
  gitVersion: v1.16.13-eks-2ba888
  goVersion: go1.13.9
  major: "1"
  minor: 16+
  platform: linux/amd64

EKS cluster version: v1.16.13

Other debugging information (if applicable):

  • RollingUpgrade status: $ kubectl describe rollingupgrade
Name:         REDACTED
Namespace:    upgrade-manager
Labels:       <none>
Annotations:  REDACTED
API Version:  upgrademgr.keikoproj.io/v1alpha1
Kind:         RollingUpgrade
Metadata:
  Creation Timestamp:  2020-10-07T22:44:24Z
  Generation:          1
  Resource Version:    3543567
  Self Link:           /apis/upgrademgr.keikoproj.io/v1alpha1/namespaces/upgrade-manager/rollingupgrades/REDACTED
  UID:                 d0be4ab0-3149-4790-b3ec-bfa8cbfbb370
Spec:
  Asg Name:               REDACTED
  Node Interval Seconds:  300
  Post Drain:
    Post Wait Script:  echo "Pods at postWait:"
kubectl get pods --all-namespaces --field-selector spec.nodeName=${INSTANCE_NAME}
    Script:  echo "Pods at PostDrain:"
kubectl get pods --all-namespaces --field-selector spec.nodeName=${INSTANCE_NAME}
  Post Drain Delay Seconds:  90
  Post Terminate:
    Script:  echo "PostTerminate:"
kubectl get pods --all-namespaces
  Pre Drain:
    Script:  kubectl get pods --all-namespaces --field-selector spec.nodeName=${INSTANCE_NAME}
  Strategy:
    Drain Timeout:  300
    Mode:           eager
Status:
  Conditions:
    Status:               True
    Type:                 Complete
  Current Status:         error
  End Time:               2020-10-07T22:46:00Z
  Nodes Processed:        1
  Start Time:             2020-10-07T22:44:25Z
  Total Nodes:            1
  Total Processing Time:  1m35.177718506s
Events:
  Type     Reason                          Age   From             Message
  ----     ------                          ----  ----             -------
  Normal   RollingUpgradeStarted           96s   upgrade-manager  {"asgName":"REDACTED","msg":"Rolling Upgrade has started","status":"started","strategy":"randomUpdate"}
  Normal   RollingUpgradeInstanceStarted   95s   upgrade-manager  {"asgName":"REDACTED","msg":"Started Updating Instance i-0fa9db2fe978d007e, in AZ: us-west-2b","status":"in-progress","strategy":"randomUpdate"}
  Warning  RollingUpgradeFinished          0s    upgrade-manager  {"asgName":"REDACTED","info":"Rolling Upgrade as finished (status=error)","status":"error","strategy":"randomUpdate"}
  Normal   RollingUpgradeInstanceFinished  0s    upgrade-manager  {"asgName":"REDACTED","msg":"Finished Updating Instance 1/1 (Errors=1)","status":"in-progress","strategy":"randomUpdate"}
  • controller logs: $ kubectl logs
{"level":"info","ts":1602110664.6059573,"logger":"controllers.RollingUpgrade","msg":"Default strategy settings applied.","rollingupgrade":"REDACTED","updateStrategy":{"type":"randomUpdate","mode":"eager","maxUnavailable":1,"drainTimeout":300}}
{"level":"info","ts":1602110664.6060495,"logger":"controllers.RollingUpgrade","msg":"Adding obj to map: ","rollingupgrade":"REDACTED","name":"REDACTED"}
{"level":"debug","ts":1602110664.6060662,"logger":"controller-runtime.controller","msg":"Successfully Reconciled","controller":"rollingupgrade","request":"upgrade-manager/REDACTED"}
{"level":"info","ts":1602110665.2662385,"logger":"controllers.RollingUpgrade","msg":"Nodes in ASG that *might* need to be updated","rollingupgrade":"REDACTED","asgName":"REDACTED","asgSize":1}
2020/10/07 22:44:25 Max unavailable calculated for REDACTED is 1
{"level":"info","ts":1602110665.266579,"logger":"controllers.RollingUpgrade","msg":"ASG REDACTED is being processed. Requeuing","rollingupgrade":"REDACTED"}
{"level":"info","ts":1602110665.340885,"logger":"controllers.RollingUpgrade","msg":"selected instances for rotation: [{\n  AvailabilityZone: \"us-west-2b\",\n  HealthStatus: \"Healthy\",\n  InstanceId: \"i-0fa9db2fe978d007e\",\n  InstanceType: \"m5.xlarge\",\n  LifecycleState: \"InService\",\n  ProtectedFromScaleIn: false\n}]","rollingupgrade":"REDACTED"}
{"level":"info","ts":1602110665.3456457,"logger":"controllers.RollingUpgrade","msg":"launch configuration name differs","rollingupgrade":"REDACTED"}
{"level":"info","ts":1602110665.345665,"logger":"controllers.RollingUpgrade","msg":"Found instance","rollingupgrade":"REDACTED","instanceID":"i-0fa9db2fe978d007e","instanceName":"REDACTED"}
{"level":"info","ts":1602110665.3456714,"logger":"controllers.RollingUpgrade","msg":"Found instance","rollingupgrade":"REDACTED","instanceID":"i-0fa9db2fe978d007e","instanceName":"REDACTED"}
{"level":"info","ts":1602110665.3456802,"logger":"controllers.RollingUpgrade","msg":"setting instance state","rollingupgrade":"REDACTED","instanceID":"i-0fa9db2fe978d007e","instanceState":"in-progress"}
{"level":"info","ts":1602110665.459847,"logger":"controllers.RollingUpgrade","msg":"starting replacement with eager mode","rollingupgrade":"REDACTED","mode":"eager"}
{"level":"info","ts":1602110665.4598713,"logger":"controllers.RollingUpgrade","msg":"Setting to stand-by","rollingupgrade":"REDACTED","REDACTED":"i-0fa9db2fe978d007e"}
{"level":"error","ts":1602110665.5377138,"logger":"controllers.RollingUpgrade","msg":"Failed to enter standby","rollingupgrade":"REDACTED","instanceID":"i-0fa9db2fe978d007e","error":"AccessDenied: User: REDACTED is not authorized to perform: autoscaling:EnterStandby on resource: arn:aws:autoscaling:us-west-2:REDACTED:autoScalingGroup:5cf9b360-fc33-46d1-9462-278e5fb4275a:autoScalingGroupName/REDACTED\n\tstatus code: 403, request id: 6017ca32-2517-4be0-bef0-874fd772fa8a","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/keikoproj/upgrade-manager/controllers.(*RollingUpgradeReconciler).error\n\t/workspace/controllers/rollingupgrade_controller.go:1197\ngithub.com/keikoproj/upgrade-manager/controllers.(*RollingUpgradeReconciler).SetStandby\n\t/workspace/controllers/rollingupgrade_controller.go:382\ngithub.com/keikoproj/upgrade-manager/controllers.(*RollingUpgradeReconciler).UpdateInstanceEager\n\t/workspace/controllers/rollingupgrade_controller.go:998\ngithub.com/keikoproj/upgrade-manager/controllers.(*RollingUpgradeReconciler).UpdateInstance\n\t/workspace/controllers/rollingupgrade_controller.go:1101"}
{"level":"info","ts":1602110665.6101868,"logger":"controllers.RollingUpgrade","msg":"desired capacity is met","rollingupgrade":"REDACTED","inServiceCount":1}
{"level":"info","ts":1602110665.6196406,"logger":"controllers.RollingUpgrade","msg":"desired capacity is met","rollingupgrade":"REDACTED","inServiceCount":1}
{"level":"info","ts":1602110665.6196613,"logger":"controllers.RollingUpgrade","msg":"Running script","rollingupgrade":"REDACTED","script":"kubectl get pods --all-namespaces --field-selector spec.nodeName=${INSTANCE_NAME}\n"}
{"level":"info","ts":1602110666.8736444,"logger":"controllers.RollingUpgrade","msg":"Script finished","rollingupgrade":"REDACTED","output":"NAMESPACE     NAME                   READY   STATUS    RESTARTS   AGE\nREDACTED       REDACTED    1/1     Running   0          23m\nREDACTED   REDACTED         1/1     Running   1          23m\nREDACTED   kube-proxy-nnm6x       1/1     Running   0          23m\nlogging       REDACTED   1/1     Running   0          23m\n"}
{"level":"info","ts":1602110666.8736682,"logger":"controllers.RollingUpgrade","msg":"Creating a context with timeout","rollingupgrade":"REDACTED","drainTimeout":300}
{"level":"info","ts":1602110666.873679,"logger":"controllers.RollingUpgrade","msg":"Invoking kubectl drain for the node","rollingupgrade":"REDACTED","nodeName":"REDACTED"}
{"level":"info","ts":1602110666.8737397,"logger":"controllers.RollingUpgrade","msg":"Running script","rollingupgrade":"REDACTED","script":"/usr/local/bin/kubectl drain REDACTED --ignore-daemonsets=true --delete-local-data=true --force --grace-period=-1"}
{"level":"info","ts":1602110668.0690458,"logger":"controllers.RollingUpgrade","msg":"Script finished","rollingupgrade":"REDACTED","output":"node/REDACTED cordoned\nWARNING: ignoring DaemonSet-managed Pods: REDACTED, REDACTED, REDACTED, \nnode/REDACTED drained\n"}
{"level":"info","ts":1602110668.0690885,"logger":"controllers.RollingUpgrade","msg":"Kubectl drain completed for node","rollingupgrade":"REDACTED","nodeName":"REDACTED"}
{"level":"info","ts":1602110668.0691051,"logger":"controllers.RollingUpgrade","msg":"Running script","rollingupgrade":"REDACTED","script":"echo \"Pods at PostDrain:\"\nkubectl get pods --all-namespaces --field-selector spec.nodeName=${INSTANCE_NAME}\n"}
{"level":"info","ts":1602110669.060474,"logger":"controllers.RollingUpgrade","msg":"Script finished","rollingupgrade":"REDACTED","output":"Pods at PostDrain:\nNAMESPACE     NAME                   READY   STATUS    RESTARTS   AGE\nREDACTED       REDACTED    1/1     Running   0          23m\nREDACTED   REDACTED         1/1     Running   1          23m\nREDACTED   kube-proxy-nnm6x       1/1     Running   0          23m\nlogging       REDACTED   1/1     Running   0          23m\n"}
{"level":"info","ts":1602110669.0605023,"logger":"controllers.RollingUpgrade","msg":"Waiting for postDrainDelay","rollingupgrade":"REDACTED","postDrainDelay":90}
{"level":"info","ts":1602110725.266783,"logger":"controllers.RollingUpgrade","msg":"ASG REDACTED is being processed. Requeuing","rollingupgrade":"REDACTED"}
{"level":"info","ts":1602110759.0605834,"logger":"controllers.RollingUpgrade","msg":"Running script","rollingupgrade":"REDACTED","script":"echo \"Pods at postWait:\"\nkubectl get pods --all-namespaces --field-selector spec.nodeName=${INSTANCE_NAME}\n"}
{"level":"info","ts":1602110759.971736,"logger":"controllers.RollingUpgrade","msg":"Script finished","rollingupgrade":"REDACTED","output":"Pods at postWait:\nNAMESPACE     NAME                   READY   STATUS    RESTARTS   AGE\nREDACTED       REDACTED    1/1     Running   0          25m\nREDACTED   REDACTED         1/1     Running   1          25m\nREDACTED   kube-proxy-nnm6x       1/1     Running   0          25m\nlogging       REDACTED   1/1     Running   0          25m\n"}
{"level":"error","ts":1602110760.1696763,"logger":"controllers.RollingUpgrade","msg":"AccessDenied","rollingupgrade":"REDACTED","instanceID":"i-0fa9db2fe978d007e","error":"AccessDenied: User: REDACTED is not authorized to perform: autoscaling:TerminateInstanceInAutoScalingGroup on resource: arn:aws:autoscaling:us-west-2:REDACTED:autoScalingGroup:5cf9b360-fc33-46d1-9462-278e5fb4275a:autoScalingGroupName/REDACTED\n\tstatus code: 403, request id: 2c966731-40ec-4df5-94da-f1a28eca6a37","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/keikoproj/upgrade-manager/controllers.(*RollingUpgradeReconciler).error\n\t/workspace/controllers/rollingupgrade_controller.go:1197\ngithub.com/keikoproj/upgrade-manager/controllers.(*RollingUpgradeReconciler).TerminateNode\n\t/workspace/controllers/rollingupgrade_controller.go:412\ngithub.com/keikoproj/upgrade-manager/controllers.(*RollingUpgradeReconciler).DrainTerminate\n\t/workspace/controllers/rollingupgrade_controller.go:1038\ngithub.com/keikoproj/upgrade-manager/controllers.(*RollingUpgradeReconciler).UpdateInstanceEager\n\t/workspace/controllers/rollingupgrade_controller.go:1019\ngithub.com/keikoproj/upgrade-manager/controllers.(*RollingUpgradeReconciler).UpdateInstance\n\t/workspace/controllers/rollingupgrade_controller.go:1101"}
{"level":"info","ts":1602110760.174612,"logger":"controllers.RollingUpgrade","msg":"node is still joined to cluster, will wait and retry","rollingupgrade":"REDACTED","nodeName":"REDACTED","terminationSleepIntervalSeconds":30}
{"level":"error","ts":1602110760.177671,"logger":"controllers.RollingUpgrade","msg":"Failed to runRestack","rollingupgrade":"REDACTED","error":"Error updating instances, ErrorCount: 1, Errors: [AccessDenied: User: REDACTED is not authorized to perform: autoscaling:TerminateInstanceInAutoScalingGroup on resource: arn:aws:autoscaling:us-west-2:REDACTED:autoScalingGroup:5cf9b360-fc33-46d1-9462-278e5fb4275a:autoScalingGroupName/REDACTED\n\tstatus code: 403, request id: 2c966731-40ec-4df5-94da-f1a28eca6a37]","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/keikoproj/upgrade-manager/controllers.(*RollingUpgradeReconciler).error\n\t/workspace/controllers/rollingupgrade_controller.go:1197\ngithub.com/keikoproj/upgrade-manager/controllers.(*RollingUpgradeReconciler).Process\n\t/workspace/controllers/rollingupgrade_controller.go:702"}
{"level":"info","ts":1602110760.177715,"logger":"controllers.RollingUpgrade","msg":"Marked object as","rollingupgrade":"REDACTED","finalStatus":"error"}
{"level":"info","ts":1602110760.1898236,"logger":"controllers.RollingUpgrade","msg":"Deleted the entries of ASG in the cluster store","rollingupgrade":"REDACTED","asgName":"REDACTED"}
{"level":"info","ts":1602110760.1898403,"logger":"controllers.RollingUpgrade","msg":"Deleted from admission map ","rollingupgrade":"REDACTED","admissionMap":{}}
{"level":"info","ts":1602110760.190117,"logger":"controllers.RollingUpgrade","msg":"Default strategy settings applied.","rollingupgrade":"REDACTED","updateStrategy":{"type":"randomUpdate","mode":"eager","maxUnavailable":1,"drainTimeout":300}}
{"level":"info","ts":1602110760.1901827,"logger":"controllers.RollingUpgrade","msg":"Adding obj to map: ","rollingupgrade":"REDACTED","name":"REDACTED"}
{"level":"debug","ts":1602110760.1902156,"logger":"controller-runtime.controller","msg":"Successfully Reconciled","controller":"rollingupgrade","request":"upgrade-manager/REDACTED"}
{"level":"info","ts":1602110760.1902957,"logger":"controllers.RollingUpgrade","msg":"No more processing","rollingupgrade":"REDACTED","currentStatus":"error"}
{"level":"info","ts":1602110760.1903358,"logger":"controllers.RollingUpgrade","msg":"Marking object for deletion","rollingupgrade":"REDACTED"}
{"level":"info","ts":1602110760.190347,"logger":"controllers.RollingUpgrade","msg":"Deleted object from admission map","rollingupgrade":"REDACTED"}
{"level":"info","ts":1602110761.5315042,"logger":"controllers.RollingUpgrade","msg":"Deleted object from map","rollingupgrade":"","name":{"namespace":"upgrade-manager","name":"REDACTED"}}
{"level":"debug","ts":1602110761.53153,"logger":"controller-runtime.controller","msg":"Successfully Reconciled","controller":"rollingupgrade","request":"upgrade-manager/REDACTED"}
{"level":"info","ts":1602110785.2670298,"logger":"controllers.RollingUpgrade","msg":"Deleted object from map","rollingupgrade":"","name":{"namespace":"upgrade-manager","name":"REDACTED"}}
{"level":"debug","ts":1602110785.267056,"logger":"controller-runtime.controller","msg":"Successfully Reconciled","controller":"rollingupgrade","request":"upgrade-manager/REDACTED"}
{"level":"info","ts":1602110790.1791327,"logger":"controllers.RollingUpgrade","msg":"node is still joined to cluster, will wait and retry","rollingupgrade":"REDACTED","nodeName":"REDACTED","terminationSleepIntervalSeconds":30}
{"level":"info","ts":1602110820.1834471,"logger":"controllers.RollingUpgrade","msg":"node is still joined to cluster, will wait and retry","rollingupgrade":"REDACTED","nodeName":"REDACTED","terminationSleepIntervalSeconds":30}

I also think the Rolling Upgrade design doc needs to document the steps taken under each strategy.mode, since the workflow changes based on the chosen strategy.
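As a starting point for that doc, the per-node step ordering as I understand it differs per mode like so (step names here are mine, not the controller's; this reflects my reading of the intended behavior, not confirmed code):

```python
# Illustrative summary of how strategy.mode is meant to change the
# per-node workflow (hypothetical step names, not upgrade-manager code).
WORKFLOW_BY_MODE = {
    # eager: bring up replacement capacity before touching the old node
    "eager": ["enter-standby", "wait-for-replacement", "drain", "terminate"],
    # lazy: drain first, then terminate and let the ASG replace the node
    "lazy": ["drain", "terminate", "wait-for-replacement"],
}
```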

rajakshay · Oct 07 '20 23:10