
[Bug] `Error: re-listing nodes: Unauthorized` when running `eksctl create nodegroup`

Open · chrisjohnson00 opened this issue 2 years ago · 7 comments

What were you trying to accomplish?

We run a multi-OS k8s cluster comprising both Linux and Windows nodes. We've been successfully using eksctl to set up and upgrade the cluster for close to 18 months. "Recently" we started encountering `Error: re-listing nodes: Unauthorized` when our Windows node groups were being created or upgraded. Judging by our eksctl upgrade history, we bumped from 0.89.0 to 0.99.0 around the time the issue started to occur. We're currently running 0.101.0. Our Windows node replacement process works similarly to how the eksctl documentation describes it (sketched below):

  • We have an existing node group named windows-ng
  • We create a new node group named windows-temp-ng with the new AMI and/or other node configuration changes.
  • Once windows-temp-ng joins the cluster, we destroy windows-ng.
  • Once the deletion of windows-ng completes, we re-create the windows-ng node group with the same configuration as the temp node group.
  • Once the final version of windows-ng joins, we delete windows-temp-ng.
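
A rough sketch of those steps as eksctl invocations (a minimal sketch, assuming both node groups are declared in the same cluster.yaml; the file path and flags are illustrative, not our exact automation):

# 1. Create the temporary node group with the new AMI / configuration.
eksctl create nodegroup -f cluster.yaml --include windows-temp-ng

# 2. Once windows-temp-ng has joined the cluster, delete the old node group.
eksctl delete nodegroup -f cluster.yaml --include windows-ng --approve

# 3. Re-create windows-ng with the same configuration as the temp group.
eksctl create nodegroup -f cluster.yaml --include windows-ng

# 4. Once the new windows-ng has joined, delete the temporary node group.
eksctl delete nodegroup -f cluster.yaml --include windows-temp-ng --approve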

This process worked mostly without incident for over a year, until recently. (star wipe to next section)

What happened?

At any of the `create nodegroup` steps in this flow, we started to get the following error:

Tue, 21 Jun 2022 00:44:22 GMT TASK [eks : recreate primary windows node group from cluster.yaml] *************
Tue, 21 Jun 2022 00:44:22 GMT Tuesday 21 June 2022  00:44:22 +0000 (0:05:52.301)       0:08:43.964 ********** 
Tue, 21 Jun 2022 01:01:42 GMT fatal: [localhost]: FAILED! => {
Tue, 21 Jun 2022 01:01:42 GMT     "changed": true,
Tue, 21 Jun 2022 01:01:42 GMT     "cmd": [
Tue, 21 Jun 2022 01:01:42 GMT         "eksctl",
Tue, 21 Jun 2022 01:01:42 GMT         "create",
Tue, 21 Jun 2022 01:01:42 GMT         "nodegroup",
Tue, 21 Jun 2022 01:01:42 GMT         "-f",
Tue, 21 Jun 2022 01:01:42 GMT         "/tmp/cluster.yaml",
Tue, 21 Jun 2022 01:01:42 GMT         "--include",
Tue, 21 Jun 2022 01:01:42 GMT         "windows-ng"
Tue, 21 Jun 2022 01:01:42 GMT     ],
Tue, 21 Jun 2022 01:01:42 GMT     "delta": "0:17:20.134806",
Tue, 21 Jun 2022 01:01:42 GMT     "end": "2022-06-21 01:01:42.767383",
Tue, 21 Jun 2022 01:01:42 GMT     "rc": 1,
Tue, 21 Jun 2022 01:01:42 GMT     "start": "2022-06-21 00:44:22.632577"
Tue, 21 Jun 2022 01:01:42 GMT }
Tue, 21 Jun 2022 01:01:42 GMT
Tue, 21 Jun 2022 01:01:42 GMT STDOUT:
Tue, 21 Jun 2022 01:01:42 GMT
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:44:28 [ℹ]  nodegroup "windows-ng" will use "ami-059c60541a2ffa6c9" [WindowsServer2019FullContainer/1.21]
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:44:28 [ℹ]  nodegroup "linux" will use "" [AmazonLinux2/1.21]
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:44:28 [ℹ]  nodegroup "spot-linux" will use "" [AmazonLinux2/1.21]
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:44:32 [ℹ]  3 existing nodegroup(s) (linux,spot-linux,windows-ng-temp) will be excluded
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:44:32 [ℹ]  combined include rules: windows-ng
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:44:32 [ℹ]  1 nodegroup (windows-ng) was included (based on the include/exclude rules)
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:44:32 [ℹ]  will create a CloudFormation stack for each of 1 nodegroups in cluster "staging-main41"
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:44:32 [ℹ]  
Tue, 21 Jun 2022 01:01:42 GMT 2 sequential tasks: { fix cluster compatibility, 1 task: { 1 task: { create nodegroup "windows-ng" } } 
Tue, 21 Jun 2022 01:01:42 GMT }
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:44:32 [ℹ]  checking cluster stack for missing resources
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:44:36 [ℹ]  cluster stack has all required resources
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:44:36 [ℹ]  building nodegroup stack "eksctl-staging-main41-nodegroup-windows-ng"
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:44:36 [ℹ]  deploying stack "eksctl-staging-main41-nodegroup-windows-ng"
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:44:36 [ℹ]  waiting for CloudFormation stack "eksctl-staging-main41-nodegroup-windows-ng"
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:45:07 [ℹ]  waiting for CloudFormation stack "eksctl-staging-main41-nodegroup-windows-ng"
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:45:46 [ℹ]  waiting for CloudFormation stack "eksctl-staging-main41-nodegroup-windows-ng"
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:46:54 [ℹ]  waiting for CloudFormation stack "eksctl-staging-main41-nodegroup-windows-ng"
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:47:25 [ℹ]  waiting for CloudFormation stack "eksctl-staging-main41-nodegroup-windows-ng"
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:48:01 [ℹ]  waiting for CloudFormation stack "eksctl-staging-main41-nodegroup-windows-ng"
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:48:03 [ℹ]  no tasks
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:48:03 [ℹ]  adding identity "arn:aws:iam::xxxx:role/eksctl-staging-main41-nodegroup-w-NodeInstanceRole-6LO8B0ZYXR17" to auth ConfigMap
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:48:03 [ℹ]  nodegroup "windows-ng" has 0 node(s)
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:48:03 [ℹ]  waiting for at least 4 node(s) to become ready in "windows-ng"
Tue, 21 Jun 2022 01:01:42 GMT
Tue, 21 Jun 2022 01:01:42 GMT
Tue, 21 Jun 2022 01:01:42 GMT STDERR:
Tue, 21 Jun 2022 01:01:42 GMT
Tue, 21 Jun 2022 01:01:42 GMT Error: re-listing nodes: Unauthorized
Tue, 21 Jun 2022 01:01:42 GMT
Tue, 21 Jun 2022 01:01:42 GMT
Tue, 21 Jun 2022 01:01:42 GMT MSG:
Tue, 21 Jun 2022 01:01:42 GMT
Tue, 21 Jun 2022 01:01:42 GMT non-zero return code

How to reproduce it?

I have no documented cases of this issue occurring when creating a cluster directly from our config; it seems to happen only when running `eksctl create nodegroup`. Our automation triggers this when it detects that the AMI or some other component of the existing node group is not at the desired version.
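
Concretely, the step that fails is the same invocation visible in the Ansible log above:

eksctl create nodegroup -f /tmp/cluster.yaml --include windows-ng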

Here's the config file template we render when creating the cluster or upgrading nodes. If you need a rendered example, I can provide that as well. I'll note that every environment where we see this is Datadog-enabled, so the preBootstrapCommands are executed. We happen to be removing those bootstrap commands, but haven't fully tested whether that has any impact here yet.

apiVersion: 'eksctl.io/v1alpha5'
kind: 'ClusterConfig'

metadata:
  name: '{{ cluster_name }}'
  region: '{{ deploy_region }}'
  version: '{{ k8s_version }}'

vpc:
  id: '{{ vpc_id }}'
  cidr: '{{ vpc_cidr }}'
  subnets:
    private:
      us-east-2a:
        id: '{{ private_subnet_a }}'
        cidr: '{{ private_subnet_a_cidr_block }}'
      us-east-2b:
        id: '{{ private_subnet_b }}'
        cidr: '{{ private_subnet_b_cidr_block }}'
      us-east-2c:
        id: '{{ private_subnet_c }}'
        cidr: '{{ private_subnet_c_cidr_block }}'
    public:
      us-east-2a:
        id: '{{ public_subnet_a }}'
        cidr: '{{ public_subnet_a_cidr_block }}'
      us-east-2b:
        id: '{{ public_subnet_b }}'
        cidr: '{{ public_subnet_b_cidr_block }}'
      us-east-2c:
        id: '{{ public_subnet_c }}'
        cidr: '{{ public_subnet_c_cidr_block }}'

iam:
  withOIDC: true
  serviceAccounts:
  - snip out this for compactness

managedNodeGroups:
  - name: '{{ linux_node_group_name }}'
    labels:
      accelerated-computing: "false"
      is-spot-instance: "false"
    tags:
{% if datadog_enabled %}
      datadog: '{{ datadog_monitoring }}'
{% endif %}
    instanceType: '{{ linux_instance_type }}'
    desiredCapacity: {{ linux_desired_capacity }}
    minSize: {{ linux_min_size }}
    maxSize: {{ linux_max_size }}
    securityGroups:
      attachIDs:
        - '{{ rds_app_security_group }}'  # this is temporary, and should be replaced with iam based auth at the pod level

  - name: '{{ spot_linux_node_group_name }}'
    spot: true
    labels:
      '{{ spot_linux_node_group_name }}': "true"
      accelerated-computing: "false"
      is-spot-instance: "true"
    tags:
      k8s.io/cluster-autoscaler/node-template/label/{{ spot_linux_node_group_name }}: "true"
      k8s.io/cluster-autoscaler/node-template/label/accelerated-computing: "false"
      k8s.io/cluster-autoscaler/node-template/label/is-spot-instance: "true"
{% if datadog_enabled %}
      datadog: '{{ datadog_monitoring }}'
{% endif %}
    minSize: {{ spot_linux_min_size }}
    maxSize: {{ spot_linux_max_size }}
    desiredCapacity: {{ spot_linux_desired_capacity }}
    instanceTypes: {{ spot_linux_instance_types_array }}
    securityGroups:
      attachIDs:
        - '{{ rds_app_security_group }}'  # this is temporary, and should be replaced with iam based auth at the pod level

nodeGroups:
  - name: '{{ windows_node_group_name }}'
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/{{ cluster_name }}: "owned"
      instance-type: '{{ windows_instance_type }}'
      eks:nodegroup-name: '{{ windows_node_group_name }}'
      ami: '{{ latest_windows_ami }}'
{% if datadog_enabled %}
      datadog: '{{ datadog_monitoring }}'
      datadog-agent: "7-latest"
{% endif %}
    ami: '{{ latest_windows_ami }}'
    amiFamily: 'WindowsServer2019FullContainer'
    instanceType: '{{ windows_instance_type }}'
    minSize: {{ windows_min_size }}
    maxSize: {{ windows_max_size }}
    desiredCapacity: {{ windows_desired_capacity }}
    securityGroups:
      attachIDs:
      - '{{ rds_app_security_group }}'  # this is temporary, and should be replaced with iam based auth at the pod level
    preBootstrapCommands:
{% if datadog_enabled %}
      - Invoke-WebRequest https://s3.amazonaws.com/ddagent-windows-stable/datadog-agent-7-latest.amd64.msi -OutFile datadog-agent-7-latest.amd64.msi
      - Start-Process -Wait msiexec -ArgumentList '/qn /i datadog-agent-7-latest.amd64.msi APIKEY="{{ datadog_api_key }}" SITE={{ datadog_site }} LOGS_ENABLED="true" APM_ENABLED="true" PROCESS_ENABLED="true" ADDLOCAL="MainApplication,NPM" EC2_USE_WINDOWS_PREFIX_DETECTION="true"'
{% endif %}
cloudWatch:
  clusterLogging:
    enableTypes: {{ cluster_logging_types }}

Logs

Provided above, but if additional logging with increased verbosity is needed, I will add it here.

Anything else we need to know?

We run eksctl from a Docker container, installed via:

# Install eksctl
echo "Installing eksctl..." && \
curl --silent --location "https://github.com/weaveworks/eksctl/releases/download/v$EKSCTL_VERSION/eksctl_Linux_amd64.tar.gz" | tar xz -C /tmp && \
mv /tmp/eksctl /usr/local/bin

Versions

As mentioned, this seems to have started around the time we upgraded from 0.89.0 to 0.99.0. Currently we're running:

# eksctl info
eksctl version: 0.101.0
kubectl version: v1.22.10
OS: linux

chrisjohnson00 · Jul 06 '22 19:07

This seems to be triggered by the node taking a long time to come up, and by adding the nodegroup separately from cluster creation. I tried this with a 10-minute sleep and it didn't trigger the issue, but a 15-minute sleep reproduces it reliably:

# cat cluster.yaml
apiVersion: 'eksctl.io/v1alpha5'
kind: 'ClusterConfig'

metadata:
  name: ngtest1
  region: us-east-2
  version: '1.22'

nodeGroups:
 - name: 'l1'
   instanceType: 'm5a.large'
   minSize: 1
   maxSize: 1
   desiredCapacity: 1
   preBootstrapCommands:
     - sleep 900
# eksctl create cluster --without-nodegroup -f cluster.yaml --install-nvidia-plugin=false
[...]
# eksctl create nodegroup -f cluster.yaml --install-nvidia-plugin=false
[...]
Error: re-listing nodes: Unauthorized

My versions:

# eksctl info
eksctl version: 0.105.0-dev+aa76f1d4.2022-07-08T14:38:11Z
kubectl version: v1.22.11
OS: darwin

pcharlan · Jul 11 '22 17:07

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] · Aug 19 '22 02:08

I believe this still needs attention, and should not be marked stale. I have just reproduced this with version 0.109.0-dev+78243b4c.2022-08-19T12:48:07Z using the small recipe above.

pcharlan · Aug 24 '22 23:08

> I believe this still needs attention, and should not be marked stale. I have just reproduced this with version 0.109.0-dev+78243b4c.2022-08-19T12:48:07Z using the small recipe above.

The latest eksctl release has some changes I pushed that might've fixed the issue. @pcharlan, do you have logs from the failed run and the ClusterConfig file?

cPu1 · Sep 20 '22 07:09

@cPu1 I just reproduced this using eksctl 0.112. For us it is an intermittent issue in our provisioning pipeline, which re-triggers itself (back-to-back provision and destroy). Please let me know whether I can provide additional information.

P.S. We are using Linux-only nodes.

hazanmor · Sep 21 '22 11:09

> @cPu1 I just reproduced this using eksctl 0.112. For us it is an intermittent issue in our provisioning pipeline, which re-triggers itself (back-to-back provision and destroy). Please let me know whether I can provide additional information.
>
> P.S. We are using Linux-only nodes.

@hazanmor, yes, please share the full logs, redacting any sensitive information, and the ClusterConfig file.

cPu1 · Sep 21 '22 12:09

@cPu1, yes, it's still happening for me with this version:

eksctl version: 0.111.0-dev+9a99e9218.2022-09-09T18:35:18Z
kubectl version: v1.22.14
OS: darwin

My cluster config file and recipe for reproducing are in my comment above, https://github.com/weaveworks/eksctl/issues/5492#issuecomment-1180664206

The output from the run is:

2022-09-21 09:33:33 [ℹ]  nodegroup "l1" will use "ami-0e29f637618ce9a89" [AmazonLinux2/1.22]
2022-09-21 09:33:37 [ℹ]  1 nodegroup (l1) was included (based on the include/exclude rules)
2022-09-21 09:33:37 [ℹ]  will create a CloudFormation stack for each of 1 nodegroups in cluster "ngtest1"
2022-09-21 09:33:37 [ℹ]  
2 sequential tasks: { fix cluster compatibility, 1 task: { 1 task: { create nodegroup "l1" } } 
}
2022-09-21 09:33:37 [ℹ]  checking cluster stack for missing resources
2022-09-21 09:33:39 [ℹ]  cluster stack has all required resources
2022-09-21 09:33:39 [ℹ]  building nodegroup stack "eksctl-ngtest1-nodegroup-l1"
2022-09-21 09:33:39 [ℹ]  deploying stack "eksctl-ngtest1-nodegroup-l1"
2022-09-21 09:33:40 [ℹ]  waiting for CloudFormation stack "eksctl-ngtest1-nodegroup-l1"
2022-09-21 09:34:10 [ℹ]  waiting for CloudFormation stack "eksctl-ngtest1-nodegroup-l1"
2022-09-21 09:34:51 [ℹ]  waiting for CloudFormation stack "eksctl-ngtest1-nodegroup-l1"
2022-09-21 09:36:00 [ℹ]  waiting for CloudFormation stack "eksctl-ngtest1-nodegroup-l1"
2022-09-21 09:37:20 [ℹ]  waiting for CloudFormation stack "eksctl-ngtest1-nodegroup-l1"
2022-09-21 09:37:20 [ℹ]  no tasks
2022-09-21 09:37:20 [ℹ]  adding identity "arn:aws:iam::[REDACTED]:role/eksctl-ngtest1-nodegroup-l1-NodeInstanceRole-3G6PTO87IBZX" to auth ConfigMap
2022-09-21 09:37:21 [ℹ]  nodegroup "l1" has 0 node(s)
2022-09-21 09:37:21 [ℹ]  waiting for at least 1 node(s) to become ready in "l1"
Error: re-listing nodes: Unauthorized

I install eksctl using brew on macOS. If there's a way to try a newer build, I'm happy to try that if the devs can't reproduce using my config. Thanks!

pcharlan · Sep 21 '22 16:09

@pcharlan, thanks for the details. We're going to work on a fix soon.

cPu1 · Sep 26 '22 07:09

FYI, the issue is still present with:

eksctl version: 0.114.0-dev+48660cbd1.2022-10-08T01:55:00Z
kubectl version: v1.22.15
OS: darwin

pcharlan · Oct 10 '22 19:10

I have identified that the issue lies with adding the NodeInstanceRole to the aws-auth ConfigMap. When I add the missing roles using a script, the nodes join the cluster and it works. The command used:

eksctl create iamidentitymapping --cluster $CLUSTER_NAME --arn $line --group system:nodes --group system:bootstrappers --username system:node:{{EC2PrivateDNSName}} --region $REGION
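
A minimal sketch of what such a workaround script could look like (the cluster name, region, and the InstanceRoleARN stack-output query are assumptions, not my exact script):

CLUSTER_NAME=my-cluster   # assumption: substitute your cluster name
REGION=us-east-2          # assumption: substitute your region

# Assumption: eksctl's unmanaged nodegroup stacks expose the node instance
# role as a stack output named "InstanceRoleARN"; adjust the query if not.
aws cloudformation describe-stacks --region "$REGION" \
  --query "Stacks[?starts_with(StackName, 'eksctl-${CLUSTER_NAME}-nodegroup-')].Outputs[?OutputKey=='InstanceRoleARN'].OutputValue" \
  --output text |
tr '\t' '\n' |
while read -r line; do
  # Map each NodeInstanceRole into the aws-auth ConfigMap so its nodes are
  # authorized to join the cluster.
  eksctl create iamidentitymapping \
    --cluster "$CLUSTER_NAME" \
    --region "$REGION" \
    --arn "$line" \
    --group system:nodes \
    --group system:bootstrappers \
    --username 'system:node:{{EC2PrivateDNSName}}'
done

# Verify the mappings landed in the aws-auth ConfigMap:
kubectl -n kube-system get configmap aws-auth -o yaml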

hazanmor · Oct 11 '22 11:10

@cPu1 does your fix align with what @hazanmor mentions?

chrisjohnson00 · Oct 12 '22 21:10

> @cPu1 does your fix align with what @hazanmor mentions?

Yes, it does.

cPu1 · Oct 13 '22 08:10

Thanks @cPu1! The 0.116.0 release made it to brew, and I've tested it locally and everything worked as expected.

pcharlan · Oct 30 '22 22:10

> Thanks @cPu1! The 0.116.0 release made it to brew, and I've tested it locally and everything worked as expected.

@pcharlan, great! Thanks for trying it out.

cPu1 · Oct 31 '22 07:10