[Bug] `Error: re-listing nodes: Unauthorized` when running `eksctl create nodegroup`
What were you trying to accomplish?
We run a multi-OS Kubernetes cluster comprising both Linux and Windows nodes. We've been successfully using eksctl for the setup and upgrade of the cluster for close to 18 months. "Recently" we started encountering Error: re-listing nodes: Unauthorized when our Windows node groups were being created or upgraded. From the looks of our eksctl upgrades, we bumped from 0.89.0 to 0.99.0 around the time the issue started to occur. We're currently running 0.101.0.
Our Windows node replacement process works similarly to how the eksctl documentation describes it should work:
- We have an existing node group named windows-ng.
- We create a new node group named windows-temp-ng with the new AMI and/or other node configuration.
- Once windows-temp-ng joins the cluster, we then destroy windows-ng.
- Once the destroy of windows-ng completes, we re-create the windows-ng node group with the same configuration as the temp node group.
- Once the final version of windows-ng joins, we delete windows-temp-ng.
This process has worked mostly without incident for over a year, until recently. (star wipe to next section)
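For concreteness, the replacement steps above can be sketched as a shell script. This is a dry-run sketch, not our actual automation: the eksctl stub below only echoes what would run, so the sequence can be exercised without touching AWS (drop the stub to run it for real). The flag names follow eksctl's documented -f/--include/--approve interface; the config path is an assumption.

```shell
# Stub so the flow can be exercised without AWS; remove to run for real.
eksctl() { echo "eksctl $*"; }

CONFIG=/tmp/cluster.yaml

# 1. Create the temp node group with the new AMI/config.
eksctl create nodegroup -f "$CONFIG" --include windows-temp-ng
# 2. Once windows-temp-ng joins the cluster, destroy the original group.
eksctl delete nodegroup -f "$CONFIG" --include windows-ng --approve
# 3. Re-create windows-ng with the same config as the temp group.
eksctl create nodegroup -f "$CONFIG" --include windows-ng
# 4. Once the final windows-ng joins, delete the temp group.
eksctl delete nodegroup -f "$CONFIG" --include windows-temp-ng --approve
```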
What happened?
At any of the create nodegroup steps in the flow, we started to get the following error:
Tue, 21 Jun 2022 00:44:22 GMT TASK [eks : recreate primary windows node group from cluster.yaml] *************
Tue, 21 Jun 2022 00:44:22 GMT Tuesday 21 June 2022 00:44:22 +0000 (0:05:52.301) 0:08:43.964 **********
Tue, 21 Jun 2022 01:01:42 GMT fatal: [localhost]: FAILED! => {
Tue, 21 Jun 2022 01:01:42 GMT "changed": true,
Tue, 21 Jun 2022 01:01:42 GMT "cmd": [
Tue, 21 Jun 2022 01:01:42 GMT "eksctl",
Tue, 21 Jun 2022 01:01:42 GMT "create",
Tue, 21 Jun 2022 01:01:42 GMT "nodegroup",
Tue, 21 Jun 2022 01:01:42 GMT "-f",
Tue, 21 Jun 2022 01:01:42 GMT "/tmp/cluster.yaml",
Tue, 21 Jun 2022 01:01:42 GMT "--include",
Tue, 21 Jun 2022 01:01:42 GMT "windows-ng"
Tue, 21 Jun 2022 01:01:42 GMT ],
Tue, 21 Jun 2022 01:01:42 GMT "delta": "0:17:20.134806",
Tue, 21 Jun 2022 01:01:42 GMT "end": "2022-06-21 01:01:42.767383",
Tue, 21 Jun 2022 01:01:42 GMT "rc": 1,
Tue, 21 Jun 2022 01:01:42 GMT "start": "2022-06-21 00:44:22.632577"
Tue, 21 Jun 2022 01:01:42 GMT }
Tue, 21 Jun 2022 01:01:42 GMT
Tue, 21 Jun 2022 01:01:42 GMT STDOUT:
Tue, 21 Jun 2022 01:01:42 GMT
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:44:28 [ℹ] nodegroup "windows-ng" will use "ami-059c60541a2ffa6c9" [WindowsServer2019FullContainer/1.21]
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:44:28 [ℹ] nodegroup "linux" will use "" [AmazonLinux2/1.21]
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:44:28 [ℹ] nodegroup "spot-linux" will use "" [AmazonLinux2/1.21]
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:44:32 [ℹ] 3 existing nodegroup(s) (linux,spot-linux,windows-ng-temp) will be excluded
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:44:32 [ℹ] combined include rules: windows-ng
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:44:32 [ℹ] 1 nodegroup (windows-ng) was included (based on the include/exclude rules)
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:44:32 [ℹ] will create a CloudFormation stack for each of 1 nodegroups in cluster "staging-main41"
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:44:32 [ℹ]
Tue, 21 Jun 2022 01:01:42 GMT 2 sequential tasks: { fix cluster compatibility, 1 task: { 1 task: { create nodegroup "windows-ng" } }
Tue, 21 Jun 2022 01:01:42 GMT }
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:44:32 [ℹ] checking cluster stack for missing resources
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:44:36 [ℹ] cluster stack has all required resources
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:44:36 [ℹ] building nodegroup stack "eksctl-staging-main41-nodegroup-windows-ng"
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:44:36 [ℹ] deploying stack "eksctl-staging-main41-nodegroup-windows-ng"
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:44:36 [ℹ] waiting for CloudFormation stack "eksctl-staging-main41-nodegroup-windows-ng"
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:45:07 [ℹ] waiting for CloudFormation stack "eksctl-staging-main41-nodegroup-windows-ng"
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:45:46 [ℹ] waiting for CloudFormation stack "eksctl-staging-main41-nodegroup-windows-ng"
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:46:54 [ℹ] waiting for CloudFormation stack "eksctl-staging-main41-nodegroup-windows-ng"
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:47:25 [ℹ] waiting for CloudFormation stack "eksctl-staging-main41-nodegroup-windows-ng"
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:48:01 [ℹ] waiting for CloudFormation stack "eksctl-staging-main41-nodegroup-windows-ng"
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:48:03 [ℹ] no tasks
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:48:03 [ℹ] adding identity "arn:aws:iam::xxxx:role/eksctl-staging-main41-nodegroup-w-NodeInstanceRole-6LO8B0ZYXR17" to auth ConfigMap
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:48:03 [ℹ] nodegroup "windows-ng" has 0 node(s)
Tue, 21 Jun 2022 01:01:42 GMT 2022-06-21 00:48:03 [ℹ] waiting for at least 4 node(s) to become ready in "windows-ng"
Tue, 21 Jun 2022 01:01:42 GMT
Tue, 21 Jun 2022 01:01:42 GMT
Tue, 21 Jun 2022 01:01:42 GMT STDERR:
Tue, 21 Jun 2022 01:01:42 GMT
Tue, 21 Jun 2022 01:01:42 GMT Error: re-listing nodes: Unauthorized
Tue, 21 Jun 2022 01:01:42 GMT
Tue, 21 Jun 2022 01:01:42 GMT
Tue, 21 Jun 2022 01:01:42 GMT MSG:
Tue, 21 Jun 2022 01:01:42 GMT
Tue, 21 Jun 2022 01:01:42 GMT non-zero return code
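One hypothesis consistent with the timings (and with the 10-minute-ok / 15-minute-fail repro later in this thread): the presigned STS token eksctl uses to talk to the cluster is only valid for roughly 15 minutes, so any nodegroup wait longer than that outlives the credentials. The arithmetic below is just an illustration of that hypothesis using the "delta" field from the failed run above:

```shell
# "delta": "0:17:20.134806" from the log above, rounded to whole seconds.
run_seconds=$((17 * 60 + 20))
# EKS authentication tokens are valid for about 15 minutes.
token_ttl_seconds=$((15 * 60))
overrun=$((run_seconds - token_ttl_seconds))
echo "run outlived the auth token by ${overrun}s"   # → run outlived the auth token by 140s
```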
How to reproduce it?
I have no documented cases of this issue occurring when creating a cluster directly from our config. It seems to only happen when running eksctl create nodegroup. Our automation triggers this when it detects that the AMI or some other component of the existing node group is not at the desired version.
Here's the config file template we render when creating the cluster or upgrading nodes. If you need a rendered example, I can provide that as well. I will note that each environment where we see this is Datadog-enabled, so the preBootstrapCommands are executed. We happen to be removing these bootstrap commands, but haven't tested enough yet to say whether that has any impact here.
apiVersion: 'eksctl.io/v1alpha5'
kind: 'ClusterConfig'
metadata:
  name: '{{ cluster_name }}'
  region: '{{ deploy_region }}'
  version: '{{ k8s_version }}'
vpc:
  id: '{{ vpc_id }}'
  cidr: '{{ vpc_cidr }}'
  subnets:
    private:
      us-east-2a:
        id: '{{ private_subnet_a }}'
        cidr: '{{ private_subnet_a_cidr_block }}'
      us-east-2b:
        id: '{{ private_subnet_b }}'
        cidr: '{{ private_subnet_b_cidr_block }}'
      us-east-2c:
        id: '{{ private_subnet_c }}'
        cidr: '{{ private_subnet_c_cidr_block }}'
    public:
      us-east-2a:
        id: '{{ public_subnet_a }}'
        cidr: '{{ public_subnet_a_cidr_block }}'
      us-east-2b:
        id: '{{ public_subnet_b }}'
        cidr: '{{ public_subnet_b_cidr_block }}'
      us-east-2c:
        id: '{{ public_subnet_c }}'
        cidr: '{{ public_subnet_c_cidr_block }}'
iam:
  withOIDC: true
  serviceAccounts:
    - snip out this for compactness
managedNodeGroups:
  - name: '{{ linux_node_group_name }}'
    labels:
      accelerated-computing: "false"
      is-spot-instance: "false"
    tags:
{% if datadog_enabled %}
      datadog: '{{ datadog_monitoring }}'
{% endif %}
    instanceType: '{{ linux_instance_type }}'
    desiredCapacity: {{ linux_desired_capacity }}
    minSize: {{ linux_min_size }}
    maxSize: {{ linux_max_size }}
    securityGroups:
      attachIDs:
        - '{{ rds_app_security_group }}' # this is temporary, and should be replaced with iam based auth at the pod level
  - name: '{{ spot_linux_node_group_name }}'
    spot: true
    labels:
      '{{ spot_linux_node_group_name }}': "true"
      accelerated-computing: "false"
      is-spot-instance: "true"
    tags:
      k8s.io/cluster-autoscaler/node-template/label/{{ spot_linux_node_group_name }}: "true"
      k8s.io/cluster-autoscaler/node-template/label/accelerated-computing: "false"
      k8s.io/cluster-autoscaler/node-template/label/is-spot-instance: "true"
{% if datadog_enabled %}
      datadog: '{{ datadog_monitoring }}'
{% endif %}
    minSize: {{ spot_linux_min_size }}
    maxSize: {{ spot_linux_max_size }}
    desiredCapacity: {{ spot_linux_desired_capacity }}
    instanceTypes: {{ spot_linux_instance_types_array }}
    securityGroups:
      attachIDs:
        - '{{ rds_app_security_group }}' # this is temporary, and should be replaced with iam based auth at the pod level
nodeGroups:
  - name: '{{ windows_node_group_name }}'
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/{{ cluster_name }}: "owned"
      instance-type: '{{ windows_instance_type }}'
      eks:nodegroup-name: '{{ windows_node_group_name }}'
      ami: '{{ latest_windows_ami }}'
{% if datadog_enabled %}
      datadog: '{{ datadog_monitoring }}'
      datadog-agent: "7-latest"
{% endif %}
    ami: '{{ latest_windows_ami }}'
    amiFamily: 'WindowsServer2019FullContainer'
    instanceType: '{{ windows_instance_type }}'
    minSize: {{ windows_min_size }}
    maxSize: {{ windows_max_size }}
    desiredCapacity: {{ windows_desired_capacity }}
    securityGroups:
      attachIDs:
        - '{{ rds_app_security_group }}' # this is temporary, and should be replaced with iam based auth at the pod level
    preBootstrapCommands:
{% if datadog_enabled %}
      - Invoke-WebRequest https://s3.amazonaws.com/ddagent-windows-stable/datadog-agent-7-latest.amd64.msi -OutFile datadog-agent-7-latest.amd64.msi
      - Start-Process -Wait msiexec -ArgumentList '/qn /i datadog-agent-7-latest.amd64.msi APIKEY="{{ datadog_api_key }}" SITE={{ datadog_site }} LOGS_ENABLED="true" APM_ENABLED="true" PROCESS_ENABLED="true" ADDLOCAL="MainApplication,NPM" EC2_USE_WINDOWS_PREFIX_DETECTION="true"'
{% endif %}
cloudWatch:
  clusterLogging:
    enableTypes: {{ cluster_logging_types }}
Logs are provided above, but if additional logging with increased verbosity is needed, I will add it here.
Anything else we need to know? We run eksctl from a Docker container, installed via:
# Install eksctl
echo "Installing eksctl..." && \
curl --silent --location "https://github.com/weaveworks/eksctl/releases/download/v$EKSCTL_VERSION/eksctl_Linux_amd64.tar.gz" | tar xz -C /tmp && \
mv /tmp/eksctl /usr/local/bin
Versions
As mentioned, this seems to have started around when we upgraded from 0.89.0 to 0.99.0. Currently we're running:
# eksctl info
eksctl version: 0.101.0
kubectl version: v1.22.10
OS: linux
This seems to be triggered by the node taking a long time to come up, and by adding the nodegroup separately from cluster creation. I tried this with a 10-minute sleep and it didn't trigger the issue, but a 15-minute sleep does so reliably:
# cat cluster.yaml
apiVersion: 'eksctl.io/v1alpha5'
kind: 'ClusterConfig'
metadata:
  name: ngtest1
  region: us-east-2
  version: '1.22'
nodeGroups:
  - name: 'l1'
    instanceType: 'm5a.large'
    minSize: 1
    maxSize: 1
    desiredCapacity: 1
    preBootstrapCommands:
      - sleep 900
# eksctl create cluster --without-nodegroup -f cluster.yaml --install-nvidia-plugin=false
[...]
# eksctl create nodegroup -f cluster.yaml --install-nvidia-plugin=false
[...]
Error: re-listing nodes: Unauthorized
My versions:
# eksctl info
eksctl version: 0.105.0-dev+aa76f1d4.2022-07-08T14:38:11Z
kubectl version: v1.22.11
OS: darwin
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
I believe this still needs attention and should not be marked stale. I have just reproduced this with version 0.109.0-dev+78243b4c.2022-08-19T12:48:07Z using the small recipe above.
The latest eksctl release has some changes I pushed that might've fixed the issue. @pcharlan, do you have logs from the failed run and the ClusterConfig file?
@cPu1 I just reproduced using eksctl 0.112. For us it is an intermittent issue in our provisioning pipeline, which retriggers itself (back-to-back provision and destroy). Please let me know whether I can provide additional information.
P.S. we are using Linux-only nodes
@hazanmor, yes, please share the full logs, redacting any sensitive information, and the ClusterConfig file.
@cPu1 , yes, it's still happening for me with version:
eksctl version: 0.111.0-dev+9a99e9218.2022-09-09T18:35:18Z
kubectl version: v1.22.14
OS: darwin
My cluster config file and recipe for reproducing are in my comment above, https://github.com/weaveworks/eksctl/issues/5492#issuecomment-1180664206
The output from the run is:
2022-09-21 09:33:33 [ℹ] nodegroup "l1" will use "ami-0e29f637618ce9a89" [AmazonLinux2/1.22]
2022-09-21 09:33:37 [ℹ] 1 nodegroup (l1) was included (based on the include/exclude rules)
2022-09-21 09:33:37 [ℹ] will create a CloudFormation stack for each of 1 nodegroups in cluster "ngtest1"
2022-09-21 09:33:37 [ℹ]
2 sequential tasks: { fix cluster compatibility, 1 task: { 1 task: { create nodegroup "l1" } }
}
2022-09-21 09:33:37 [ℹ] checking cluster stack for missing resources
2022-09-21 09:33:39 [ℹ] cluster stack has all required resources
2022-09-21 09:33:39 [ℹ] building nodegroup stack "eksctl-ngtest1-nodegroup-l1"
2022-09-21 09:33:39 [ℹ] deploying stack "eksctl-ngtest1-nodegroup-l1"
2022-09-21 09:33:40 [ℹ] waiting for CloudFormation stack "eksctl-ngtest1-nodegroup-l1"
2022-09-21 09:34:10 [ℹ] waiting for CloudFormation stack "eksctl-ngtest1-nodegroup-l1"
2022-09-21 09:34:51 [ℹ] waiting for CloudFormation stack "eksctl-ngtest1-nodegroup-l1"
2022-09-21 09:36:00 [ℹ] waiting for CloudFormation stack "eksctl-ngtest1-nodegroup-l1"
2022-09-21 09:37:20 [ℹ] waiting for CloudFormation stack "eksctl-ngtest1-nodegroup-l1"
2022-09-21 09:37:20 [ℹ] no tasks
2022-09-21 09:37:20 [ℹ] adding identity "arn:aws:iam::[REDACTED]:role/eksctl-ngtest1-nodegroup-l1-NodeInstanceRole-3G6PTO87IBZX" to auth ConfigMap
2022-09-21 09:37:21 [ℹ] nodegroup "l1" has 0 node(s)
2022-09-21 09:37:21 [ℹ] waiting for at least 1 node(s) to become ready in "l1"
Error: re-listing nodes: Unauthorized
I install eksctl using brew on macOS. If there's a way to try a newer build, I'm happy to do that if the devs can't reproduce using my config. Thanks!
@pcharlan, thanks for the details. We're going to work on a fix soon.
FYI the issue is still present on
eksctl version: 0.114.0-dev+48660cbd1.2022-10-08T01:55:00Z
kubectl version: v1.22.15
OS: darwin
I have identified that the issue lies with adding the NodeInstanceRole to the aws-auth ConfigMap. When I add the missing roles using a script, the nodes join the cluster and it works. The command used: eksctl create iamidentitymapping --cluster $CLUSTER_NAME --arn $line --group system:nodes --group system:bootstrappers --username system:node:{{EC2PrivateDNSName}} --region $REGION
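A minimal sketch of that workaround script: loop over the missing NodeInstanceRole ARNs and map each one into aws-auth with eksctl create iamidentitymapping. The eksctl stub below echoes instead of calling AWS so the loop can be exercised safely; CLUSTER_NAME, REGION, and the roles.txt input file are assumptions, not the commenter's actual script.

```shell
# Stub so the loop can be exercised without AWS; remove to run for real.
eksctl() { echo "eksctl $*"; }

CLUSTER_NAME=staging-main41   # assumed; from the logs earlier in this issue
REGION=us-east-2              # assumed
# roles.txt: one NodeInstanceRole ARN per line (example entry shown).
printf '%s\n' 'arn:aws:iam::111122223333:role/example-NodeInstanceRole' > roles.txt

while IFS= read -r line; do
  eksctl create iamidentitymapping \
    --cluster "$CLUSTER_NAME" \
    --arn "$line" \
    --group system:nodes --group system:bootstrappers \
    --username 'system:node:{{EC2PrivateDNSName}}' \
    --region "$REGION"
done < roles.txt
```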
@cPu1 does your fix align with what @hazanmor mentions?
Yes, it does.
Thanks @cPu1! The 0.116.0 release made it to brew, and I've tested it locally and everything worked as expected.
@pcharlan, great! Thanks for trying it out.