Kops cluster upgrade from 1.28.7 to 1.29.2 - warm-pool instances join the cluster and remain in NotReady state
/kind bug
1. What kops version are you running? The command kops version will display
this information.
1.29.2 (git-v1.29.2)
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
v1.29.6
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
After editing kops config with the new k8s version I ran the following commands:
kops get assets --copy --state $KOPS_REMOTE_STATE
kops update cluster $CLUSTER_NAME --state $KOPS_REMOTE_STATE --allow-kops-downgrade
kops update cluster $CLUSTER_NAME --yes --state $KOPS_REMOTE_STATE
kops rolling-update cluster $CLUSTER_NAME --state $KOPS_REMOTE_STATE
kops rolling-update cluster $CLUSTER_NAME --yes --state $KOPS_REMOTE_STATE --post-drain-delay 75s --drain-timeout 30m
5. What happened after the commands executed?
The cluster upgrade started smoothly. The master nodes were updated successfully; however, an issue arose while the warmPool autoscaling groups were being updated. The update became stuck because warm-pool instances were joining the cluster instead of simply warming up and then powering off.
The following error kept appearing in the kops rolling-update logs:
I1002 12:02:19.415658 31 instancegroups.go:565] Cluster did not pass validation, will retry in "30s": node "i-04b854ec78e845f96" of role "node" is not ready, system-node-critical pod "aws-node-4chll" is pending, system-node-critical pod "ebs-csi-node-wcz74" is pending, system-node-critical pod "efs-csi-node-7q2j8" is pending, system-node-critical pod "kube-proxy-i-04b854ec78e845f96" is pending, system-node-critical pod "node-local-dns-mdvq7" is pending.
Those nodes were shown as 'NotReady,SchedulingDisabled' by 'kubectl get nodes'. I waited for 10 minutes with no progress, then manually deleted the problematic nodes. That resolved the issue and allowed the cluster upgrade to resume smoothly.
After completing the upgrade, I ran another test by manually removing warmed-up instances from the AWS console. This led to the creation of new warm-pool instances, which were again added to the k8s cluster. These newly added nodes remained in a 'NotReady,SchedulingDisabled' state until I removed them manually.
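For reference, the manual cleanup each time was roughly the following (a minimal sketch; it assumes the only nodes in NotReady,SchedulingDisabled are the stuck warm-pool ones):
kubectl get nodes | grep 'NotReady,SchedulingDisabled'   # list the stuck nodes
kubectl delete node i-04b854ec78e845f96                  # delete each stuck Node object so validation can proceed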
Cluster-autoscaler logs for one of those nodes:
1002 13:02:34.149584 1 pre_filtering_processor.go:57] Node i-0cfcda3548f955e05 should not be processed by cluster autoscaler (no node group config)
And the relevant log line from the kops-controller:
E1002 13:02:10.796429 1 controller.go:329] "msg"="Reconciler error" "error"="error identifying node \"i-0cfcda3548f955e05\": found instance \"i-0cfcda3548f955e05\", but state is \"stopped\"" "Node"={"name":"i-0cfcda3548f955e05"} "controller"="node" "controllerGroup"="" "controllerKind"="Node" "name"="i-0cfcda3548f955e05" "namespace"="" "reconcileID"="b532008b-db8f-4273-90ad-f0bf9d40858c"
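The kops-controller error indicates the instance had already been stopped back into the warm pool while its Node object was still registered. A minimal way to confirm this from the AWS side (standard AWS CLI commands; the ASG name below is illustrative, kops names it <instancegroup>.<clustername>):
aws autoscaling describe-auto-scaling-instances --instance-ids i-0cfcda3548f955e05 --query 'AutoScalingInstances[].LifecycleState'
aws autoscaling describe-warm-pool --auto-scaling-group-name workers-app.develop.company.com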
Also, for some reason kube-system pods on those nodes are stuck pending:
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system aws-node-2dflq 0/2 Init:0/1 0 52m
kube-system aws-node-58x6z 0/2 Init:0/1 0 46m
kube-system aws-node-cmdrr 0/2 Init:0/1 0 54m
kube-system aws-node-sw7dv 0/2 Init:0/1 0 50m
kube-system ebs-csi-node-fbg7j 0/3 ContainerCreating 0 50m
kube-system ebs-csi-node-k5nx5 0/3 ContainerCreating 0 52m
kube-system ebs-csi-node-l82xf 0/3 ContainerCreating 0 48m
kube-system ebs-csi-node-qfg4w 0/3 ContainerCreating 0 54m
kube-system ebs-csi-node-ws7j2 0/3 ContainerCreating 0 46m
kube-system efs-csi-node-dwk4s 0/3 ContainerCreating 0 46m
kube-system efs-csi-node-g5bq8 0/3 ContainerCreating 0 52m
kube-system efs-csi-node-qg5qb 0/3 ContainerCreating 0 54m
kube-system efs-csi-node-tgcxj 0/3 ContainerCreating 0 50m
kube-system kube-proxy-i-0480ae46ad3230afc 0/1 Terminating 0 52m
kube-system kube-proxy-i-04bb59a89abc8b937 0/1 Terminating 0 50m
kube-system kube-proxy-i-0742a7e208af5b1ac 0/1 Terminating 0 46m
kube-system kube-proxy-i-0ae3c43b10efef605 0/1 Terminating 0 54m
kube-system node-local-dns-77r8p 0/1 ContainerCreating 0 52m
kube-system node-local-dns-tlcwg 0/1 ContainerCreating 0 54m
kube-system node-local-dns-vc4z2 0/1 ContainerCreating 0 50m
6. What did you expect to happen? I expected the warm-pool nodes to be started and subsequently shut down without being joined to the cluster.
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
creationTimestamp: null
generation: 4
name: develop.company.com
spec:
api:
loadBalancer:
class: Network
sslCertificate: arn:aws:acm:eu-west-1:1234:certificate/1111
type: Internal
assets:
containerProxy: public.ecr.aws/12344
fileRepository: https://bucket.s3.eu-west-1.amazonaws.com/
authentication:
aws: {}
authorization:
rbac: {}
certManager:
defaultIssuer: selfsigned
enabled: true
channel: stable
cloudLabels:
Prometheus: "true"
aws-region: eu-west-1
cloudProvider: aws
configBase: s3://tf-remotestate-eu-west-1-123456/kops/develop.company.com
dnsZone: ###
etcdClusters:
- cpuRequest: 200m
etcdMembers:
- instanceGroup: master-eu-west-1a
name: eu-west-1a
- instanceGroup: master-eu-west-1b
name: eu-west-1b
- instanceGroup: master-eu-west-1c
name: eu-west-1c
manager:
env:
- name: ETCD_LISTEN_METRICS_URLS
value: http://0.0.0.0:8081
- name: ETCD_METRICS
value: basic
memoryRequest: 100Mi
name: main
version: 3.4.13
- cpuRequest: 100m
etcdMembers:
- instanceGroup: master-eu-west-1a
name: eu-west-1a
- instanceGroup: master-eu-west-1b
name: eu-west-1b
- instanceGroup: master-eu-west-1c
name: eu-west-1c
memoryRequest: 100Mi
name: events
version: 3.4.13
externalPolicies:
master:
- arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
node:
- arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
- arn:aws:iam::1234:policy/nodes-extra.develop.company.com
fileAssets:
- content: |
# https://raw.githubusercontent.com/kubernetes/website/master/content/en/examples/audit/audit-policy.yaml
apiVersion: audit.k8s.io/v1 # This is required.
kind: Policy
# Don't generate audit events for all requests in RequestReceived stage.
omitStages:
- "RequestReceived"
rules:
# Log pod changes at RequestResponse level
- level: RequestResponse
resources:
- group: ""
# Resource "pods" doesn't match requests to any subresource of pods,
# which is consistent with the RBAC policy.
resources: ["pods"]
# Log "pods/log", "pods/status" at Metadata level
- level: Metadata
resources:
- group: ""
resources: ["pods/log", "pods/status"]
# Don't log requests to a configmap called "controller-leader"
- level: None
resources:
- group: ""
resources: ["configmaps"]
resourceNames: ["controller-leader"]
# Don't log watch requests by the "system:kube-proxy" on endpoints or services
- level: None
users: ["system:kube-proxy"]
verbs: ["watch"]
resources:
- group: "" # core API group
resources: ["endpoints", "services"]
# Don't log authenticated requests to certain non-resource URL paths.
- level: None
userGroups: ["system:authenticated"]
nonResourceURLs:
- "/api*" # Wildcard matching.
- "/version"
# Log the request body of configmap changes in kube-system.
- level: Request
resources:
- group: "" # core API group
resources: ["configmaps"]
# This rule only applies to resources in the "kube-system" namespace.
# The empty string "" can be used to select non-namespaced resources.
namespaces: ["kube-system"]
# Log configmap and secret changes in all other namespaces at the Metadata level.
- level: Metadata
resources:
- group: "" # core API group
resources: ["secrets", "configmaps"]
# Log all other resources in core and extensions at the Request level.
- level: Request
resources:
- group: "" # core API group
- group: "extensions" # Version of group should NOT be included.
# A catch-all rule to log all other requests at the Metadata level.
- level: Metadata
# Long-running requests like watches that fall under this rule will not
# generate an audit event in RequestReceived.
omitStages:
- "RequestReceived"
name: kubernetes-audit.yaml
path: /srv/kubernetes/assets/audit.yaml
roles:
- Master
iam:
allowContainerRegistry: true
legacy: false
serviceAccountExternalPermissions:
- aws:
policyARNs:
- arn:aws:iam::1234:policy/dub-company-aws-efs-csi-driver
name: efs-csi-controller-sa
namespace: kube-system
- aws:
policyARNs:
- arn:aws:iam::1234:policy/dub-company-aws-lb-controller
name: aws-lb-controller-aws-load-balancer-controller
namespace: kube-system
- aws:
policyARNs:
- arn:aws:iam::1234:policy/dub-company-cluster-autoscaler
name: cluster-autoscaler-aws-cluster-autoscaler
namespace: kube-system
kubeAPIServer:
authenticationTokenWebhookConfigFile: /srv/kubernetes/aws-iam-authenticator/kubeconfig.yaml
runtimeConfig:
autoscaling/v2beta1: "true"
kubeControllerManager:
horizontalPodAutoscalerCpuInitializationPeriod: 20s
horizontalPodAutoscalerDownscaleDelay: 5m0s
horizontalPodAutoscalerDownscaleStabilization: 5m0s
horizontalPodAutoscalerInitialReadinessDelay: 20s
horizontalPodAutoscalerSyncPeriod: 5s
horizontalPodAutoscalerTolerance: 100m
horizontalPodAutoscalerUpscaleDelay: 3m0s
kubeDNS:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kops.k8s.io/instancegroup
operator: In
values:
- workers-misc
externalCoreFile: |
amazonaws.com:53 {
errors
log . {
class denial error
}
health :8084
prometheus :9153
forward . 169.254.169.253 {
}
cache 30
}
.:53 {
errors
health :8080
ready :8181
autopath @kubernetes
kubernetes cluster.local {
pods verified
fallthrough in-addr.arpa ip6.arpa
}
prometheus :9153
forward . 169.254.169.253
cache 300
}
nodeLocalDNS:
cpuRequest: 25m
enabled: true
memoryRequest: 5Mi
provider: CoreDNS
tolerations:
- effect: NoSchedule
operator: Exists
kubeProxy:
metricsBindAddress: 0.0.0.0
kubelet:
anonymousAuth: false
authenticationTokenWebhook: true
authorizationMode: Webhook
maxPods: 35
resolvConf: /etc/resolv.conf
kubernetesApiAccess:
- 10.0.0.0/8
kubernetesVersion: 1.29.6
masterPublicName: api.develop.company.com
networkCIDR: 10.0.128.0/20
networkID: vpc-1234
networking:
amazonvpc:
env:
- name: WARM_IP_TARGET
value: "5"
- name: MINIMUM_IP_TARGET
value: "8"
- name: DISABLE_METRICS
value: "true"
nonMasqueradeCIDR: 100.64.0.0/10
podIdentityWebhook:
enabled: true
rollingUpdate:
maxSurge: 100%
serviceAccountIssuerDiscovery:
discoveryStore: s3://infra-eu-west-1-discovery
enableAWSOIDCProvider: true
sshAccess:
- 10.0.0.0/8
subnets:
- cidr: 10.0.128.0/22
id: subnet-123
name: eu-west-1a
type: Private
zone: eu-west-1a
- cidr: 10.0.132.0/22
id: subnet-123
name: eu-west-1b
type: Private
zone: eu-west-1b
- cidr: 10.0.136.0/22
id: subnet-132
name: eu-west-1c
type: Private
zone: eu-west-1c
- cidr: 10.0.140.0/24
id: subnet-1123
name: utility-eu-west-1a
type: Utility
zone: eu-west-1a
- cidr: 10.0.141.0/24
id: subnet-132
name: utility-eu-west-1b
type: Utility
zone: eu-west-1b
- cidr: 10.0.142.0/24
id: subnet-123
name: utility-eu-west-1c
type: Utility
zone: eu-west-1c
topology:
dns:
type: Public
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
creationTimestamp: "2024-10-02T10:12:50Z"
labels:
kops.k8s.io/cluster: develop.company.com
name: master-eu-west-1a
spec:
additionalSecurityGroups:
- sg-1234
cloudLabels:
k8s.io/cluster-autoscaler/develop.company.com: ""
k8s.io/cluster-autoscaler/disabled: ""
k8s.io/cluster-autoscaler/master-template/label: ""
image: ami-09634b5569ee59efb
machineType: t3.large
maxSize: 1
minSize: 1
nodeLabels:
kops.k8s.io/instancegroup: masters
kops.k8s.io/spotinstance: "false"
on-demand: "true"
role: Master
rootVolumeType: gp3
subnets:
- eu-west-1a
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
creationTimestamp: "2024-10-02T10:12:50Z"
labels:
kops.k8s.io/cluster: develop.company.com
name: master-eu-west-1b
spec:
additionalSecurityGroups:
- sg-123
cloudLabels:
k8s.io/cluster-autoscaler/develop.company.com: ""
k8s.io/cluster-autoscaler/disabled: ""
k8s.io/cluster-autoscaler/master-template/label: ""
image: ami-09634b5569ee59efb
machineType: t3.large
maxSize: 1
minSize: 1
nodeLabels:
kops.k8s.io/instancegroup: masters
kops.k8s.io/spotinstance: "false"
on-demand: "true"
role: Master
rootVolumeType: gp3
subnets:
- eu-west-1b
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
creationTimestamp: "2024-10-02T10:12:51Z"
labels:
kops.k8s.io/cluster: develop.company.com
name: master-eu-west-1c
spec:
additionalSecurityGroups:
- sg-123
cloudLabels:
k8s.io/cluster-autoscaler/develop.company.com: ""
k8s.io/cluster-autoscaler/disabled: ""
k8s.io/cluster-autoscaler/master-template/label: ""
image: ami-09634b5569ee59efb
machineType: t3.large
maxSize: 1
minSize: 1
nodeLabels:
kops.k8s.io/instancegroup: masters
kops.k8s.io/spotinstance: "false"
on-demand: "true"
role: Master
rootVolumeType: gp3
subnets:
- eu-west-1c
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
creationTimestamp: "2024-10-02T10:12:51Z"
generation: 2
labels:
kops.k8s.io/cluster: develop.company.com
name: workers-app
spec:
additionalSecurityGroups:
- sg-132
- sg-3322
additionalUserData:
- content: |
#!/bin/bash
echo "Starting additionalUserData"
echo "This script will execute before nodeup.sh because cloud-init executes scripts in alphabetic order by name"
export DEBIAN_FRONTEND=noninteractive
apt-get update
# Install some tools
apt install -y nfs-common # Required to make EFS volume mount
apt install -y containerd # Required for nerdctl to work, container not installed until nodeup runs
echo $(containerd --version)
wget https://github.com/containerd/nerdctl/releases/download/v1.7.2/nerdctl-1.7.2-linux-amd64.tar.gz -O /tmp/nerdctl.tar.gz
tar -C /usr/local/bin/ -xzf /tmp/nerdctl.tar.gz
echo $(nerdctl version)
apt install -y awscli
echo $(aws --version)
# Get some temporary aws ecr credentials
DOCKER_PASSWORD=$(aws ecr get-login-password --region eu-west-1)
DOCKER_USER=AWS
DOCKER_REGISTRY=1234.dkr.ecr.eu-west-1.amazonaws.com
PASSWD=$(echo "$DOCKER_USER:$DOCKER_PASSWORD" | tr -d '\n' | base64 -i -w 0)
CONFIG="\
{\n
\"auths\": {\n
\"$DOCKER_REGISTRY\": {\n
\"auth\": \"$PASSWD\"\n
}\n
}\n
}\n"
mkdir -p ~/.docker
printf "$CONFIG" > ~/.docker/config.json
echo "Pulling images from ECR"
nerdctl pull --namespace k8s.io 1234.dkr.ecr.eu-west-1.amazonaws.com/fluent-bit:2.2.2
nerdctl pull --namespace k8s.io 1234.dkr.ecr.eu-west-1.amazonaws.com/nginx-prometheus-exporter:0.9.0
nerdctl pull --namespace k8s.io public.ecr.aws/1234545/dns/k8s-dns-node-cache:1.23.0
nerdctl pull --namespace k8s.io public.ecr.aws/1234545/amazon-k8s-cni-init:v1.18.1
nerdctl pull --namespace k8s.io public.ecr.aws/1234545/amazon-k8s-cni:v1.18.1
nerdctl pull --namespace k8s.io public.ecr.aws/1234545/kube-proxy:v1.28.11
nerdctl pull --namespace k8s.io public.ecr.aws/1234545/ebs-csi-driver/aws-ebs-csi-driver:v1.30.0
nerdctl pull --namespace k8s.io public.ecr.aws/1234545/eks-distro/kubernetes-csi/node-driver-registrar:v2.10.0-eks-1-29-5
nerdctl pull --namespace k8s.io public.ecr.aws/1234545/kubernetes-csi/livenessprobe:v2.12.0-eks-1-29-5
echo "Remove and unmask containerd so it can be reinstalled by nodeup and configured how it wants it."
apt remove -y containerd
systemctl unmask containerd
echo "Finishing additionalUserData"
name: all-images.sh
type: text/x-shellscript
cloudLabels:
k8s.io/cluster-autoscaler/develop.company.com: ""
k8s.io/cluster-autoscaler/enabled: ""
k8s.io/cluster-autoscaler/node-template/label: ""
image: ami-09634b5569ee59efb
instanceMetadata:
httpPutResponseHopLimit: 1
httpTokens: required
machineType: c5.18xlarge
maxSize: 10
minSize: 1
nodeLabels:
Environment: company-develop
Group: company-develop-app
Name: company-develop-infra-app
Service: company
kops.k8s.io/instancegroup: workers-app
kops.k8s.io/spotinstance: "false"
on-demand: "true"
role: Node
rootVolumeType: gp3
subnets:
- eu-west-1a
- eu-west-1b
- eu-west-1c
suspendProcesses:
- AZRebalance
warmPool:
enableLifecycleHook: true
maxSize: 10
minSize: 5
8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.
9. Anything else do we need to know?
Hi,
I attempted to troubleshoot the issue by performing the following steps:
- Disabling warm pools and then re-enabling them (see the sketch after this list); unfortunately, the issue persisted.
- Upgrading kops to 1.30.1 and Kubernetes to 1.30.2; the problem still persisted.
- Removing the additionalUserData scripts did not resolve the issue either.
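The disable/re-enable step was done at the instance-group level; a minimal sketch of that workflow, assuming the per-IG spec.warmPool block shown in the manifest above is what was removed and re-added:
kops edit ig workers-app --name $CLUSTER_NAME --state $KOPS_REMOTE_STATE   # remove (or restore) the spec.warmPool block
kops update cluster $CLUSTER_NAME --yes --state $KOPS_REMOTE_STATE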
We have the same issue!!!
Hi @hakman, @johngmyers
sorry for the direct ping; last time you helped solve an issue quickly :).
We rely heavily on kOps (we have 40+ clusters) and use warm pools. In the recent 1.29 releases, warm-pool behavior was changed by the following PRs, which introduced the issue described above:
- https://github.com/kubernetes/kops/pull/16603
- https://github.com/kubernetes/kops/pull/15848
We would appreciate it if you could take a look and fix this! If there is any way we can help make it happen quickly, please let us know.
Any update?
Can you SSH into an instance that is still Warming and dump the logs from journalctl -u kops-configuration?
It could be related to https://github.com/kubernetes/kops/pull/16213 or https://github.com/kubernetes/kops/pull/16460/files#diff-0e14cc1cc6d0d21dacab069a7fe628d8c3fc3287a0fb3ad4468194d613a88a5e
Hi @rifelpet,
Thank you for the reply; you can find the log file in the attachment.
Best regards, Deni
kops-configuration.log
Based on your logs, nodeup is definitely skipping the warm-pool logic.
Just to confirm, can you run this on an instance that is still Warming and paste its output here?
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/autoscaling/target-lifecycle-state
Hi @rifelpet,
Here is the output from the command that you sent:
root@ip-10-22-216-163:~# TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
root@ip-10-22-216-163:~# curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/autoscaling/target-lifecycle-state
Warmed:Stopped
Also I am attaching the kops-configuration logs from that machine for a reference.
It does say this at the end of the kops-configuration log:
Dec 06 13:39:21 ip-10-22-216-163 nodeup[2247]: I1206 13:39:21.834281 2247 command.go:422] Found ASG lifecycle hook
Dec 06 13:39:21 ip-10-22-216-163 nodeup[2247]: I1206 13:39:21.986849 2247 command.go:432] Lifecycle action completed
After that the machine is powered off, but it still stays in the Kubernetes cluster:
kubectl get nodes -owide | grep i-067f5984f7c86246c
i-067f5984f7c86246c NotReady,SchedulingDisabled node 15m v1.30.2 10.22.216.163 <none> Ubuntu 20.04.6 LTS 5.15.0-1068-aws containerd://1.7.16
Best regards, Deni
I believe I know what the issue is. Can you test a kops build from this PR?
If you can run the kops CLI on linux amd64, download the kops binary from here:
https://storage.googleapis.com/k8s-staging-kops/pulls/pull-kops-e2e-k8s-aws-amazonvpc/pull-a01e7b806b94881c0300d745349d3ee3254f72b6/1.31.0-beta.2+v1.31.0-beta.1-14-gec9fc7223a/linux/amd64/kops
Otherwise you'll need to check out the branch, run make kops, and use the kops binary built in .build/dist.
Set this environment variable:
export KOPS_BASE_URL="https://storage.googleapis.com/k8s-staging-kops/pulls/pull-kops-e2e-k8s-aws-amazonvpc/pull-a01e7b806b94881c0300d745349d3ee3254f72b6/1.31.0-beta.2+v1.31.0-beta.1-14-gec9fc7223a"
Then run your normal ./kops update cluster --yes and ./kops rolling-update cluster --yes commands using the custom kops CLI build. If this fixes the issue, we can merge and backport it for the next patch releases.
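For example, the full test sequence would look roughly like this (using whatever extra flags you normally pass):
export KOPS_BASE_URL="https://storage.googleapis.com/k8s-staging-kops/pulls/pull-kops-e2e-k8s-aws-amazonvpc/pull-a01e7b806b94881c0300d745349d3ee3254f72b6/1.31.0-beta.2+v1.31.0-beta.1-14-gec9fc7223a"
./kops update cluster $CLUSTER_NAME --yes --state $KOPS_REMOTE_STATE
./kops rolling-update cluster $CLUSTER_NAME --yes --state $KOPS_REMOTE_STATE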
@rifelpet any chance you could rebase that PR on 1.31 stable? I'll give it a try.
@jValdron sure thing, try this: https://github.com/kubernetes/kops/pull/17249
Here's the linux amd64 binary, or build your own from source:
https://storage.googleapis.com/k8s-staging-kops/pulls/pull-kops-e2e-k8s-aws-cilium-1-31/pull-d7454eb7cf8586042e5c36c19ce0fbb6de3629da/1.31.1+v1.31.0-4-gda60162a08/linux/amd64/kops
and set this env var:
KOPS_BASE_URL="https://storage.googleapis.com/k8s-staging-kops/pulls/pull-kops-e2e-k8s-aws-cilium-1-31/pull-d7454eb7cf8586042e5c36c19ce0fbb6de3629da/1.31.1+v1.31.0-4-gda60162a08"
Alright, I might not have the best environment to try this in. We use an ECR pull-through cache, so we end up rewriting the images pulled by kOps, and the nodeup config ends up with this for the warm-pool images:
warmPoolImages:
- <Account ID>.dkr.ecr.us-east-1.amazonaws.com/k8s/cilium/cilium:v1.16.5
- <Account ID>.dkr.ecr.us-east-1.amazonaws.com/k8s/cilium/hubble-relay:v1.16.5
- <Account ID>.dkr.ecr.us-east-1.amazonaws.com/k8s/cilium/operator:v1.16.5
- <Account ID>.dkr.ecr.us-east-1.amazonaws.com/k8s/kube-proxy:v1.30.8
- <Account ID>.dkr.ecr.us-east-1.amazonaws.com/k8s/provider-aws/cloud-controller-manager:v1.30.3
However, ECR requires authentication, so kops-configuration fails with:
Feb 07 15:11:36 <instance ID> nodeup[1515]: ctr: failed to resolve reference "<Account ID>.dkr.ecr.us-east-1.amazonaws.com/k8s/cilium/hubble-relay:v1.16.5": pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials
Feb 07 15:11:36 <instance ID> nodeup[1515]: W0207 15:11:36.630561 1515 executor.go:141] error running task "PullImageTask/<Account ID>.dkr.ecr.us-east-1.amazonaws.com/k8s/kube-proxy:v1.30.8" (3m5s remaining to succeed): error pulling docker image with 'ctr --namespace k8s.io images pull <Account ID>.dkr.ecr.us-east-1.amazonaws.com/k8s/kube-proxy:v1.30.8': exit status 1: time="2025-02-07T15:11:36Z" level=info msg="trying next host" error="pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials" host=<Account ID>.dkr.ecr.us-east-1.amazonaws.com
Feb 07 15:11:36 <instance ID> nodeup[1515]: ctr: failed to resolve reference "<Account ID>.dkr.ecr.us-east-1.amazonaws.com/k8s/kube-proxy:v1.30.8": pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials
Seems like warmPoolImages might be a new addition? Didn't run into this issue before. Is there a way to disable that functionality?
We currently pull images in a custom user-data script using something similar to:
content: |
#!/bin/bash
set -o errexit
set -o nounset
set -o pipefail
echo "installing aws cli"
apt install awscli -y
echo "gathering credentials"
PASSWORD=$(aws ecr get-login-password --region us-east-1)
echo "pulling images"
ctr -n k8s.io image pull <Whatever ECR image>
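(The gathered password is presumably passed to ctr for the actual pull; a minimal sketch of that authenticated pull, reusing one of the warm-pool images above as the example:)
ctr -n k8s.io image pull -u "AWS:${PASSWORD}" <Account ID>.dkr.ecr.us-east-1.amazonaws.com/k8s/kube-proxy:v1.30.8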
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
@rifelpet - do you still need a reproduction of the issue, or do you have everything you need for the fix?
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
@jValdron - how can we help you debug/troubleshoot the issue, so you can provide a fix?
@aramhakobyan in order to test this in my clusters, I would require authentication to ECR with warmPoolImages (probably through IAM instance role in my case). Probably https://github.com/kubernetes/kops/issues/12916.
@aramhakobyan yes, if you can confirm whether the custom kops build fixes the problem, that would be appreciated. Here are the updated URLs:
# kops linux/amd64 CLI
https://storage.googleapis.com/k8s-staging-kops/pulls/pull-kops-e2e-k8s-aws-amazonvpc/pull-42d12644a9c36d3fa3c2c62aaa633ba2eb6e7532/1.33.0-alpha.2+v1.33.0-alpha.1-45-g98a527d703/linux/amd64/kops
# set this for the kops commands
export KOPS_BASE_URL=https://storage.googleapis.com/k8s-staging-kops/pulls/pull-kops-e2e-k8s-aws-amazonvpc/pull-42d12644a9c36d3fa3c2c62aaa633ba2eb6e7532/1.33.0-alpha.2+v1.33.0-alpha.1-45-g98a527d703
@rifelpet - thanks, then we will do it within a week and come back to you!
@rifelpet
We cannot create a cluster with the fixed warm-pool kops build. kops validate does not pass on the control-plane nodes, and the journalctl logs complain about networking and the CSI driver.
12:26:01 VALIDATION ERRORS
12:26:01 KIND NAME MESSAGE
12:26:01 Node i-031e5312a64bf4587 node "i-031e5312a64bf4587" of role "control-plane" is not ready
12:26:01 Node i-068d438e0a4019ba8 node "i-068d438e0a4019ba8" of role "control-plane" is not ready
12:26:01 Node i-0cdcc80aa6f48df8a node "i-0cdcc80aa6f48df8a" of role "control-plane" is not ready
12:26:01
12:26:01 Validation Failed
12:26:01 W0711 12:25:59.304786 2344 validate_cluster.go:238] (will retry): cluster not yet healthy
Jul 11 10:22:44 ip-10-151-29-63 kubelet[3274]: E0711 10:22:44.736898 3274 kubelet.go:2902] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
@rifelpet - could you please update us: did you close the issue because it was solved via #17144?
Yes, I believe that fixed the issue.