Calico Operator Installation got stuck while migrating from manifest installation
We had a manifest-based (non-operator) Calico installation at version v3.18.6 and are now moving to the operator installation. We migrated 5 clusters more or less smoothly (operator version v1.28.0), then moved those 5 to operator version v1.28.1, and after that started migrating the rest of the clusters. While doing that we hit an issue where the installation got stuck.
Expected Behavior
Calico resources are migrated from the kube-system namespace used by the Calico manifests to the new calico-system namespace.
Current Behavior
Typha failed to scale, and the calico-node pods were not moved to the calico-system namespace. The good news is that the calico-node pods are still running in the kube-system namespace.
Possible Solution
- Try the previous operator version
- Change the label on the current nodes from projectcalico.org/operator-node-migration=pre-operator to projectcalico.org/operator-node-migration=migrated, since the calico-node DaemonSet has the latter in its node selector (see the example command after this list)
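For example, a minimal relabeling sketch (assuming you want to flip every node the operator has already marked as pre-operator; --overwrite is needed because the key is already set):
kubectl label nodes -l projectcalico.org/operator-node-migration=pre-operator \
  projectcalico.org/operator-node-migration=migrated --overwrite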
Steps to Reproduce (for bugs)
- Install the Tigera Calico operator and custom resource definitions.
helm upgrade --install calico projectcalico/tigera-operator --version v3.24.1 -f values.yaml --namespace tigera-operator
values.yaml
tigera-operator:
  installation:
    enabled: true
    kubernetesProvider: ""
    calicoNetwork:
      ipPools:
        - blockSize: 26
          cidr: 192.168.0.0/16
          natOutgoing: "Enabled"
          nodeSelector: all()
      nodeAddressAutodetectionV4:
        kubernetes: NodeInternalIP
    nodeMetricsPort: 9091
    registry: registry/docker.io/
  apiServer:
    enabled: false
  resources:
    limits:
      cpu: 1
      memory: "2048Mi"
    requests:
      cpu: 200m
      memory: "1024Mi"
  # Configuration for the tigera operator
  tigeraOperator:
    image: tigera/operator
    version: v1.28.1
    registry: quay.io
  calicoctl:
    image: registry/docker.io/calico/ctl
    tag: v3.24.1
- Monitor the migration status with the following command:
kubectl describe tigerastatus calico
Name: calico
Namespace:
Labels: <none>
Annotations: <none>
API Version: operator.tigera.io/v1
Kind: TigeraStatus
Metadata:
Creation Timestamp: 2022-10-04T11:05:11Z
Generation: 1
Managed Fields:
API Version: operator.tigera.io/v1
Fields Type: FieldsV1
fieldsV1:
f:spec:
Manager: operator
Operation: Update
Time: 2022-10-04T11:05:11Z
API Version: operator.tigera.io/v1
Fields Type: FieldsV1
fieldsV1:
f:status:
.:
f:conditions:
Manager: operator
Operation: Update
Subresource: status
Time: 2022-10-04T11:05:16Z
Resource Version: 191417527
UID: 14524d8a-d93a-4363-890e-fdfa779dac4e
Spec:
Status:
Conditions:
Last Transition Time: 2022-10-04T11:05:21Z
Message: Failed to scale typha - Error: not enough linux nodes to schedule typha pods on, require 1 and have 0
Observed Generation: 2
Reason: ResourceScalingError
Status: True
Type: Degraded
Events: <none>
- However, some Typha pods are created:
kubectl get pods -n calico-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-c9dd49845-npdwx 1/1 Running 0 101m
calico-typha-7f8477c5f6-pqzm9 1/1 Running 0 101m
csi-node-driver-2c4v5 2/2 Running 0 101m
csi-node-driver-2qxpt 0/2 Pending 0 101m
csi-node-driver-4gdvv 0/2 Pending 0 100m
csi-node-driver-4qpvf 2/2 Running 0 101m
csi-node-driver-5r557 2/2 Running 0 101m
csi-node-driver-5x8bt 0/2 Pending 0 101m
csi-node-driver-6mp27 2/2 Running 0 101m
csi-node-driver-7nd5d 0/2 Pending 0 101m
csi-node-driver-7pmrs 0/2 Pending 0 94m
csi-node-driver-8nkgg 2/2 Running 0 101m
- Calico operator log:
2022/10/04 11:05:03 [INFO] Version: v1.28.1
2022/10/04 11:05:03 [INFO] Go Version: go1.17.9b7
2022/10/04 11:05:03 [INFO] Go OS/Arch: linux/amd64
I1004 11:05:04.705332 1 request.go:665] Waited for 1.042182016s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/batch/v1?timeout=32s
2022/10/04 11:05:05 [INFO] Active operator: proceeding
{"level":"info","ts":1664881506.7841918,"logger":"setup","msg":"Checking type of cluster","provider":""}
{"level":"info","ts":1664881506.7851846,"logger":"setup","msg":"Checking if PodSecurityPolicies are supported by the cluster","supported":true}
{"level":"info","ts":1664881506.7869463,"logger":"setup","msg":"Checking if TSEE controllers are required","required":false}
{"level":"info","ts":1664881506.8912253,"logger":"setup","msg":"starting manager"}
{"level":"info","ts":1664881506.8912084,"logger":"typha_autoscaler","msg":"Starting typha autoscaler","syncPeriod":10}
I1004 11:05:06.891301 1 leaderelection.go:248] attempting to acquire leader lease tigera-operator/operator-lock...
I1004 11:05:06.913451 1 leaderelection.go:258] successfully acquired lease tigera-operator/operator-lock
{"level":"info","ts":1664881506.9917126,"logger":"controller.apiserver-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.9917624,"logger":"controller.apiserver-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.991768,"logger":"controller.apiserver-controller","msg":"Starting EventSource","source":"kind source: /V1, Kind=ConfigMap"}
{"level":"info","ts":1664881506.9917734,"logger":"controller.apiserver-controller","msg":"Starting EventSource","source":"kind source: /V1, Kind=Secret"}
{"level":"info","ts":1664881506.991778,"logger":"controller.apiserver-controller","msg":"Starting EventSource","source":"kind source: /V1, Kind=Secret"}
{"level":"info","ts":1664881506.991782,"logger":"controller.apiserver-controller","msg":"Starting EventSource","source":"kind source: /V1, Kind=Secret"}
{"level":"info","ts":1664881506.9917858,"logger":"controller.apiserver-controller","msg":"Starting EventSource","source":"kind source: /V1, Kind=Secret"}
{"level":"info","ts":1664881506.9917898,"logger":"controller.apiserver-controller","msg":"Starting EventSource","source":"kind source: /V1, Kind=Secret"}
{"level":"info","ts":1664881506.991796,"logger":"controller.apiserver-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.9918036,"logger":"controller.apiserver-controller","msg":"Starting Controller"}
{"level":"info","ts":1664881506.9918213,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.9918616,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.991868,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /V1, Kind=Secret"}
{"level":"info","ts":1664881506.9918728,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /V1, Kind=ConfigMap"}
{"level":"info","ts":1664881506.9918792,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /V1, Kind=ConfigMap"}
{"level":"info","ts":1664881506.9918864,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /V1, Kind=ConfigMap"}
{"level":"info","ts":1664881506.9918902,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /V1, Kind=ConfigMap"}
{"level":"info","ts":1664881506.991894,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /V1, Kind=ConfigMap"}
{"level":"info","ts":1664881506.9918995,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.9919055,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.991912,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.9919183,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.9919267,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.9919329,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.991939,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.991947,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.991954,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.9919627,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.9919689,"logger":"controller.tigera-installation-controller","msg":"Starting Controller"}
{"level":"info","ts":1664881507.447693,"logger":"controller.apiserver-controller","msg":"Starting workers","worker count":1}
{"level":"info","ts":1664881507.4478097,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1664881507.4478521,"logger":"controller_apiserver","msg":"APIServer config not found","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1664881507.4488175,"logger":"controller.tigera-installation-controller","msg":"Starting workers","worker count":1}
{"level":"info","ts":1664881507.9692104,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1664881507.9692528,"logger":"controller_apiserver","msg":"APIServer config not found","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1664881507.969508,"logger":"windows_upgrader","msg":"Starting main loop"}
{"level":"info","ts":1664881507.977956,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1664881507.9779913,"logger":"controller_apiserver","msg":"APIServer config not found","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1664881508.3646464,"logger":"controller_installation","msg":"adding active configmap"}
{"level":"info","ts":1664881508.6335487,"logger":"KubeAPIWarningLogger","msg":"policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+"}
{"level":"info","ts":1664881509.5906348,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"tigera-operator","Request.Name":"tigera-ca-private"}
{"level":"info","ts":1664881509.590689,"logger":"controller_apiserver","msg":"APIServer config not found","Request.Namespace":"tigera-operator","Request.Name":"tigera-ca-private"}
{"level":"info","ts":1664881511.8922532,"logger":"status_manager.calico","msg":"Status manager is not ready to report component statuses."}
{"level":"info","ts":1664881514.5702138,"logger":"controller_installation","msg":"Patch NodeSelector with: [{\"op\":\"add\",\"path\":\"/spec/template/spec/nodeSelector/projectcalico.org~1operator-node-migration\",\"value
:\"pre-operator\"}]","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1664881514.7193437,"logger":"KubeAPIWarningLogger","msg":"spec.template.metadata.annotations[scheduler.alpha.kubernetes.io/critical-pod]: non-functional in v1.16+; use the \"priorityClassName\" field i
nstead"}
{"level":"info","ts":1664881514.7232409,"logger":"controller_installation","msg":"waiting for observed generation (2) to match object generation (3)","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1664881516.8920703,"logger":"status_manager.calico","msg":"Status manager is not ready to report component statuses."}
{"level":"error","ts":1664881516.8921304,"logger":"typha_autoscaler","msg":"Failed to autoscale typha","error":"not enough linux nodes to schedule typha pods on, require 1 and have 0"}
{"level":"info","ts":1664881519.7317343,"logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 61 replicas, currently at 60","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1664881521.8914902,"logger":"status_manager.calico","msg":"Status manager is not ready to report component statuses."}
{"level":"info","ts":1664881524.7317426,"logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 61 replicas, currently at 60","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1664881526.8918767,"logger":"status_manager.calico","msg":"Status manager is not ready to report component statuses."}
{"level":"error","ts":1664881526.8919308,"logger":"typha_autoscaler","msg":"Failed to autoscale typha","error":"not enough linux nodes to schedule typha pods on, require 1 and have 0"}
{"level":"info","ts":1664881529.7316747,"logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 61 replicas, currently at 60","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1664881531.8915548,"logger":"status_manager.calico","msg":"Status manager is not ready to report component statuses."}
{"level":"info","ts":1664881534.732588,"logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 61 replicas, currently at 60","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1664881536.8914149,"logger":"status_manager.calico","msg":"Status manager is not ready to report component statuses."}
kubectl get nodes -o name | wc -l
61
kubectl get daemonsets
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
calico-node 0 0 0 0 0 kubernetes.io/os=linux,projectcalico.org/operator-node-migration=migrated 112m
csi-node-driver 52 52 29 52 29 kubernetes.io/os=linux 112m
kubectl describe deployment -n calico-system calico-typha
Name: calico-typha
Namespace: calico-system
CreationTimestamp: Tue, 04 Oct 2022 14:05:08 +0300
Labels: app.kubernetes.io/name=calico-typha
k8s-app=calico-typha
Annotations: deployment.kubernetes.io/revision: 1
Selector: k8s-app=calico-typha
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 1 max unavailable, 25% max surge
Pod Template:
Labels: app.kubernetes.io/name=calico-typha
k8s-app=calico-typha
Annotations: hash.operator.tigera.io/tigera-ca-private: 7d014593994e80bde60e8998e3e165ce9c035ef0
hash.operator.tigera.io/typha-certs: ca40050371046fce3a23d83b0fa418ba8a9bc9b0
Service Account: calico-typha
Containers:
calico-typha:
Image: docker.io/calico/typha:v3.24.1
Port: 5473/TCP
Host Port: 5473/TCP
Liveness: http-get http://localhost:9098/liveness delay=0s timeout=10s period=10s #success=1 #failure=3
Readiness: http-get http://localhost:9098/readiness delay=0s timeout=10s period=10s #success=1 #failure=3
Environment:
TYPHA_LOGSEVERITYSCREEN: info
TYPHA_LOGFILEPATH: none
TYPHA_LOGSEVERITYSYS: none
TYPHA_CONNECTIONREBALANCINGMODE: kubernetes
TYPHA_DATASTORETYPE: kubernetes
TYPHA_HEALTHENABLED: true
TYPHA_HEALTHPORT: 9098
TYPHA_K8SNAMESPACE: calico-system
TYPHA_CAFILE: /etc/pki/tls/certs/tigera-ca-bundle.crt
TYPHA_SERVERCERTFILE: /typha-certs/tls.crt
TYPHA_SERVERKEYFILE: /typha-certs/tls.key
TYPHA_FIPSMODEENABLED: false
TYPHA_CLIENTCN: typha-client
KUBERNETES_SERVICE_HOST: 10.96.0.1
KUBERNETES_SERVICE_PORT: 443
Mounts:
/etc/pki/tls/certs/ from tigera-ca-bundle (ro)
/typha-certs from typha-certs (ro)
Volumes:
tigera-ca-bundle:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: tigera-ca-bundle
Optional: false
typha-certs:
Type: Secret (a volume populated by a Secret)
SecretName: typha-certs
Optional: false
Priority Class Name: system-cluster-critical
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: calico-typha-7f8477c5f6 (1/1 replicas created)
Events: <none>
kubectl get deployment -n calico-system calico-typha -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2022-10-04T11:05:08Z"
  generation: 1
  labels:
    app.kubernetes.io/name: calico-typha
    k8s-app: calico-typha
  name: calico-typha
  namespace: calico-system
  ownerReferences:
  - apiVersion: operator.tigera.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: Installation
    name: default
    uid: d1c46db4-603a-4326-a73b-67c291222c38
  resourceVersion: "191257535"
  uid: 031a69ed-0087-4bc5-ae90-274f8b6ded6b
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      k8s-app: calico-typha
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:
        hash.operator.tigera.io/tigera-ca-private: 7d014593994e80bde60e8998e3e165ce9c035ef0
        hash.operator.tigera.io/typha-certs: ca40050371046fce3a23d83b0fa418ba8a9bc9b0
      creationTimestamp: null
      labels:
        app.kubernetes.io/name: calico-typha
        k8s-app: calico-typha
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: k8s-app
                  operator: In
                  values:
                  - calico-typha
              topologyKey: topology.kubernetes.io/zone
            weight: 1
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                k8s-app: calico-typha
            namespaces:
            - kube-system
            topologyKey: kubernetes.io/hostname
      containers:
      - env:
        - name: TYPHA_LOGSEVERITYSCREEN
          value: info
        - name: TYPHA_LOGFILEPATH
          value: none
        - name: TYPHA_LOGSEVERITYSYS
          value: none
        - name: TYPHA_CONNECTIONREBALANCINGMODE
          value: kubernetes
        - name: TYPHA_DATASTORETYPE
          value: kubernetes
        - name: TYPHA_HEALTHENABLED
          value: "true"
        - name: TYPHA_HEALTHPORT
          value: "9098"
        - name: TYPHA_K8SNAMESPACE
          value: calico-system
        - name: TYPHA_CAFILE
          value: /etc/pki/tls/certs/tigera-ca-bundle.crt
        - name: TYPHA_SERVERCERTFILE
          value: /typha-certs/tls.crt
        - name: TYPHA_SERVERKEYFILE
          value: /typha-certs/tls.key
        - name: TYPHA_FIPSMODEENABLED
          value: "false"
        - name: TYPHA_CLIENTCN
          value: typha-client
        - name: KUBERNETES_SERVICE_HOST
          value: 10.96.0.1
        - name: KUBERNETES_SERVICE_PORT
          value: "443"
        image: docker.io/calico/typha:v3.24.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            host: localhost
            path: /liveness
            port: 9098
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        name: calico-typha
        ports:
        - containerPort: 5473
          hostPort: 5473
          name: calico-typha
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            host: localhost
            path: /readiness
            port: 9098
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/pki/tls/certs/
          name: tigera-ca-bundle
          readOnly: true
        - mountPath: /typha-certs
          name: typha-certs
          readOnly: true
      dnsPolicy: ClusterFirst
      hostNetwork: true
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-cluster-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: calico-typha
      serviceAccountName: calico-typha
      terminationGracePeriodSeconds: 0
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoSchedule
        operator: Exists
      - effect: NoExecute
        operator: Exists
      volumes:
      - configMap:
          defaultMode: 420
          name: tigera-ca-bundle
        name: tigera-ca-bundle
      - name: typha-certs
        secret:
          defaultMode: 420
          secretName: typha-certs
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2022-10-04T11:05:09Z"
    lastUpdateTime: "2022-10-04T11:05:09Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2022-10-04T11:05:09Z"
    lastUpdateTime: "2022-10-04T11:05:16Z"
    message: ReplicaSet "calico-typha-7f8477c5f6" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
kubectl get pods -o wide -n calico-system
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-kube-controllers-c9dd49845-npdwx 1/1 Running 0 118m 192.168.56.232 ip-10-229-91-172.aws-region.compute.internal <none> <none>
calico-typha-7f8477c5f6-pqzm9 1/1 Running 0 118m 10.229.91.172 ip-10-229-91-172.aws-region.compute.internal <none> <none>
csi-node-driver-2c4v5 2/2 Running 0 118m 192.168.153.213 ip-10-229-88-195.aws-region.compute.internal <none> <none>
csi-node-driver-2qxpt 0/2 Pending 0 118m <none> <none> <none> <none>
csi-node-driver-4gdvv 0/2 Pending 0 117m <none> <none> <none> <none>
csi-node-driver-4qpvf 2/2 Running 0 118m 192.168.16.40 ip-10-229-93-106.aws-region.compute.internal <none> <none>
csi-node-driver-5r557 2/2 Running 0 118m 192.168.203.34 ip-10-229-90-165.aws-region.compute.internal <none> <none>
csi-node-driver-5x8bt 0/2 Pending 0 118m <none> <none> <none> <none>
csi-node-driver-6mp27 2/2 Running 0 118m 192.168.147.111 ip-10-229-92-103.aws-region.compute.internal <none> <none>
csi-node-driver-7nd5d 0/2 Pending 0 118m <none> <none> <none> <none>
csi-node-driver-7pmrs 0/2 Pending 0 111m <none> <none> <none> <none>
csi-node-driver-8nkgg 2/2 Running 0 118m 192.168.149.185 ip-10-229-93-0.aws-region.compute.internal <none> <none>
csi-node-driver-8xcpk 2/2 Running 0 43s 192.168.28.52 ip-10-229-95-73.aws-region.compute.internal <none> <none>
csi-node-driver-98jhf 0/2 Pending 0 109m <none> <none> <none> <none>
csi-node-driver-9vbzr 2/2 Running 0 118m 192.168.189.40 ip-10-229-85-111.aws-region.compute.internal <none> <none>
csi-node-driver-b66v2 0/2 Pending 0 111m <none> <none> <none> <none>
csi-node-driver-b94p5 2/2 Running 0 118m 192.168.98.146 ip-10-229-86-234.aws-region.compute.internal <none> <none>
csi-node-driver-bw7cb 2/2 Running 0 23m 192.168.13.87 ip-10-229-85-207.aws-region.compute.internal <none> <none>
csi-node-driver-cj2g9 0/2 Pending 0 107m <none> <none> <none> <none>
csi-node-driver-ckqf5 0/2 Pending 0 118m <none> <none> <none> <none>
csi-node-driver-dpkdt 0/2 Pending 0 107m <none> <none> <none> <none>
csi-node-driver-f686r 2/2 Running 0 118m 192.168.78.2 ip-10-229-88-121.aws-region.compute.internal <none> <none>
csi-node-driver-fl7wj 0/2 Pending 0 118m <none> <none> <none> <none>
csi-node-driver-fxmxg 0/2 Pending 0 118m <none> <none> <none> <none>
csi-node-driver-gcdf6 0/2 Pending 0 110m <none> <none> <none> <none>
csi-node-driver-gfn6z 0/2 Pending 0 109m <none> <none> <none> <none>
csi-node-driver-hgkt6 2/2 Running 0 118m 192.168.194.201 ip-10-229-84-198.aws-region.compute.internal <none> <none>
csi-node-driver-jc592 2/2 Running 0 118m 192.168.21.200 ip-10-229-95-29.aws-region.compute.internal <none> <none>
csi-node-driver-jhrjf 2/2 Running 0 118m 192.168.78.178 ip-10-229-92-25.aws-region.compute.internal <none> <none>
csi-node-driver-jnn7h 2/2 Running 0 118m 192.168.123.172 ip-10-229-91-8.aws-region.compute.internal <none> <none>
csi-node-driver-k5rn4 0/2 Pending 0 118m <none> <none> <none> <none>
csi-node-driver-kmttn 0/2 Pending 0 115m <none> <none> <none> <none>
csi-node-driver-ktmcx 2/2 Running 0 118m 192.168.168.99 ip-10-229-94-162.aws-region.compute.internal <none> <none>
csi-node-driver-ltc46 0/2 Pending 0 103m <none> <none> <none> <none>
csi-node-driver-m2ksc 2/2 Running 0 9m43s 192.168.199.183 ip-10-229-93-48.aws-region.compute.internal <none> <none>
csi-node-driver-nfx4v 0/2 Pending 0 115m <none> <none> <none> <none>
csi-node-driver-nvkhz 2/2 Running 0 118m 192.168.59.241 ip-10-229-93-19.aws-region.compute.internal <none> <none>
csi-node-driver-nxl5c 2/2 Running 0 118m 192.168.47.187 ip-10-229-88-48.aws-region.compute.internal <none> <none>
csi-node-driver-p4vps 2/2 Running 0 118m 192.168.28.133 ip-10-229-86-157.aws-region.compute.internal <none> <none>
csi-node-driver-p9s7h 0/2 Pending 0 109m <none> <none> <none> <none>
csi-node-driver-qhhlm 2/2 Running 0 118m 192.168.46.4 ip-10-229-84-251.aws-region.compute.internal <none> <none>
csi-node-driver-qq8c5 2/2 Running 0 118m 192.168.209.126 ip-10-229-94-37.aws-region.compute.internal <none> <none>
csi-node-driver-rj792 0/2 Pending 0 106m <none> <none> <none> <none>
csi-node-driver-rmv8w 2/2 Running 0 118m 192.168.56.233 ip-10-229-91-172.aws-region.compute.internal <none> <none>
csi-node-driver-rwngt 2/2 Running 0 118m 192.168.155.131 ip-10-229-88-148.aws-region.compute.internal <none> <none>
csi-node-driver-sslsc 2/2 Running 0 118m 192.168.92.200 ip-10-229-89-1.aws-region.compute.internal <none> <none>
csi-node-driver-tsb9t 2/2 Running 0 118m 192.168.35.54 ip-10-229-90-60.aws-region.compute.internal <none> <none>
csi-node-driver-vf6sb 0/2 Pending 0 104m <none> <none> <none> <none>
csi-node-driver-wfqxv 0/2 Pending 0 118m <none> <none> <none> <none>
csi-node-driver-x72d6 2/2 Running 0 118m 192.168.247.37 ip-10-229-86-223.aws-region.compute.internal <none> <none>
csi-node-driver-xgxqr 2/2 Running 0 118m 192.168.226.56 ip-10-229-89-193.aws-region.compute.internal <none> <none>
csi-node-driver-xhltj 0/2 Pending 0 102m <none> <none> <none> <none>
csi-node-driver-zhn5j 2/2 Running 0 118m 192.168.113.80 ip-10-229-89-163.aws-region.compute.internal <none> <none>
csi-node-driver-zz92k 2/2 Running 0 11m 192.168.49.41 ip-10-229-85-235.aws-region.compute.internal <none> <none>
kubectl describe daemonsets.apps -n calico-system
Name: calico-node
Selector: k8s-app=calico-node
Node-Selector: kubernetes.io/os=linux,projectcalico.org/operator-node-migration=migrated
Labels: <none>
Annotations: deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 0
Current Number of Nodes Scheduled: 0
Number of Nodes Scheduled with Up-to-date Pods: 0
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status: 0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app.kubernetes.io/name=calico-node
k8s-app=calico-node
Annotations: hash.operator.tigera.io/cni-config: 3828f6b2f7f9a34e88a4be8ae644087d12a85386
hash.operator.tigera.io/tigera-ca-private: 7d014593994e80bde60e8998e3e165ce9c035ef0
prometheus.io/port: 9091
prometheus.io/scrape: true
Service Account: calico-node
Init Containers:
flexvol-driver:
Image: /docker.io/calico/pod2daemon-flexvol:v3.24.1
Port: <none>
Host Port: <none>
Environment: <none>
Mounts:
/host/driver from flexvol-driver-host (rw)
install-cni:
Image: /docker.io/calico/cni:v3.24.1
Port: <none>
Host Port: <none>
Command:
/opt/cni/bin/install
Environment:
CNI_CONF_NAME: 10-calico.conflist
SLEEP: false
CNI_NET_DIR: /etc/cni/net.d
CNI_NETWORK_CONFIG: <set to the key 'config' of config map 'cni-config'> Optional: false
KUBERNETES_SERVICE_HOST: 10.96.0.1
KUBERNETES_SERVICE_PORT: 443
Mounts:
/host/etc/cni/net.d from cni-net-dir (rw)
/host/opt/cni/bin from cni-bin-dir (rw)
Containers:
calico-node:
Image: /docker.io/calico/node:v3.24.1
Port: <none>
Host Port: <none>
Requests:
cpu: 250m
Liveness: http-get http://localhost:9099/liveness delay=0s timeout=10s period=10s #success=1 #failure=3
Readiness: exec [/bin/calico-node -bird-ready -felix-ready] delay=0s timeout=5s period=10s #success=1 #failure=3
Environment:
DATASTORE_TYPE: kubernetes
WAIT_FOR_DATASTORE: true
CLUSTER_TYPE: k8s,operator,bgp
CALICO_DISABLE_FILE_LOGGING: false
FELIX_DEFAULTENDPOINTTOHOSTACTION: ACCEPT
FELIX_HEALTHENABLED: true
FELIX_HEALTHPORT: 9099
NODENAME: (v1:spec.nodeName)
NAMESPACE: (v1:metadata.namespace)
FELIX_TYPHAK8SNAMESPACE: calico-system
FELIX_TYPHAK8SSERVICENAME: calico-typha
FELIX_TYPHACAFILE: /etc/pki/tls/certs/tigera-ca-bundle.crt
FELIX_TYPHACERTFILE: /node-certs/tls.crt
FELIX_TYPHAKEYFILE: /node-certs/tls.key
FIPS_MODE_ENABLED: false
FELIX_TYPHACN: typha-server
CALICO_MANAGE_CNI: true
CALICO_IPV4POOL_CIDR: 192.168.0.0/16
CALICO_IPV4POOL_IPIP: Always
CALICO_IPV4POOL_BLOCK_SIZE: 26
CALICO_IPV4POOL_NODE_SELECTOR: all()
CALICO_IPV4POOL_DISABLE_BGP_EXPORT: false
FELIX_VXLANMTU: 1440
FELIX_WIREGUARDMTU: 1440
CALICO_NETWORKING_BACKEND: bird
FELIX_IPINIPMTU: 1440
IP: autodetect
IP_AUTODETECTION_METHOD: kubernetes-internal-ip
IP6: none
FELIX_IPV6SUPPORT: false
FELIX_PROMETHEUSMETRICSENABLED: true
FELIX_PROMETHEUSMETRICSPORT: 9091
KUBERNETES_SERVICE_HOST: 10.96.0.1
KUBERNETES_SERVICE_PORT: 443
Mounts:
/etc/pki/tls/certs/ from tigera-ca-bundle (ro)
/host/etc/cni/net.d from cni-net-dir (rw)
/lib/modules from lib-modules (ro)
/node-certs from node-certs (ro)
/run/xtables.lock from xtables-lock (rw)
/var/lib/calico from var-lib-calico (rw)
/var/log/calico/cni from cni-log-dir (rw)
/var/run/calico from var-run-calico (rw)
/var/run/nodeagent from policysync (rw)
Volumes:
lib-modules:
Type: HostPath (bare host directory volume)
Path: /lib/modules
HostPathType:
xtables-lock:
Type: HostPath (bare host directory volume)
Path: /run/xtables.lock
HostPathType: FileOrCreate
policysync:
Type: HostPath (bare host directory volume)
Path: /var/run/nodeagent
HostPathType: DirectoryOrCreate
tigera-ca-bundle:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: tigera-ca-bundle
Optional: false
node-certs:
Type: Secret (a volume populated by a Secret)
SecretName: node-certs
Optional: false
var-run-calico:
Type: HostPath (bare host directory volume)
Path: /var/run/calico
HostPathType:
var-lib-calico:
Type: HostPath (bare host directory volume)
Path: /var/lib/calico
HostPathType:
cni-bin-dir:
Type: HostPath (bare host directory volume)
Path: /opt/cni/bin
HostPathType:
cni-net-dir:
Type: HostPath (bare host directory volume)
Path: /etc/cni/net.d
HostPathType:
cni-log-dir:
Type: HostPath (bare host directory volume)
Path: /var/log/calico/cni
HostPathType:
flexvol-driver-host:
Type: HostPath (bare host directory volume)
Path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~udsnodeagent~uds
HostPathType: DirectoryOrCreate
Priority Class Name: system-node-critical
Events: <none>
Name: csi-node-driver
Selector: k8s-app=csi-node-driver
Node-Selector: kubernetes.io/os=linux
Labels: <none>
Annotations: deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 52
Current Number of Nodes Scheduled: 52
Number of Nodes Scheduled with Up-to-date Pods: 52
Number of Nodes Scheduled with Available Pods: 28
Number of Nodes Misscheduled: 0
Pods Status: 28 Running / 24 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app.kubernetes.io/name=csi-node-driver
k8s-app=csi-node-driver
name=csi-node-driver
Containers:
calico-csi:
Image: /docker.io/calico/csi:v3.24.1
Port: <none>
Host Port: <none>
Args:
--nodeid=$(KUBE_NODE_NAME)
--loglevel=$(LOG_LEVEL)
Environment:
LOG_LEVEL: warn
KUBE_NODE_NAME: (v1:spec.nodeName)
Mounts:
/csi from socket-dir (rw)
/etc/calico from etccalico (rw)
/var/lib/kubelet from kubelet-dir (rw)
/var/run from varrun (rw)
csi-node-driver-registrar:
Image: /docker.io/calico/node-driver-registrar:v3.24.1
Port: <none>
Host Port: <none>
Args:
--v=5
--csi-address=$(ADDRESS)
--kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
Environment:
ADDRESS: /csi/csi.sock
DRIVER_REG_SOCK_PATH: /var/lib/kubelet/plugins/csi.tigera.io/csi.sock
KUBE_NODE_NAME: (v1:spec.nodeName)
Mounts:
/csi from socket-dir (rw)
/registration from registration-dir (rw)
Volumes:
varrun:
Type: HostPath (bare host directory volume)
Path: /var/run
HostPathType:
etccalico:
Type: HostPath (bare host directory volume)
Path: /etc/calico
HostPathType:
kubelet-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet
HostPathType: Directory
socket-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/plugins/csi.tigera.io
HostPathType: DirectoryOrCreate
registration-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/plugins_registry
HostPathType: Directory
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 26m (x94 over 121m) daemonset-controller (combined from similar events): Created pod: csi-node-driver-bw7cb
Normal SuccessfulCreate 14m daemonset-controller Created pod: csi-node-driver-zz92k
Normal SuccessfulCreate 12m daemonset-controller Created pod: csi-node-driver-d94k5
Normal SuccessfulCreate 12m daemonset-controller Created pod: csi-node-driver-m2ksc
Normal SuccessfulCreate 9m10s daemonset-controller Created pod: csi-node-driver-v8w22
Normal SuccessfulCreate 3m8s daemonset-controller Created pod: csi-node-driver-8xcpk
Similar issue: https://github.com/projectcalico/calico/issues/6407
Your Environment
- Calico version
calicoctl version
Client Version: v3.18.6
Git commit: 0f9952e1
Cluster Version: v3.18.6
Cluster Type: k8s,bgp,kubeadm,kubeadm,kdd,typha
- Orchestrator version (e.g. kubernetes, mesos, rkt):
kubernetes v1.22.14
- Operating System and version:
Ubuntu 20.04.4 LTS
So you're seeing
Failed to scale typha - Error: not enough linux nodes to schedule typha pods on, require 1 and have 0
Why does the operator think there are no nodes?
After removing the label projectcalico.org/operator-node-migration=pre-operator from one node,
kubectl label node ip-10-229-92-202.eu-west-1.compute.internal projectcalico.org/operator-node-migration-
the migration started, but all the calico-node DaemonSet pods were having issues connecting to Typha, so I ran
kubectl rollout restart deployment calico-typha -n calico-system
and the migration then went further.
I'm reading the code to understand how it works, but is there any doc explaining the process? For example, this step:
{"level":"info","ts":1664881514.5702138,"logger":"controller_installation","msg":"Patch NodeSelector with: [{\"op\":\"add\",\"path\":\"/spec/template/spec/nodeSelector/projectcalico.org~1operator-node-migration\",\"value
:\"pre-operator\"}]","Request.Namespace":"","Request.Name":"default"}
It is patching nodes with this label, but expects the DaemonSet to have the following selector:
kubectl get daemonsets
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
calico-node 0 0 0 0 0 kubernetes.io/os=linux,projectcalico.org/operator-node-migration=migrated 112m
@caseydavenport seems sub-optimal to me...
Is this in an environment that (say) reconciles labels?
@lwr20 we are using Argo CD, so I guess that might be what you are referring to with regard to label reconciliation?
The migration has now finished:
❯ kubectl get tigerastatus calico
NAME AVAILABLE PROGRESSING DEGRADED SINCE
calico True False False 2m15s
But I can reproduce it: without the manual actions above it gets stuck. I've tried in several clusters. Let me try the previous operator version, as I had no issues with that one.
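For reference, assuming the same values.yaml layout as above, pinning the previous version should just be a matter of changing the operator version there and rerunning the same helm upgrade command, e.g.:
tigera-operator:
  tigeraOperator:
    image: tigera/operator
    version: v1.28.0
    registry: quay.io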
@tmjd knows a bit about both operator and argo-cd
My theory on why it gets stuck:
here we check if linuxNodes < expectedReplicas,
and in our case linuxNodes = 0,
because all nodes were patched by the operator with the label projectcalico.org/operator-node-migration=pre-operator,
and here we skip those nodes. So when I remove that label from any node, the operator can schedule typha and everything goes fine. Am I missing something?
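A quick way to sanity-check that theory on a live cluster (a rough sketch, not the operator's actual code path) is to compare the number of Linux nodes with the number of nodes already carrying the pre-operator label:
kubectl get nodes -l kubernetes.io/os=linux -o name | wc -l
kubectl get nodes -l kubernetes.io/os=linux,projectcalico.org/operator-node-migration=pre-operator -o name | wc -l
If the two counts are equal, every Linux node is being skipped, which lines up with the "require 1 and have 0" error.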
Also, it would be nice to be able to disable autoscaling for some use cases; if that's OK I may try to contribute.
Also, it would be nice to be able to disable autoscaling for some use cases; if that's OK I may try to contribute.
I think this is valid - would you mind opening a separate issue to discuss the design and use-cases for that? You can do so here: https://github.com/tigera/operator/issues/new
all nodes were patched by the operator with the label projectcalico.org/operator-node-migration=pre-operator, and here we skip those nodes. So when I remove that label from any node, the operator can schedule typha and everything goes fine. Am I missing something?
Hm, this problem sounds familiar to me, and I am struggling to remember where I saw it before. I think the Typha autoscaling log might be a red herring here: it's complaining about a lack of nodes, but the Typha autoscaler shouldn't stop the migration logic from progressing.
{"level":"info","ts":1664881519.7317343,"logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 61 replicas, currently at 60","Request.Namespace":"","Request.Name":"default"}
It looks like the operator is waiting for the kube-system calico-node DaemonSet to be ready. Does it ever progress past this? If not, could you check the status of the kube-system daemonset to see why it is not fully ready?
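For example (a sketch of what to look at, assuming the manifest install uses the usual k8s-app=calico-node label):
kubectl get daemonset calico-node -n kube-system
kubectl describe daemonset calico-node -n kube-system
kubectl get pods -n kube-system -l k8s-app=calico-node -o wide | grep -v Running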
Hello, I had the same issue when migrating to the Tigera operator (v1.30.4) installation. Same error, and after removing the label from one of the nodes it worked and the migration was successful. I migrated several clusters with fewer than 20 nodes and didn't see any issues there. When migrating a bigger cluster (93 nodes) it timed out after exactly 10 minutes and didn't retry anymore, so I am guessing it doesn't wait long enough for the DaemonSet.
{"level":"info","ts":"2023-08-08T16:34:13Z","logger":"controller_installation","msg":"Patch NodeSelector with: [{\"op\":\"add\",\"path\":\"/spec/template/spec/nodeSelector/projectcalico.org~1operator-node-migration\",\"value\":\"pre-operator\"}]","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2023-08-08T16:34:13Z","logger":"controller_installation","msg":"waiting for observed generation (5) to match object generation (6)","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2023-08-08T16:34:18Z","logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 92 replicas, currently at 91","Request.Namespace":"","Request.Name":"default"}
....
....
{"level":"error","ts":"2023-08-08T16:44:13Z","logger":"controller_installation","msg":"error migrating resources to calico-system","Request.Namespace":"","Request.Name":"default","reason":"ResourceMigrationError","error":"the kube-system node DaemonSet is not ready with the updated nodeSelector: timed out waiting for the condition","stacktrace":"github.com/tigera/operator/pkg/controller/status.(*statusManager).SetDegraded\n\t/go/src/github.com/tigera/operator/pkg/controller/status/status.go:406\ngithub.com/tigera/operator/pkg/controller/installation.(*ReconcileInstallation).Reconcile\n\t/go/src/github.com/tigera/operator/pkg/controller/installation/core_controller.go:1436\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235"}
Hello,
Same issue on our 24-node lab cluster with Calico 3.26.3 / Tigera operator v1.30.7. Workaround: replace the label projectcalico.org/operator-node-migration=pre-operator with projectcalico.org/operator-node-migration=migrated on a single node to unblock the migration.
Note: we also ran into 2 other issues
- Typha considered there were 0 nodes to deploy on (previous install in kube-system had 3 replicas), then deployed correctly (1 replica, then 3 once the migration was over)
- csi-node-driver stuck at ContainerCreating status (why is it deploying this daemonset, we did not have it before...). The calico-cni-plugin ServiceAccount in kube-system was missing from the calico-cni-plugin clusterrolebinding, which we edited like below:
subjects:
- kind: ServiceAccount
  name: calico-cni-plugin
  namespace: calico-system
- kind: ServiceAccount
  name: calico-cni-plugin
  namespace: kube-system
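For what it's worth, a non-interactive sketch of the same edit (appends the kube-system ServiceAccount as an additional subject to the binding):
kubectl patch clusterrolebinding calico-cni-plugin --type=json \
  -p='[{"op":"add","path":"/subjects/-","value":{"kind":"ServiceAccount","name":"calico-cni-plugin","namespace":"kube-system"}}]'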
We are hitting the same problem, and it is very easy to reproduce by triggering a migration on a three-node cluster after simulating a worker node problem (i.e. systemctl stop kubelet). The node goes NotReady and, once the pod eviction threshold is met, the migration starts. Even after fixing the worker node, the migration stays stuck. The solution detailed in the previous comment works well. Of course this is very serious, because the migration leaves the cluster effectively busted.
@caseydavenport ptal
I think I am still waiting on the answer to this question:
It looks like the operator is waiting for the kube-system calico-node DaemonSet to be ready. Does it ever progress past this? If not, could you check the status of the kube-system daemonset to see why it is not fully ready?
At least from the original report; the root cause might be different for different folks.
The way this is designed to work, the migration will wait until the existing kube-system DaemonSet has been updated before progressing. In the original report, something was preventing the kube-system DaemonSet from being marked ready and I suspect that is the root cause.
@caseydavenport we confirmed in our test that the daemonset was healthy, but we are willing to test again and check everything that you'd like us to check.
@caseydavenport we believe that https://github.com/tigera/operator/pull/3156 should resolve this issue.
Going to close this as likely fixed by https://github.com/tigera/operator/pull/3156 for now, but can re-open if this is still apparent with that fix.