
Calico Operator Installation got stuck while migrating from manifest installation

Open oleksii-boiko-ua opened this issue 2 years ago • 10 comments

We had a Calico installation without the operator, at version v3.18.6, and are now trying to move to the operator-based installation. We did it on 5 clusters more or less smoothly (operator version v1.28.0), then moved those 5 to operator version v1.28.1, and after that we started migrating the rest of the clusters. While doing that we hit an issue where the installation got stuck.

Expected Behavior

Calico resources are migrated from the kube-system namespace used by the Calico manifests to the new calico-system namespace.

Current Behavior

Typha failed to scale, and the operator failed to move the calico-node pods to the calico-system namespace. The good thing is that the calico-node pods are still running in the kube-system namespace.

Possible Solution

  • Try the previous operator version
  • Change the labels on the current nodes from projectcalico.org/operator-node-migration=pre-operator to projectcalico.org/operator-node-migration=migrated, since the new calico-node DaemonSet has the latter in its node selector (see the label commands sketched below)
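
A minimal sketch of the label workaround (node names are placeholders; check the current label values on your own cluster first):

# show which migration label each node currently carries
kubectl get nodes -L projectcalico.org/operator-node-migration

# option A: remove the label from one node so Typha has somewhere to schedule
kubectl label node <node-name> projectcalico.org/operator-node-migration-

# option B: flip the label to the value the new calico-node DaemonSet selects on
kubectl label node <node-name> projectcalico.org/operator-node-migration=migrated --overwrite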

Steps to Reproduce (for bugs)

  1. Install the Tigera Calico operator and custom resource definitions.
    helm upgrade --install calico projectcalico/tigera-operator --version v3.24.1 -f values.yaml --namespace tigera-operator

values.yaml


 tigera-operator:
   installation:
     enabled: true
     kubernetesProvider: ""
     calicoNetwork:
       ipPools:
       - blockSize: 26
         cidr: 192.168.0.0/16
         natOutgoing: "Enabled"
         nodeSelector: all()
       nodeAddressAutodetectionV4:
         kubernetes: NodeInternalIP
     nodeMetricsPort: 9091
     registry: registry/docker.io/

   apiServer:
     enabled: false

   resources:
     limits:
       cpu: 1
       memory: "2048Mi"
     requests:
       cpu: 200m
       memory: "1024Mi"

   # Configuration for the tigera operator
   tigeraOperator:
     image: tigera/operator
     version: v1.28.1
     registry: quay.io
   calicoctl:
     image: registry/docker.io/calico/ctl
     tag: v3.24.1
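
Before moving to the next step, it can help to confirm what the chart created (a quick sanity check; the Installation resource named default is the one the operator logs below reconcile):

kubectl get pods -n tigera-operator
kubectl get installation default -o yaml
kubectl get tigerastatus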

  2. Monitor the migration status with the following command:
kubectl describe tigerastatus calico
Name:         calico
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  operator.tigera.io/v1
Kind:         TigeraStatus
Metadata:
  Creation Timestamp:  2022-10-04T11:05:11Z
  Generation:          1
  Managed Fields:
    API Version:  operator.tigera.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
    Manager:      operator
    Operation:    Update
    Time:         2022-10-04T11:05:11Z
    API Version:  operator.tigera.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:conditions:
    Manager:         operator
    Operation:       Update
    Subresource:     status
    Time:            2022-10-04T11:05:16Z
  Resource Version:  191417527
  UID:               14524d8a-d93a-4363-890e-fdfa779dac4e
Spec:
Status:
  Conditions:
    Last Transition Time:  2022-10-04T11:05:21Z
    Message:               Failed to scale typha - Error: not enough linux nodes to schedule typha pods on, require 1 and have 0
    Observed Generation:   2
    Reason:                ResourceScalingError
    Status:                True
    Type:                  Degraded
Events:                    <none>
  3. However, some Typha pods are created:
kubectl get pods -n calico-system
NAME                                      READY   STATUS    RESTARTS   AGE
calico-kube-controllers-c9dd49845-npdwx   1/1     Running   0          101m
calico-typha-7f8477c5f6-pqzm9             1/1     Running   0          101m
csi-node-driver-2c4v5                     2/2     Running   0          101m
csi-node-driver-2qxpt                     0/2     Pending   0          101m
csi-node-driver-4gdvv                     0/2     Pending   0          100m
csi-node-driver-4qpvf                     2/2     Running   0          101m
csi-node-driver-5r557                     2/2     Running   0          101m
csi-node-driver-5x8bt                     0/2     Pending   0          101m
csi-node-driver-6mp27                     2/2     Running   0          101m
csi-node-driver-7nd5d                     0/2     Pending   0          101m
csi-node-driver-7pmrs                     0/2     Pending   0          94m
csi-node-driver-8nkgg                     2/2     Running   0          101m
  4. Calico operator log:
2022/10/04 11:05:03 [INFO] Version: v1.28.1
2022/10/04 11:05:03 [INFO] Go Version: go1.17.9b7
2022/10/04 11:05:03 [INFO] Go OS/Arch: linux/amd64
I1004 11:05:04.705332       1 request.go:665] Waited for 1.042182016s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/batch/v1?timeout=32s
2022/10/04 11:05:05 [INFO] Active operator: proceeding
{"level":"info","ts":1664881506.7841918,"logger":"setup","msg":"Checking type of cluster","provider":""}
{"level":"info","ts":1664881506.7851846,"logger":"setup","msg":"Checking if PodSecurityPolicies are supported by the cluster","supported":true}
{"level":"info","ts":1664881506.7869463,"logger":"setup","msg":"Checking if TSEE controllers are required","required":false}
{"level":"info","ts":1664881506.8912253,"logger":"setup","msg":"starting manager"}
{"level":"info","ts":1664881506.8912084,"logger":"typha_autoscaler","msg":"Starting typha autoscaler","syncPeriod":10}
I1004 11:05:06.891301       1 leaderelection.go:248] attempting to acquire leader lease tigera-operator/operator-lock...
I1004 11:05:06.913451       1 leaderelection.go:258] successfully acquired lease tigera-operator/operator-lock
{"level":"info","ts":1664881506.9917126,"logger":"controller.apiserver-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.9917624,"logger":"controller.apiserver-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.991768,"logger":"controller.apiserver-controller","msg":"Starting EventSource","source":"kind source: /V1, Kind=ConfigMap"}
{"level":"info","ts":1664881506.9917734,"logger":"controller.apiserver-controller","msg":"Starting EventSource","source":"kind source: /V1, Kind=Secret"}
{"level":"info","ts":1664881506.991778,"logger":"controller.apiserver-controller","msg":"Starting EventSource","source":"kind source: /V1, Kind=Secret"}
{"level":"info","ts":1664881506.991782,"logger":"controller.apiserver-controller","msg":"Starting EventSource","source":"kind source: /V1, Kind=Secret"}
{"level":"info","ts":1664881506.9917858,"logger":"controller.apiserver-controller","msg":"Starting EventSource","source":"kind source: /V1, Kind=Secret"}
{"level":"info","ts":1664881506.9917898,"logger":"controller.apiserver-controller","msg":"Starting EventSource","source":"kind source: /V1, Kind=Secret"}
{"level":"info","ts":1664881506.991796,"logger":"controller.apiserver-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.9918036,"logger":"controller.apiserver-controller","msg":"Starting Controller"}
{"level":"info","ts":1664881506.9918213,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.9918616,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.991868,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /V1, Kind=Secret"}
{"level":"info","ts":1664881506.9918728,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /V1, Kind=ConfigMap"}
{"level":"info","ts":1664881506.9918792,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /V1, Kind=ConfigMap"}
{"level":"info","ts":1664881506.9918864,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /V1, Kind=ConfigMap"}
{"level":"info","ts":1664881506.9918902,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /V1, Kind=ConfigMap"}
{"level":"info","ts":1664881506.991894,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /V1, Kind=ConfigMap"}
{"level":"info","ts":1664881506.9918995,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.9919055,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.991912,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.9919183,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.9919267,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.9919329,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.991939,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.991947,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.991954,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.9919627,"logger":"controller.tigera-installation-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1664881506.9919689,"logger":"controller.tigera-installation-controller","msg":"Starting Controller"}
{"level":"info","ts":1664881507.447693,"logger":"controller.apiserver-controller","msg":"Starting workers","worker count":1}
{"level":"info","ts":1664881507.4478097,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1664881507.4478521,"logger":"controller_apiserver","msg":"APIServer config not found","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1664881507.4488175,"logger":"controller.tigera-installation-controller","msg":"Starting workers","worker count":1}
{"level":"info","ts":1664881507.9692104,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1664881507.9692528,"logger":"controller_apiserver","msg":"APIServer config not found","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1664881507.969508,"logger":"windows_upgrader","msg":"Starting main loop"}
{"level":"info","ts":1664881507.977956,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1664881507.9779913,"logger":"controller_apiserver","msg":"APIServer config not found","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1664881508.3646464,"logger":"controller_installation","msg":"adding active configmap"}
{"level":"info","ts":1664881508.6335487,"logger":"KubeAPIWarningLogger","msg":"policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+"}
{"level":"info","ts":1664881509.5906348,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"tigera-operator","Request.Name":"tigera-ca-private"}
{"level":"info","ts":1664881509.590689,"logger":"controller_apiserver","msg":"APIServer config not found","Request.Namespace":"tigera-operator","Request.Name":"tigera-ca-private"}
{"level":"info","ts":1664881511.8922532,"logger":"status_manager.calico","msg":"Status manager is not ready to report component statuses."}
{"level":"info","ts":1664881514.5702138,"logger":"controller_installation","msg":"Patch NodeSelector with: [{\"op\":\"add\",\"path\":\"/spec/template/spec/nodeSelector/projectcalico.org~1operator-node-migration\",\"value
:\"pre-operator\"}]","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1664881514.7193437,"logger":"KubeAPIWarningLogger","msg":"spec.template.metadata.annotations[scheduler.alpha.kubernetes.io/critical-pod]: non-functional in v1.16+; use the \"priorityClassName\" field i
nstead"}
{"level":"info","ts":1664881514.7232409,"logger":"controller_installation","msg":"waiting for observed generation (2) to match object generation (3)","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1664881516.8920703,"logger":"status_manager.calico","msg":"Status manager is not ready to report component statuses."}
{"level":"error","ts":1664881516.8921304,"logger":"typha_autoscaler","msg":"Failed to autoscale typha","error":"not enough linux nodes to schedule typha pods on, require 1 and have 0"}
{"level":"info","ts":1664881519.7317343,"logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 61 replicas, currently at 60","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1664881521.8914902,"logger":"status_manager.calico","msg":"Status manager is not ready to report component statuses."}
{"level":"info","ts":1664881524.7317426,"logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 61 replicas, currently at 60","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1664881526.8918767,"logger":"status_manager.calico","msg":"Status manager is not ready to report component statuses."}
{"level":"error","ts":1664881526.8919308,"logger":"typha_autoscaler","msg":"Failed to autoscale typha","error":"not enough linux nodes to schedule typha pods on, require 1 and have 0"}
{"level":"info","ts":1664881529.7316747,"logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 61 replicas, currently at 60","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1664881531.8915548,"logger":"status_manager.calico","msg":"Status manager is not ready to report component statuses."}
{"level":"info","ts":1664881534.732588,"logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 61 replicas, currently at 60","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1664881536.8914149,"logger":"status_manager.calico","msg":"Status manager is not ready to report component statuses."}
kubectl get nodes -o name | wc -l
      61
kubectl get daemonsets

NAME              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                               AGE
calico-node       0         0         0       0            0           kubernetes.io/os=linux,projectcalico.org/operator-node-migration=migrated   112m
csi-node-driver   52        52        29      52           29          kubernetes.io/os=linux                                                      112m
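
Since the operator log above is waiting on kube-system/calico-node, the state of the original kube-system DaemonSet is worth capturing as well (a diagnostic sketch, assuming the manifest install lives in kube-system as described):

kubectl get daemonset calico-node -n kube-system -o wide
kubectl describe daemonset calico-node -n kube-system
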
kubectl describe deployment -n calico-system calico-typha


Name:                   calico-typha
Namespace:              calico-system
CreationTimestamp:      Tue, 04 Oct 2022 14:05:08 +0300
Labels:                 app.kubernetes.io/name=calico-typha
                        k8s-app=calico-typha
Annotations:            deployment.kubernetes.io/revision: 1
Selector:               k8s-app=calico-typha
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  1 max unavailable, 25% max surge
Pod Template:
  Labels:           app.kubernetes.io/name=calico-typha
                    k8s-app=calico-typha
  Annotations:      hash.operator.tigera.io/tigera-ca-private: 7d014593994e80bde60e8998e3e165ce9c035ef0
                    hash.operator.tigera.io/typha-certs: ca40050371046fce3a23d83b0fa418ba8a9bc9b0
  Service Account:  calico-typha
  Containers:
   calico-typha:
    Image:      docker.io/calico/typha:v3.24.1
    Port:       5473/TCP
    Host Port:  5473/TCP
    Liveness:   http-get http://localhost:9098/liveness delay=0s timeout=10s period=10s #success=1 #failure=3
    Readiness:  http-get http://localhost:9098/readiness delay=0s timeout=10s period=10s #success=1 #failure=3
    Environment:
      TYPHA_LOGSEVERITYSCREEN:          info
      TYPHA_LOGFILEPATH:                none
      TYPHA_LOGSEVERITYSYS:             none
      TYPHA_CONNECTIONREBALANCINGMODE:  kubernetes
      TYPHA_DATASTORETYPE:              kubernetes
      TYPHA_HEALTHENABLED:              true
      TYPHA_HEALTHPORT:                 9098
      TYPHA_K8SNAMESPACE:               calico-system
      TYPHA_CAFILE:                     /etc/pki/tls/certs/tigera-ca-bundle.crt
      TYPHA_SERVERCERTFILE:             /typha-certs/tls.crt
      TYPHA_SERVERKEYFILE:              /typha-certs/tls.key
      TYPHA_FIPSMODEENABLED:            false
      TYPHA_CLIENTCN:                   typha-client
      KUBERNETES_SERVICE_HOST:          10.96.0.1
      KUBERNETES_SERVICE_PORT:          443
    Mounts:
      /etc/pki/tls/certs/ from tigera-ca-bundle (ro)
      /typha-certs from typha-certs (ro)
  Volumes:
   tigera-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      tigera-ca-bundle
    Optional:  false
   typha-certs:
    Type:               Secret (a volume populated by a Secret)
    SecretName:         typha-certs
    Optional:           false
  Priority Class Name:  system-cluster-critical
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   calico-typha-7f8477c5f6 (1/1 replicas created)
Events:          <none>
kubectl get deployment -n calico-system calico-typha -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2022-10-04T11:05:08Z"
  generation: 1
  labels:
    app.kubernetes.io/name: calico-typha
    k8s-app: calico-typha
  name: calico-typha
  namespace: calico-system
  ownerReferences:
  - apiVersion: operator.tigera.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: Installation
    name: default
    uid: d1c46db4-603a-4326-a73b-67c291222c38
  resourceVersion: "191257535"
  uid: 031a69ed-0087-4bc5-ae90-274f8b6ded6b
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      k8s-app: calico-typha
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:
        hash.operator.tigera.io/tigera-ca-private: 7d014593994e80bde60e8998e3e165ce9c035ef0
        hash.operator.tigera.io/typha-certs: ca40050371046fce3a23d83b0fa418ba8a9bc9b0
      creationTimestamp: null
      labels:
        app.kubernetes.io/name: calico-typha
        k8s-app: calico-typha
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: k8s-app
                  operator: In
                  values:
                  - calico-typha
              topologyKey: topology.kubernetes.io/zone
            weight: 1
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                k8s-app: calico-typha
            namespaces:
            - kube-system
            topologyKey: kubernetes.io/hostname
      containers:
      - env:
        - name: TYPHA_LOGSEVERITYSCREEN
          value: info
        - name: TYPHA_LOGFILEPATH
          value: none
        - name: TYPHA_LOGSEVERITYSYS
          value: none
        - name: TYPHA_CONNECTIONREBALANCINGMODE
          value: kubernetes
        - name: TYPHA_DATASTORETYPE
          value: kubernetes
        - name: TYPHA_HEALTHENABLED
          value: "true"
        - name: TYPHA_HEALTHPORT
          value: "9098"
        - name: TYPHA_K8SNAMESPACE
          value: calico-system
        - name: TYPHA_CAFILE
          value: /etc/pki/tls/certs/tigera-ca-bundle.crt
        - name: TYPHA_SERVERCERTFILE
          value: /typha-certs/tls.crt
        - name: TYPHA_SERVERKEYFILE
          value: /typha-certs/tls.key
        - name: TYPHA_FIPSMODEENABLED
          value: "false"
        - name: TYPHA_CLIENTCN
          value: typha-client
        - name: KUBERNETES_SERVICE_HOST
          value: 10.96.0.1
        - name: KUBERNETES_SERVICE_PORT
          value: "443"
        image: docker.io/calico/typha:v3.24.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            host: localhost
            path: /liveness
            port: 9098
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        name: calico-typha
        ports:
        - containerPort: 5473
          hostPort: 5473
          name: calico-typha
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            host: localhost
            path: /readiness
            port: 9098
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/pki/tls/certs/
          name: tigera-ca-bundle
          readOnly: true
        - mountPath: /typha-certs
          name: typha-certs
          readOnly: true
      dnsPolicy: ClusterFirst
      hostNetwork: true
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-cluster-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: calico-typha
      serviceAccountName: calico-typha
      terminationGracePeriodSeconds: 0
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoSchedule
        operator: Exists
      - effect: NoExecute
        operator: Exists
      volumes:
      - configMap:
          defaultMode: 420
          name: tigera-ca-bundle
        name: tigera-ca-bundle
      - name: typha-certs
        secret:
          defaultMode: 420
          secretName: typha-certs
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2022-10-04T11:05:09Z"
    lastUpdateTime: "2022-10-04T11:05:09Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2022-10-04T11:05:09Z"
    lastUpdateTime: "2022-10-04T11:05:16Z"
    message: ReplicaSet "calico-typha-7f8477c5f6" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
kubectl get pods -o wide -n calico-system
NAME                                      READY   STATUS    RESTARTS   AGE     IP                NODE                                          NOMINATED NODE   READINESS GATES
calico-kube-controllers-c9dd49845-npdwx   1/1     Running   0          118m    192.168.56.232    ip-10-229-91-172.aws-region.compute.internal   <none>           <none>
calico-typha-7f8477c5f6-pqzm9             1/1     Running   0          118m    10.229.91.172     ip-10-229-91-172.aws-region.compute.internal   <none>           <none>
csi-node-driver-2c4v5                     2/2     Running   0          118m    192.168.153.213   ip-10-229-88-195.aws-region.compute.internal   <none>           <none>
csi-node-driver-2qxpt                     0/2     Pending   0          118m    <none>            <none>                                        <none>           <none>
csi-node-driver-4gdvv                     0/2     Pending   0          117m    <none>            <none>                                        <none>           <none>
csi-node-driver-4qpvf                     2/2     Running   0          118m    192.168.16.40     ip-10-229-93-106.aws-region.compute.internal   <none>           <none>
csi-node-driver-5r557                     2/2     Running   0          118m    192.168.203.34    ip-10-229-90-165.aws-region.compute.internal   <none>           <none>
csi-node-driver-5x8bt                     0/2     Pending   0          118m    <none>            <none>                                        <none>           <none>
csi-node-driver-6mp27                     2/2     Running   0          118m    192.168.147.111   ip-10-229-92-103.aws-region.compute.internal   <none>           <none>
csi-node-driver-7nd5d                     0/2     Pending   0          118m    <none>            <none>                                        <none>           <none>
csi-node-driver-7pmrs                     0/2     Pending   0          111m    <none>            <none>                                        <none>           <none>
csi-node-driver-8nkgg                     2/2     Running   0          118m    192.168.149.185   ip-10-229-93-0.aws-region.compute.internal     <none>           <none>
csi-node-driver-8xcpk                     2/2     Running   0          43s     192.168.28.52     ip-10-229-95-73.aws-region.compute.internal    <none>           <none>
csi-node-driver-98jhf                     0/2     Pending   0          109m    <none>            <none>                                        <none>           <none>
csi-node-driver-9vbzr                     2/2     Running   0          118m    192.168.189.40    ip-10-229-85-111.aws-region.compute.internal   <none>           <none>
csi-node-driver-b66v2                     0/2     Pending   0          111m    <none>            <none>                                        <none>           <none>
csi-node-driver-b94p5                     2/2     Running   0          118m    192.168.98.146    ip-10-229-86-234.aws-region.compute.internal   <none>           <none>
csi-node-driver-bw7cb                     2/2     Running   0          23m     192.168.13.87     ip-10-229-85-207.aws-region.compute.internal   <none>           <none>
csi-node-driver-cj2g9                     0/2     Pending   0          107m    <none>            <none>                                        <none>           <none>
csi-node-driver-ckqf5                     0/2     Pending   0          118m    <none>            <none>                                        <none>           <none>
csi-node-driver-dpkdt                     0/2     Pending   0          107m    <none>            <none>                                        <none>           <none>
csi-node-driver-f686r                     2/2     Running   0          118m    192.168.78.2      ip-10-229-88-121.aws-region.compute.internal   <none>           <none>
csi-node-driver-fl7wj                     0/2     Pending   0          118m    <none>            <none>                                        <none>           <none>
csi-node-driver-fxmxg                     0/2     Pending   0          118m    <none>            <none>                                        <none>           <none>
csi-node-driver-gcdf6                     0/2     Pending   0          110m    <none>            <none>                                        <none>           <none>
csi-node-driver-gfn6z                     0/2     Pending   0          109m    <none>            <none>                                        <none>           <none>
csi-node-driver-hgkt6                     2/2     Running   0          118m    192.168.194.201   ip-10-229-84-198.aws-region.compute.internal   <none>           <none>
csi-node-driver-jc592                     2/2     Running   0          118m    192.168.21.200    ip-10-229-95-29.aws-region.compute.internal    <none>           <none>
csi-node-driver-jhrjf                     2/2     Running   0          118m    192.168.78.178    ip-10-229-92-25.aws-region.compute.internal    <none>           <none>
csi-node-driver-jnn7h                     2/2     Running   0          118m    192.168.123.172   ip-10-229-91-8.aws-region.compute.internal     <none>           <none>
csi-node-driver-k5rn4                     0/2     Pending   0          118m    <none>            <none>                                        <none>           <none>
csi-node-driver-kmttn                     0/2     Pending   0          115m    <none>            <none>                                        <none>           <none>
csi-node-driver-ktmcx                     2/2     Running   0          118m    192.168.168.99    ip-10-229-94-162.aws-region.compute.internal   <none>           <none>
csi-node-driver-ltc46                     0/2     Pending   0          103m    <none>            <none>                                        <none>           <none>
csi-node-driver-m2ksc                     2/2     Running   0          9m43s   192.168.199.183   ip-10-229-93-48.aws-region.compute.internal    <none>           <none>
csi-node-driver-nfx4v                     0/2     Pending   0          115m    <none>            <none>                                        <none>           <none>
csi-node-driver-nvkhz                     2/2     Running   0          118m    192.168.59.241    ip-10-229-93-19.aws-region.compute.internal    <none>           <none>
csi-node-driver-nxl5c                     2/2     Running   0          118m    192.168.47.187    ip-10-229-88-48.aws-region.compute.internal    <none>           <none>
csi-node-driver-p4vps                     2/2     Running   0          118m    192.168.28.133    ip-10-229-86-157.aws-region.compute.internal   <none>           <none>
csi-node-driver-p9s7h                     0/2     Pending   0          109m    <none>            <none>                                        <none>           <none>
csi-node-driver-qhhlm                     2/2     Running   0          118m    192.168.46.4      ip-10-229-84-251.aws-region.compute.internal   <none>           <none>
csi-node-driver-qq8c5                     2/2     Running   0          118m    192.168.209.126   ip-10-229-94-37.aws-region.compute.internal    <none>           <none>
csi-node-driver-rj792                     0/2     Pending   0          106m    <none>            <none>                                        <none>           <none>
csi-node-driver-rmv8w                     2/2     Running   0          118m    192.168.56.233    ip-10-229-91-172.aws-region.compute.internal   <none>           <none>
csi-node-driver-rwngt                     2/2     Running   0          118m    192.168.155.131   ip-10-229-88-148.aws-region.compute.internal   <none>           <none>
csi-node-driver-sslsc                     2/2     Running   0          118m    192.168.92.200    ip-10-229-89-1.aws-region.compute.internal     <none>           <none>
csi-node-driver-tsb9t                     2/2     Running   0          118m    192.168.35.54     ip-10-229-90-60.aws-region.compute.internal    <none>           <none>
csi-node-driver-vf6sb                     0/2     Pending   0          104m    <none>            <none>                                        <none>           <none>
csi-node-driver-wfqxv                     0/2     Pending   0          118m    <none>            <none>                                        <none>           <none>
csi-node-driver-x72d6                     2/2     Running   0          118m    192.168.247.37    ip-10-229-86-223.aws-region.compute.internal   <none>           <none>
csi-node-driver-xgxqr                     2/2     Running   0          118m    192.168.226.56    ip-10-229-89-193.aws-region.compute.internal   <none>           <none>
csi-node-driver-xhltj                     0/2     Pending   0          102m    <none>            <none>                                        <none>           <none>
csi-node-driver-zhn5j                     2/2     Running   0          118m    192.168.113.80    ip-10-229-89-163.aws-region.compute.internal   <none>           <none>
csi-node-driver-zz92k                     2/2     Running   0          11m     192.168.49.41     ip-10-229-85-235.aws-region.compute.internal   <none>           <none>
kubectl describe daemonsets.apps -n calico-system


Name:           calico-node
Selector:       k8s-app=calico-node
Node-Selector:  kubernetes.io/os=linux,projectcalico.org/operator-node-migration=migrated
Labels:         <none>
Annotations:    deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 0
Current Number of Nodes Scheduled: 0
Number of Nodes Scheduled with Up-to-date Pods: 0
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status:  0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app.kubernetes.io/name=calico-node
                    k8s-app=calico-node
  Annotations:      hash.operator.tigera.io/cni-config: 3828f6b2f7f9a34e88a4be8ae644087d12a85386
                    hash.operator.tigera.io/tigera-ca-private: 7d014593994e80bde60e8998e3e165ce9c035ef0
                    prometheus.io/port: 9091
                    prometheus.io/scrape: true
  Service Account:  calico-node
  Init Containers:
   flexvol-driver:
    Image:        /docker.io/calico/pod2daemon-flexvol:v3.24.1
    Port:         <none>
    Host Port:    <none>
    Environment:  <none>
    Mounts:
      /host/driver from flexvol-driver-host (rw)
   install-cni:
    Image:      /docker.io/calico/cni:v3.24.1
    Port:       <none>
    Host Port:  <none>
    Command:
      /opt/cni/bin/install
    Environment:
      CNI_CONF_NAME:            10-calico.conflist
      SLEEP:                    false
      CNI_NET_DIR:              /etc/cni/net.d
      CNI_NETWORK_CONFIG:       <set to the key 'config' of config map 'cni-config'>  Optional: false
      KUBERNETES_SERVICE_HOST:  10.96.0.1
      KUBERNETES_SERVICE_PORT:  443
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
  Containers:
   calico-node:
    Image:      /docker.io/calico/node:v3.24.1
    Port:       <none>
    Host Port:  <none>
    Requests:
      cpu:      250m
    Liveness:   http-get http://localhost:9099/liveness delay=0s timeout=10s period=10s #success=1 #failure=3
    Readiness:  exec [/bin/calico-node -bird-ready -felix-ready] delay=0s timeout=5s period=10s #success=1 #failure=3
    Environment:
      DATASTORE_TYPE:                      kubernetes
      WAIT_FOR_DATASTORE:                  true
      CLUSTER_TYPE:                        k8s,operator,bgp
      CALICO_DISABLE_FILE_LOGGING:         false
      FELIX_DEFAULTENDPOINTTOHOSTACTION:   ACCEPT
      FELIX_HEALTHENABLED:                 true
      FELIX_HEALTHPORT:                    9099
      NODENAME:                             (v1:spec.nodeName)
      NAMESPACE:                            (v1:metadata.namespace)
      FELIX_TYPHAK8SNAMESPACE:             calico-system
      FELIX_TYPHAK8SSERVICENAME:           calico-typha
      FELIX_TYPHACAFILE:                   /etc/pki/tls/certs/tigera-ca-bundle.crt
      FELIX_TYPHACERTFILE:                 /node-certs/tls.crt
      FELIX_TYPHAKEYFILE:                  /node-certs/tls.key
      FIPS_MODE_ENABLED:                   false
      FELIX_TYPHACN:                       typha-server
      CALICO_MANAGE_CNI:                   true
      CALICO_IPV4POOL_CIDR:                192.168.0.0/16
      CALICO_IPV4POOL_IPIP:                Always
      CALICO_IPV4POOL_BLOCK_SIZE:          26
      CALICO_IPV4POOL_NODE_SELECTOR:       all()
      CALICO_IPV4POOL_DISABLE_BGP_EXPORT:  false
      FELIX_VXLANMTU:                      1440
      FELIX_WIREGUARDMTU:                  1440
      CALICO_NETWORKING_BACKEND:           bird
      FELIX_IPINIPMTU:                     1440
      IP:                                  autodetect
      IP_AUTODETECTION_METHOD:             kubernetes-internal-ip
      IP6:                                 none
      FELIX_IPV6SUPPORT:                   false
      FELIX_PROMETHEUSMETRICSENABLED:      true
      FELIX_PROMETHEUSMETRICSPORT:         9091
      KUBERNETES_SERVICE_HOST:             10.96.0.1
      KUBERNETES_SERVICE_PORT:             443
    Mounts:
      /etc/pki/tls/certs/ from tigera-ca-bundle (ro)
      /host/etc/cni/net.d from cni-net-dir (rw)
      /lib/modules from lib-modules (ro)
      /node-certs from node-certs (ro)
      /run/xtables.lock from xtables-lock (rw)
      /var/lib/calico from var-lib-calico (rw)
      /var/log/calico/cni from cni-log-dir (rw)
      /var/run/calico from var-run-calico (rw)
      /var/run/nodeagent from policysync (rw)
  Volumes:
   lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:
   xtables-lock:
    Type:          HostPath (bare host directory volume)
    Path:          /run/xtables.lock
    HostPathType:  FileOrCreate
   policysync:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/nodeagent
    HostPathType:  DirectoryOrCreate
   tigera-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      tigera-ca-bundle
    Optional:  false
   node-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  node-certs
    Optional:    false
   var-run-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/calico
    HostPathType:
   var-lib-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/calico
    HostPathType:
   cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:
   cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:
   cni-log-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/calico/cni
    HostPathType:
   flexvol-driver-host:
    Type:               HostPath (bare host directory volume)
    Path:               /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~udsnodeagent~uds
    HostPathType:       DirectoryOrCreate
  Priority Class Name:  system-node-critical
Events:                 <none>


Name:           csi-node-driver
Selector:       k8s-app=csi-node-driver
Node-Selector:  kubernetes.io/os=linux
Labels:         <none>
Annotations:    deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 52
Current Number of Nodes Scheduled: 52
Number of Nodes Scheduled with Up-to-date Pods: 52
Number of Nodes Scheduled with Available Pods: 28
Number of Nodes Misscheduled: 0
Pods Status:  28 Running / 24 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  app.kubernetes.io/name=csi-node-driver
           k8s-app=csi-node-driver
           name=csi-node-driver
  Containers:
   calico-csi:
    Image:      /docker.io/calico/csi:v3.24.1
    Port:       <none>
    Host Port:  <none>
    Args:
      --nodeid=$(KUBE_NODE_NAME)
      --loglevel=$(LOG_LEVEL)
    Environment:
      LOG_LEVEL:       warn
      KUBE_NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /csi from socket-dir (rw)
      /etc/calico from etccalico (rw)
      /var/lib/kubelet from kubelet-dir (rw)
      /var/run from varrun (rw)
   csi-node-driver-registrar:
    Image:      /docker.io/calico/node-driver-registrar:v3.24.1
    Port:       <none>
    Host Port:  <none>
    Args:
      --v=5
      --csi-address=$(ADDRESS)
      --kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
    Environment:
      ADDRESS:               /csi/csi.sock
      DRIVER_REG_SOCK_PATH:  /var/lib/kubelet/plugins/csi.tigera.io/csi.sock
      KUBE_NODE_NAME:         (v1:spec.nodeName)
    Mounts:
      /csi from socket-dir (rw)
      /registration from registration-dir (rw)
  Volumes:
   varrun:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run
    HostPathType:
   etccalico:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/calico
    HostPathType:
   kubelet-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet
    HostPathType:  Directory
   socket-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/plugins/csi.tigera.io
    HostPathType:  DirectoryOrCreate
   registration-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/plugins_registry
    HostPathType:  Directory
Events:
  Type    Reason            Age                  From                  Message
  ----    ------            ----                 ----                  -------
  Normal  SuccessfulCreate  26m (x94 over 121m)  daemonset-controller  (combined from similar events): Created pod: csi-node-driver-bw7cb
  Normal  SuccessfulCreate  14m                  daemonset-controller  Created pod: csi-node-driver-zz92k
  Normal  SuccessfulCreate  12m                  daemonset-controller  Created pod: csi-node-driver-d94k5
  Normal  SuccessfulCreate  12m                  daemonset-controller  Created pod: csi-node-driver-m2ksc
  Normal  SuccessfulCreate  9m10s                daemonset-controller  Created pod: csi-node-driver-v8w22
  Normal  SuccessfulCreate  3m8s                 daemonset-controller  Created pod: csi-node-driver-8xcpk

Similar issue: https://github.com/projectcalico/calico/issues/6407

Your Environment

  • Calico version
calicoctl version
Client Version:    v3.18.6
Git commit:        0f9952e1
Cluster Version:   v3.18.6
Cluster Type:      k8s,bgp,kubeadm,kubeadm,kdd,typha
  • Orchestrator version (e.g. kubernetes, mesos, rkt):
kubernetes v1.22.14
  • Operating System and version:
Ubuntu 20.04.4 LTS

oleksii-boiko-ua avatar Oct 04 '22 13:10 oleksii-boiko-ua

So you're seeing

Failed to scale typha - Error: not enough linux nodes to schedule typha pods on, require 1 and have 0

Why does the operator think there are no nodes?

lwr20 avatar Oct 04 '22 16:10 lwr20

After removing the label projectcalico.org/operator-node-migration=pre-operator with kubectl label node ip-10-229-92-202.eu-west-1.compute.internal projectcalico.org/operator-node-migration-, the migration started, but all the calico-node DaemonSet pods had issues connecting to Typha, so I ran

kubectl rollout restart deployment   calico-typha  -n calico-system

and now the migration has gone further

oleksii-boiko-ua avatar Oct 04 '22 16:10 oleksii-boiko-ua

I'm reading the code to understand how it works, but is there any doc explaining the process? For example, this step:

{"level":"info","ts":1664881514.5702138,"logger":"controller_installation","msg":"Patch NodeSelector with: [{\"op\":\"add\",\"path\":\"/spec/template/spec/nodeSelector/projectcalico.org~1operator-node-migration\",\"value
:\"pre-operator\"}]","Request.Namespace":"","Request.Name":"default"}

it patches nodes with this label but expects the DaemonSet to have the following selector:

kubectl get daemonsets

NAME              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                               AGE
calico-node       0         0         0       0            0           kubernetes.io/os=linux,projectcalico.org/operator-node-migration=migrated   112m

oleksii-boiko-ua avatar Oct 04 '22 16:10 oleksii-boiko-ua

@caseydavenport seems sub-optimal to me...

lwr20 avatar Oct 04 '22 16:10 lwr20

Is this in an environment that (say) reconciles labels?

lwr20 avatar Oct 04 '22 16:10 lwr20

@lwr20 we are using Argo CD, so I guess that might be what you are referring to with regard to label reconciliation?

justinwyer avatar Oct 04 '22 16:10 justinwyer

The migration has finished

❯ kubectl get  tigerastatus calico
NAME     AVAILABLE   PROGRESSING   DEGRADED   SINCE
calico   True        False         False      2m15s

but I can reproduce it, and without the manual actions above it gets stuck. I've tried it in several clusters. Let me try the previous operator version, as I had no issues with that one

oleksii-boiko-ua avatar Oct 04 '22 16:10 oleksii-boiko-ua

@tmjd knows a bit about both the operator and Argo CD

lwr20 avatar Oct 04 '22 17:10 lwr20

My theory why it gets stuck: here we check whether linuxNodes < expectedReplicas; in our case linuxNodes = 0, because all nodes were patched by the operator with the label projectcalico.org/operator-node-migration=pre-operator, and here we skip that kind of node. So when I remove that label from any node, the operator can schedule Typha and all goes fine. Am I missing something? Also, it would be nice to be able to disable autoscaling for some use cases; if that is OK, I may try to contribute.
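
A quick way to sanity-check this theory from the CLI, assuming the autoscaler only counts linux nodes that are not labelled pre-operator (my reading of the code linked above):

# count the nodes the typha autoscaler would consider schedulable
kubectl get nodes -l 'kubernetes.io/os=linux,projectcalico.org/operator-node-migration!=pre-operator' -o name | wc -l

If this prints 0 mid-migration, it matches the "require 1 and have 0" error.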

oleksii-boiko-ua avatar Oct 05 '22 13:10 oleksii-boiko-ua

Also, it would be nice to be able to disable autoscaling for some use cases; if that is OK, I may try to contribute.

I think this is valid - would you mind opening a separate issue to discuss the design and use-cases for that? You can do so here: https://github.com/tigera/operator/issues/new

all nodes were patched by the operator with the label projectcalico.org/operator-node-migration=pre-operator, and here we skip that kind of node. So when I remove that label from any node, the operator can schedule Typha and all goes fine. Am I missing something?

Hm, this problem sounds familiar to me and I am struggling to remember where I saw this before. I think the Typha auto-scaling log might be a red herring here. It's complaining about a lack of nodes, but the Typha autoscaler shouldn't stop the migration logic from progressing.

{"level":"info","ts":1664881519.7317343,"logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 61 replicas, currently at 60","Request.Namespace":"","Request.Name":"default"}

It looks like the operator is waiting for the kube-system calico/node to be ready. Does it ever progress past this? If not, could you check the status of the kube-system daemonset to see why it is not fully ready?

caseydavenport avatar Oct 10 '22 21:10 caseydavenport

Hello, I had the same issue when migrating to the Tigera operator (v1.30.4) installation. Same error, and after removing the label from one of the nodes it worked and the migration was successful. I had migrated other clusters with fewer than 20 nodes and didn't see any issues there. When migrating a bigger cluster (93 nodes), it timed out after exactly 10 minutes and didn't retry anymore, so I am guessing it doesn't wait long enough for the DaemonSet.

{"level":"info","ts":"2023-08-08T16:34:13Z","logger":"controller_installation","msg":"Patch NodeSelector with: [{\"op\":\"add\",\"path\":\"/spec/template/spec/nodeSelector/projectcalico.org~1operator-node-migration\",\"value\":\"pre-operator\"}]","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2023-08-08T16:34:13Z","logger":"controller_installation","msg":"waiting for observed generation (5) to match object generation (6)","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2023-08-08T16:34:18Z","logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 92 replicas, currently at 91","Request.Namespace":"","Request.Name":"default"}
....

....
{"level":"error","ts":"2023-08-08T16:44:13Z","logger":"controller_installation","msg":"error migrating resources to calico-system","Request.Namespace":"","Request.Name":"default","reason":"ResourceMigrationError","error":"the kube-system node DaemonSet is not ready with the updated nodeSelector: timed out waiting for the condition","stacktrace":"github.com/tigera/operator/pkg/controller/status.(*statusManager).SetDegraded\n\t/go/src/github.com/tigera/operator/pkg/controller/status/status.go:406\ngithub.com/tigera/operator/pkg/controller/installation.(*ReconcileInstallation).Reconcile\n\t/go/src/github.com/tigera/operator/pkg/controller/installation/core_controller.go:1436\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235"}

arteonprifti avatar Aug 08 '23 17:08 arteonprifti

Hello,

Same issue on our 24-node lab cluster with Calico 3.26.3 / Tigera operator v1.30.7. Workaround: change the label projectcalico.org/operator-node-migration=pre-operator to projectcalico.org/operator-node-migration=migrated on a single node to unlock the migration.

Note: we also ran into 2 other issues:

  • Typha considered there were 0 nodes to deploy to (the previous install in kube-system had 3 replicas), then deployed correctly (1 replica, then 3 once the migration was over)
  • csi-node-driver stuck in ContainerCreating status (why is it deploying this DaemonSet? we did not have it before...). The calico-cni-plugin ServiceAccount in kube-system was missing from the clusterrolebinding calico-cni-plugin, which we edited as below:
subjects:
- kind: ServiceAccount
  name: calico-cni-plugin
  namespace: calico-system
- kind: ServiceAccount
  name: calico-cni-plugin
  namespace: kube-system
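
One non-interactive way to apply that edit (a sketch; it assumes the binding is named calico-cni-plugin, as above):

kubectl patch clusterrolebinding calico-cni-plugin --type=json \
  -p='[{"op":"add","path":"/subjects/-","value":{"kind":"ServiceAccount","name":"calico-cni-plugin","namespace":"kube-system"}}]'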

penoux avatar Nov 03 '23 10:11 penoux

We are hitting the same problem, and it is very easy to reproduce by triggering a migration on a three-node cluster after simulating a worker node problem (i.e. systemctl stop kubelet). The node will go NotReady; once the pod eviction threshold is met, start the migration. Even after fixing the worker node, the migration will be stuck. The solution detailed in the previous comment works well. Of course this is very serious, because the migration leaves the cluster effectively busted.
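
Roughly, the reproduction looks like this on a lab cluster (a sketch of the steps above, not an exact recipe):

# on one worker node, simulate the failure
sudo systemctl stop kubelet
# wait for the node to go NotReady and for pod eviction, then trigger the operator migration;
# even after recovering the node, the migration stays stuck
sudo systemctl start kubelet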

rtheis avatar Jan 19 '24 20:01 rtheis

@caseydavenport ptal

rtheis avatar Jan 20 '24 11:01 rtheis

I think I am still waiting on the answer to this question:

It looks like the operator is waiting for the kube-system calico/node to be ready. Does it ever progress past this? If not, could you check the status of the kube-system daemonset to see why it is not fully ready?

At least from the original report; the root cause might be different for different folks.

The way this is designed to work, the migration waits until the existing kube-system DaemonSet has been updated before progressing. In the original report, something was preventing the kube-system DaemonSet from being marked ready, and I suspect that is the root cause.
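
A concise way to answer that question is to check whether the kube-system DaemonSet ever becomes fully updated and ready, for example:

kubectl get ds calico-node -n kube-system \
  -o jsonpath='desired={.status.desiredNumberScheduled} ready={.status.numberReady} updated={.status.updatedNumberScheduled}{"\n"}'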

caseydavenport avatar Jan 24 '24 23:01 caseydavenport

@caseydavenport we confirmed in our test that the daemonset was healthy, but we are willing to test again and check everything that you'd like us to check.

rtheis avatar Jan 25 '24 10:01 rtheis

@caseydavenport we believe that https://github.com/tigera/operator/pull/3156 should resolve this issue.

rtheis avatar Feb 15 '24 11:02 rtheis

Going to close this as likely fixed by https://github.com/tigera/operator/pull/3156 for now, but can re-open if this is still apparent with that fix.

caseydavenport avatar Feb 26 '24 19:02 caseydavenport