k8s-bigip-ctlr

CIS nodepoller stops working and halts the ARP process if the VtepMAC cannot be obtained for even one node in the cluster

Open kkfinkkfin opened this issue 3 years ago • 15 comments

Setup Details

CIS Version: 2.7.0 (Build: f5networks/k8s-bigip-ctlr:latest)
BIG-IP Version: BIG-IP 15.1.4 Build 0.0.47 Final
AS3 Version: none
Agent Mode: CCCL
Orchestration: K8S
Orchestration Version: Kubernetes v1.21.5
Pool Mode: Cluster
Additional Setup details:

Platform: CentOS Linux release 8.4.2105
Kernel: 4.18.0-305.19.1.el8_4.x86_64
CNI Plugin: flannel

Description

When the VtepMAC cannot be obtained for one node in the cluster, the CIS nodepoller stops working and halts the ARP update process.

Steps To Reproduce

  1. To simulate a node losing its VtepMAC, edit the node to remove the annotation flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"5a:de:e9:80:38:7e"}', which flannel inserts automatically: run kubectl edit node cluster1-w1, delete the annotation, and save. (A sketch of how the VtepMAC is read from this annotation follows these steps.)
  2. Scale a deployment that CIS watches and wait long enough to see whether the BIG-IP VE refreshes its configuration.
  3. View the CIS log. The healthy worker node's pod CIDR is 10.42.0.0/24; the abnormal worker node's pod CIDR is 10.42.1.0/24.
2022/01/07 01:25:00 [INFO] [INIT] Starting: Container Ingress Services - Version: 2.7.0, BuildInfo: azure-1697-0dd06d23f0761fd29b1f614a52ed4b3695653cdd 
2022/01/07 01:25:01 [INFO] ConfigWriter started: 0xc000369020 
2022/01/07 01:25:01 [INFO] Started config driver sub-process at pid: 17 
2022/01/07 01:25:01 [INFO] [INIT] Creating Agent for cccl 
2022/01/07 01:25:01 [INFO] [CCCL] Initializing CCCL Agent 
2022/01/07 01:25:01 [INFO] [CCCL] Removing Partition p1_AS3 
 
2022/01/07 01:25:02 [INFO] [CORE] NodePoller (0xc0002645a0) registering new listener: 0x17a6700 
2022/01/07 01:25:02 [INFO] [CORE] NodePoller (0xc0002645a0) registering new listener: 0x1757a40 
2022/01/07 01:25:02 [INFO] [CORE] NodePoller started: (0xc0002645a0) 
2022/01/07 01:25:02 [INFO] [CORE] Not watching Ingress resources. 
2022/01/07 01:25:02 [INFO] [CORE] Watching ConfigMap resources. 
2022/01/07 01:25:02 [INFO] [CORE] Handling ConfigMap resource events. 
2022/01/07 01:25:02 [INFO] [CORE] Not handling Ingress resource events. 
2022/01/07 01:25:02 [INFO] [CORE] Registered BigIP Metrics 
2022/01/07 01:25:03 [INFO] [2022-01-07 01:25:03,585 __main__ INFO] entering inotify loop to watch /tmp/k8s-bigip-ctlr.config334657854/config.json 
2022/01/07 01:25:05 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:25:06 [INFO] [2022-01-07 01:25:06,589 f5_cccl.resource.resource INFO] Deleting IcrArp: /Common/k8s-10.42.1.4 
2022/01/07 01:25:06 [INFO] [2022-01-07 01:25:06,731 f5_cccl.resource.resource INFO] Deleting IcrArp: /Common/k8s-10.42.1.5 
2022/01/07 01:25:07 [INFO] [2022-01-07 01:25:07,664 f5_cccl.resource.resource INFO] Creating ApiArp: /Common/k8s-10.42.1.5 
2022/01/07 01:25:07 [INFO] [2022-01-07 01:25:07,737 f5_cccl.resource.resource INFO] Creating ApiArp: /Common/k8s-10.42.1.4 
2022/01/07 01:29:06 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:29:06 [ERROR] [VxLAN] Vxlan manager could not get VtepMac for 10.42.1.4's node. 
2022/01/07 01:29:06 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:29:06 [ERROR] [VxLAN] Vxlan manager could not get VtepMac for 10.42.1.5's node. 
2022/01/07 01:29:06 [INFO] [2022-01-07 01:29:06,480 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_tea 
2022/01/07 01:29:07 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:29:07 [ERROR] [VxLAN] Vxlan manager could not get VtepMac for 10.42.1.5's node. 
2022/01/07 01:29:07 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:29:07 [INFO] [2022-01-07 01:29:07,723 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_cafevs1 
2022/01/07 01:29:07 [INFO] [2022-01-07 01:29:07,952 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_tea 
2022/01/07 01:29:08 [INFO] [2022-01-07 01:29:08,342 f5_cccl.resource.resource INFO] Deleting IcrNode: /p1/10.42.1.4%0 
2022/01/07 01:29:08 [INFO] [2022-01-07 01:29:08,420 f5_cccl.resource.resource INFO] Deleting IcrNode: /p1/10.42.1.5%0 
2022/01/07 01:29:08 [INFO] [2022-01-07 01:29:08,702 f5_cccl.resource.resource INFO] Creating ApiArp: /Common/k8s-10.42.0.247 
2022/01/07 01:29:08 [INFO] [2022-01-07 01:29:08,767 f5_cccl.resource.resource INFO] Creating ApiArp: /Common/k8s-10.42.0.246 
2022/01/07 01:29:08 [INFO] [2022-01-07 01:29:08,826 f5_cccl.resource.resource INFO] Deleting IcrArp: /Common/k8s-10.42.1.4 
2022/01/07 01:29:08 [INFO] [2022-01-07 01:29:08,894 f5_cccl.resource.resource INFO] Deleting IcrArp: /Common/k8s-10.42.1.5 
2022/01/07 01:29:57 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:29:57 [ERROR] [VxLAN] Vxlan manager could not get VtepMac for 10.42.1.6's node. 
2022/01/07 01:29:57 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:29:57 [ERROR] [VxLAN] Vxlan manager could not get VtepMac for 10.42.1.6's node. 
2022/01/07 01:29:57 [INFO] [2022-01-07 01:29:57,478 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_tea 
2022/01/07 01:29:58 [INFO] [2022-01-07 01:29:58,489 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_tea 
2022/01/07 01:29:59 [INFO] [2022-01-07 01:29:59,025 f5_cccl.resource.resource INFO] Deleting IcrNode: /p1/10.42.0.246%0 
2022/01/07 01:30:02 [INFO] [2022-01-07 01:30:02,954 f5_cccl.resource.resource INFO] Updating ApiFDBTunnel: /Common/flannel_vxlan 
2022/01/07 01:30:06 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:30:06 [ERROR] [VxLAN] Vxlan manager could not get VtepMac for 10.42.1.7's node. 
2022/01/07 01:30:06 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:30:06 [ERROR] [VxLAN] Vxlan manager could not get VtepMac for 10.42.1.7's node. 
2022/01/07 01:30:06 [INFO] [2022-01-07 01:30:06,774 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_cafevs1 
2022/01/07 01:30:07 [INFO] [2022-01-07 01:30:07,722 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_cafevs1 
2022/01/07 01:30:08 [INFO] [2022-01-07 01:30:08,130 f5_cccl.resource.resource INFO] Deleting IcrNode: /p1/10.42.0.247%0 
2022/01/07 01:31:36 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:31:37 [ERROR] [VxLAN] Vxlan manager could not get VtepMac for 10.42.1.7's node. 
2022/01/07 01:31:37 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:31:37 [ERROR] [VxLAN] Vxlan manager could not get VtepMac for 10.42.1.6's node. 
2022/01/07 01:31:37 [INFO] [2022-01-07 01:31:37,345 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_cafevs1 
2022/01/07 01:31:38 [INFO] [2022-01-07 01:31:38,490 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_cafevs1 
2022/01/07 01:31:39 [INFO] [2022-01-07 01:31:39,007 f5_cccl.resource.resource INFO] Deleting IcrNode: /p1/10.42.1.7%0 
2022/01/07 01:31:50 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:31:50 [ERROR] [VxLAN] Vxlan manager could not get VtepMac for 10.42.1.6's node. 
2022/01/07 01:31:50 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:31:50 [INFO] [2022-01-07 01:31:50,416 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_tea 
2022/01/07 01:31:51 [INFO] [2022-01-07 01:31:51,401 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_tea 
2022/01/07 01:31:51 [INFO] [2022-01-07 01:31:51,765 f5_cccl.resource.resource INFO] Deleting IcrNode: /p1/10.42.1.6%0 
2022/01/07 01:31:52 [INFO] [2022-01-07 01:31:52,003 f5_cccl.resource.resource INFO] Creating ApiArp: /Common/k8s-10.42.0.249 
2022/01/07 01:31:52 [INFO] [2022-01-07 01:31:52,053 f5_cccl.resource.resource INFO] Creating ApiArp: /Common/k8s-10.42.0.248 
2022/01/07 01:31:52 [INFO] [2022-01-07 01:31:52,105 f5_cccl.resource.resource INFO] Deleting IcrArp: /Common/k8s-10.42.0.247 
2022/01/07 01:31:52 [INFO] [2022-01-07 01:31:52,167 f5_cccl.resource.resource INFO] Deleting IcrArp: /Common/k8s-10.42.0.246 
2022/01/07 01:32:05 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:32:05 [INFO] [2022-01-07 01:32:05,723 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_tea 
2022/01/07 01:32:07 [INFO] [2022-01-07 01:32:07,175 f5_cccl.resource.resource INFO] Creating ApiArp: /Common/k8s-10.42.0.250 
2022/01/07 01:36:53 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:36:54 [INFO] [2022-01-07 01:36:54,201 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_tea 
2022/01/07 01:36:55 [INFO] [2022-01-07 01:36:55,496 f5_cccl.resource.resource INFO] Creating ApiArp: /Common/k8s-10.42.0.251 
2022/01/07 01:37:05 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:37:05 [INFO] [2022-01-07 01:37:05,406 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_tea 
2022/01/07 01:37:06 [INFO] [2022-01-07 01:37:06,698 f5_cccl.resource.resource INFO] Creating ApiArp: /Common/k8s-10.42.0.252 
2022/01/07 01:37:16 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:37:16 [INFO] [2022-01-07 01:37:16,493 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_tea 
2022/01/07 01:37:17 [INFO] [2022-01-07 01:37:17,999 f5_cccl.resource.resource INFO] Creating ApiArp: /Common/k8s-10.42.0.253 
2022/01/07 01:37:27 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:37:28 [INFO] [2022-01-07 01:37:28,080 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_tea 
2022/01/07 01:37:29 [INFO] [2022-01-07 01:37:29,410 f5_cccl.resource.resource INFO] Creating ApiArp: /Common/k8s-10.42.0.254 
2022/01/07 01:38:35 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:38:35 [ERROR] [VxLAN] Vxlan manager could not get VtepMac for 10.42.1.8's node. 
2022/01/07 01:38:36 [INFO] [2022-01-07 01:38:36,166 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_cafevs1 
2022/01/07 01:38:38 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:38:38 [ERROR] [VxLAN] Vxlan manager could not get VtepMac for 10.42.1.8's node. 
2022/01/07 01:38:39 [INFO] [2022-01-07 01:38:39,086 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_cafevs1 
2022/01/07 01:38:42 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:38:42 [ERROR] [VxLAN] Vxlan manager could not get VtepMac for 10.42.1.10's node. 
2022/01/07 01:38:42 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:38:42 [ERROR] [VxLAN] Vxlan manager could not get VtepMac for 10.42.1.10's node. 
2022/01/07 01:38:43 [INFO] [2022-01-07 01:38:43,260 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_cafevs1 
2022/01/07 01:38:44 [INFO] [2022-01-07 01:38:44,198 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_cafevs1 
2022/01/07 01:38:44 [INFO] [2022-01-07 01:38:44,605 f5_cccl.resource.resource INFO] Deleting IcrNode: /p1/10.42.0.248%0 
2022/01/07 01:40:24 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:40:24 [ERROR] [VxLAN] Vxlan manager could not get VtepMac for 10.42.1.10's node. 
2022/01/07 01:40:24 [INFO] [2022-01-07 01:40:24,964 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_tea 
2022/01/07 01:40:25 [INFO] [2022-01-07 01:40:25,728 f5_cccl.resource.resource INFO] Deleting IcrNode: /p1/10.42.0.254%0 
2022/01/07 01:40:35 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:40:35 [ERROR] [VxLAN] Vxlan manager could not get VtepMac for 10.42.1.10's node. 
2022/01/07 01:40:35 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:40:35 [ERROR] [VxLAN] Vxlan manager could not get VtepMac for 10.42.1.10's node. 
2022/01/07 01:40:35 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:40:36 [ERROR] [VxLAN] Vxlan manager could not get VtepMac for 10.42.1.10's node. 
2022/01/07 01:40:36 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:40:36 [INFO] [2022-01-07 01:40:36,280 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_tea 
2022/01/07 01:40:36 [ERROR] [VxLAN] Vxlan manager could not get VtepMac for 10.42.1.10's node. 
2022/01/07 01:40:38 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:40:38 [ERROR] [VxLAN] Vxlan manager could not get VtepMac for 10.42.1.10's node. 
2022/01/07 01:40:38 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:40:38 [ERROR] [VxLAN] Vxlan manager could not get VtepMac for 10.42.1.10's node. 
2022/01/07 01:40:38 [INFO] [2022-01-07 01:40:38,897 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_tea 
2022/01/07 01:40:39 [INFO] [2022-01-07 01:40:39,492 f5_cccl.resource.resource INFO] Deleting IcrNode: /p1/10.42.0.252%0 
2022/01/07 01:40:39 [INFO] [2022-01-07 01:40:39,577 f5_cccl.resource.resource INFO] Deleting IcrNode: /p1/10.42.0.253%0 
2022/01/07 01:40:40 [INFO] [2022-01-07 01:40:40,201 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_tea 
2022/01/07 01:40:40 [INFO] [2022-01-07 01:40:40,580 f5_cccl.resource.resource INFO] Deleting IcrNode: /p1/10.42.0.251%0 
2022/01/07 01:40:44 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:40:45 [ERROR] [VxLAN] Vxlan manager could not get VtepMac for 10.42.1.10's node. 
2022/01/07 01:40:45 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:40:45 [ERROR] [VxLAN] Vxlan manager could not get VtepMac for 10.42.1.10's node. 
2022/01/07 01:40:45 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:40:45 [ERROR] [VxLAN] Vxlan manager could not get VtepMac for 10.42.1.10's node. 
2022/01/07 01:40:45 [INFO] [2022-01-07 01:40:45,531 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_tea 
2022/01/07 01:40:46 [INFO] [2022-01-07 01:40:46,813 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_tea 
2022/01/07 01:40:47 [INFO] [CCCL] Wrote 0 Virtual Server and 2 IApp configs 
2022/01/07 01:40:47 [INFO] [2022-01-07 01:40:47,417 f5_cccl.resource.resource INFO] Deleting IcrNode: /p1/10.42.0.249%0 
2022/01/07 01:40:47 [ERROR] [VxLAN] Vxlan manager could not get VtepMac for 10.42.1.10's node. 
2022/01/07 01:40:47 [INFO] [2022-01-07 01:40:47,539 f5_cccl.resource.resource INFO] Deleting IcrNode: /p1/10.42.0.250%0 
2022/01/07 01:40:47 [INFO] [2022-01-07 01:40:47,976 f5_cccl.resource.resource INFO] Updating ApiApplicationService: /p1/default_tea 
  4. View the pod IPs:
[root@cluster1-m1 1]# kubectl get pod  -o wide
NAME                     READY   STATUS    RESTARTS   AGE     IP           NODE          NOMINATED NODE   READINESS GATES
coffee-87b9987b4-lzch9   1/1     Running   0          3m58s   10.42.1.10   cluster1-w1   <none>           <none>
coffee-87b9987b4-nmsk2   1/1     Running   0          4m6s    10.42.1.8    cluster1-w1   <none>           <none>
coffee-87b9987b4-q778v   1/1     Running   0          4m2s    10.42.1.9    cluster1-w1   <none>           <none>
tea-67977d68b-4qbcz      1/1     Running   0          117s    10.42.0.6    cluster1-m1   <none>           <none>
tea-67977d68b-664f9      1/1     Running   0          116s    10.42.0.7    cluster1-m1   <none>           <none>
tea-67977d68b-8xtwl      1/1     Running   0          2m8s    10.42.0.5    cluster1-m1   <none>           <none>
tea-67977d68b-j2lb8      1/1     Running   0          114s    10.42.0.8    cluster1-m1   <none>           <none>
tea-67977d68b-rxwkl      1/1     Running   0          2m8s    10.42.0.3    cluster1-m1   <none>           <none>
tea-67977d68b-tsc6k      1/1     Running   0          2m8s    10.42.0.2    cluster1-m1   <none>           <none>
  5. View the VE ARP list on the BIG-IP: the new pod IPs were not added, even for the pods running on the normal, healthy worker node.
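
For reference, here is a minimal, hedged sketch of how the VtepMAC can be read from the flannel annotation shown in step 1. This is illustrative only (the function name and signature are not CIS's actual getVtepMac): once the backend-data annotation is removed from a node, any lookup of this kind fails, which is what produces the "could not get VtepMac" errors in the log above.

	package main

	import (
		"encoding/json"
		"fmt"
	)

	// lookupVtepMAC is an illustrative stand-in for CIS's VtepMAC lookup: it reads
	// the flannel backend-data annotation from a node's annotation map and extracts
	// the VtepMAC. A missing or malformed annotation (as in step 1) yields an error.
	func lookupVtepMAC(nodeName string, annotations map[string]string) (string, error) {
		raw, ok := annotations["flannel.alpha.coreos.com/backend-data"]
		if !ok {
			return "", fmt.Errorf("%s has no flannel backend-data annotation", nodeName)
		}
		var backend struct {
			VtepMAC string `json:"VtepMAC"`
		}
		if err := json.Unmarshal([]byte(raw), &backend); err != nil || backend.VtepMAC == "" {
			return "", fmt.Errorf("%s has a malformed backend-data annotation: %q", nodeName, raw)
		}
		return backend.VtepMAC, nil
	}

	func main() {
		healthy := map[string]string{
			"flannel.alpha.coreos.com/backend-data": `{"VtepMAC":"5a:de:e9:80:38:7e"}`,
		}
		broken := map[string]string{} // annotation removed, as in step 1
		fmt.Println(lookupVtepMAC("cluster1-m1", healthy)) // 5a:de:e9:80:38:7e <nil>
		fmt.Println(lookupVtepMAC("cluster1-w1", broken))  // error
	}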

Expected Result

CIS logs the error, but the nodepoller and ARP process keep working.

Actual Result

CIS logs the error, and the nodepoller and ARP process stop working.

Diagnostic Information

The CIS yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: k8s-bigip-ctlr1
  name: cc-k8s-to-bigip1
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: k8s-bigip-ctlr1
  template:
    metadata:
      labels:
        app: k8s-bigip-ctlr1
      name: k8s-bigip-ctlr1
    spec:
      containers:
      - args:
        - --bigip-username=$(BIGIP_USERNAME)
        - --bigip-password=$(BIGIP_PASSWORD)
        - --manage-ingress=false
        - --bigip-partition=partition1
        - --bigip-url=https://10.1.20.252
        - --pool-member-type=cluster
        - --flannel-name=/Common/flannel_vxlan
        - --insecure=true
        - --agent=cccl
        command:
        - /app/bin/k8s-bigip-ctlr
        env:
        - name: BIGIP_USERNAME
          valueFrom:
            secretKeyRef:
              key: username
              name: bigip-login1
              optional: false
        - name: BIGIP_PASSWORD
          valueFrom:
            secretKeyRef:
              key: password
              name: bigip-login1
              optional: false
        image: f5networks/k8s-bigip-ctlr:2.7.0
        imagePullPolicy: Always
        name: k8s-bigip-ctlr1
      serviceAccount: bigip-ctlr
      serviceAccountName: bigip-ctlr

Observations (if any)

It may be related to flannel issue #1122, "Flannel Annotations flannel.alpha.coreos.com issue".

kkfinkkfin avatar Jan 07 '22 01:01 kkfinkkfin

@kkfinkkfin - why do you want to manually remove the flannel annotation??

To simulate a node losing its VtepMAC, we edit the node yaml to remove the annotation flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"5a:de:e9:80:38:7e"}'

Based on the flannel bug, it's an infra issue. Per the CIS documentation, the flannel configuration and other infrastructure are expected to be valid for CIS to function properly.

IMO, this issue is invalid.

trinaths avatar Jan 07 '22 03:01 trinaths

@kkfinkkfin please share steps to reproduce this issue without manual intervention. This helps in understanding the use case better.

trinaths avatar Jan 07 '22 03:01 trinaths

@trinaths Maybe we can make a small enhancement: let the pods that sit on nodes with the correct flannel annotations still be updated on the F5.

myf5 avatar Jan 07 '22 04:01 myf5

@trinaths Maybe we can make a small enhancement: let the pods that sit on nodes with the correct flannel annotations still be updated on the F5.

@myf5 not sure what exactly you want to achieve with this.

trinaths avatar Jan 07 '22 05:01 trinaths

This would make CIS more robust: even if only one node has wrong annotations, the other, normal nodes can still work with CIS, so the customer's business can continue.

myf5 avatar Jan 07 '22 07:01 myf5

@myf5 perfect! Please share steps to reproduce this issue without manual intervention. This helps in understanding the use case better.

trinaths avatar Jan 07 '22 07:01 trinaths

I can't understand why you keep asking to reproduce the problem without manual intervention. As @kkfinkkfin said, the root cause of the flannel issue cannot always be reproduced. But once the flannel issue happens, the result is a wrong annotation on some nodes. Since the result of the flannel issue is consistent, why can't we mock that result? Is it really so important for flannel itself to trigger the issue? Why can't CIS update the ARP entries for the other pods running on the normal nodes?

myf5 avatar Jan 07 '22 10:01 myf5

@myf5 - Manual intervention to reproduce a problem doesn't help us properly fix an issue, because it doesn't cover the real cause. Reproducing an issue without manual intervention helps us clearly understand:

  • how often the issue happens
  • whether the issue is related to the user's infrastructure
  • how cleanly and robustly CIS can act on the issue
  • whether the fix will enhance or compromise UX
  • and more

Robustness of CIS is a vast space, and we prioritize by impact and level of UX. Issues like this one, reproduced only with manual intervention, don't give a complete picture of the problem, just a part of it. A fix in CIS based on issues caused by manual intervention doesn't scale well and may compromise UX.

The clearer the issue (without manual intervention), the more robust the fix.

trinaths avatar Jan 07 '22 12:01 trinaths

CIS does not need to care about how often flannel triggers the issue. CIS should care about the result of the issue (the annotation errors) and take appropriate action so that CIS works at a production level. There are plenty of problems whose root cause cannot be found, but that does not mean we should just wait and take no action to remedy them.

From my point of view: more and more customers are using CIS in production environments, and some of my customers have put it into financial trading systems. CIS should take quality and performance more seriously. More confidence from customers means more opportunities to expand.

myf5 avatar Jan 10 '22 00:01 myf5

@myf5

I guess a quick hack/fix is to remove the log.Errorf and return and use log.Infof("[VxLAN] %v", err) instead: https://github.com/F5Networks/k8s-bigip-ctlr/blob/master/pkg/vxlan/vxlanMgr.go#L229-L233

                var mac string                                                   
                mac, err = getVtepMac(pod, kubePods, kubeNodes)                  
                if nil != err {                                                  
                        log.Errorf("[VxLAN] %v", err)                            
                        return                                                   
                }                                                        

vincentmli avatar Jan 10 '22 02:01 vincentmli

@myf5

I guess a quick hack/fix is to remove the log.Errorf and return and use log.Infof("[VxLAN] %v", err) instead: https://github.com/F5Networks/k8s-bigip-ctlr/blob/master/pkg/vxlan/vxlanMgr.go#L229-L233

If we only log an informational message, the pod ARP entries for the node missing the flannel annotation will end up with an empty MAC. I'm not sure whether BIG-IP mcpd would accept a pod ARP entry with an empty MAC, and even if it did, traffic to that pod would not work. Maybe when a node is missing the flannel annotation we could use a fake MAC instead: when the customer notices traffic to the pod with the fake MAC isn't working, they can check the CIS log, and an informational message about the node missing the flannel annotation would give them a clue to fix the flannel issue.

vincentmli avatar Jan 10 '22 03:01 vincentmli

@myf5 I guess a quick hack/fix is to remove the log.Errorf and return and use log.Infof("[VxLAN] %v", err) instead: https://github.com/F5Networks/k8s-bigip-ctlr/blob/master/pkg/vxlan/vxlanMgr.go#L229-L233

If we only log an informational message, the pod ARP entries for the node missing the flannel annotation will end up with an empty MAC. I'm not sure whether BIG-IP mcpd would accept a pod ARP entry with an empty MAC, and even if it did, traffic to that pod would not work. Maybe when a node is missing the flannel annotation we could use a fake MAC instead: when the customer notices traffic to the pod with the fake MAC isn't working, they can check the CIS log, and an informational message about the node missing the flannel annotation would give them a clue to fix the flannel issue.

Maybe just ignore the pods on the nodes with the wrong annotations, and have CIS log a notice telling customers that some nodes have wrong annotations and that their pods were ignored, please fix it. That way the customer can clearly see the reason.

myf5 avatar Jan 10 '22 06:01 myf5

@myf5 I guess a quick hack/fix is to remove the log.Errorf and return and use log.Infof("[VxLAN] %v", err) instead: https://github.com/F5Networks/k8s-bigip-ctlr/blob/master/pkg/vxlan/vxlanMgr.go#L229-L233

If we only log an informational message, the pod ARP entries for the node missing the flannel annotation will end up with an empty MAC. I'm not sure whether BIG-IP mcpd would accept a pod ARP entry with an empty MAC, and even if it did, traffic to that pod would not work. Maybe when a node is missing the flannel annotation we could use a fake MAC instead: when the customer notices traffic to the pod with the fake MAC isn't working, they can check the CIS log, and an informational message about the node missing the flannel annotation would give them a clue to fix the flannel issue.

Maybe just ignore the pods on the nodes with the wrong annotations, and have CIS log a notice telling customers that some nodes have wrong annotations and that their pods were ignored, please fix it. That way the customer can clearly see the reason.

Yes, I like this idea, and it is even simple to implement; the following should do it. It is up to the CIS team to fix it, though, I'm just making a suggestion here:

		mac, err = getVtepMac(pod, kubePods, kubeNodes)
		if nil != err {
			log.Infof("[VxLAN] %v", err)
			continue
		}
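
For context, here is a minimal runnable sketch of why continue behaves better than return inside that loop. The types, the stubbed getVtepMac, and the pod/node names below are assumptions for illustration only, not the actual vxlanMgr code: with continue, only the pods on the node with the missing annotation are dropped, so ARP entries for pods on healthy nodes are still pushed, whereas the current log.Errorf plus return lets the first broken node abort the whole ARP update.

	package main

	import (
		"fmt"
		"log"
	)

	// Stand-in types and a stubbed lookup, for demonstration only.
	type pod struct {
		name, ip, node string
	}

	func getVtepMac(p pod, vtepByNode map[string]string) (string, error) {
		mac, ok := vtepByNode[p.node]
		if !ok {
			return "", fmt.Errorf("could not get VtepMac for %s's node", p.ip)
		}
		return mac, nil
	}

	func main() {
		// cluster1-w1 is the node whose flannel annotation was removed.
		vtepByNode := map[string]string{"cluster1-m1": "5a:de:e9:80:38:7e"}
		pods := []pod{
			{"coffee", "10.42.1.10", "cluster1-w1"}, // node with broken annotation
			{"tea", "10.42.0.6", "cluster1-m1"},     // healthy node
		}

		arps := map[string]string{}
		for _, p := range pods {
			mac, err := getVtepMac(p, vtepByNode)
			if err != nil {
				// A "return" here (the current behavior) would abandon every
				// remaining pod; "continue" skips only this one.
				log.Printf("[VxLAN] %v", err)
				continue
			}
			arps[p.ip] = mac
		}
		fmt.Println("ARP entries to push:", arps) // the tea pod still gets its entry
	}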

vincentmli avatar Jan 10 '22 16:01 vincentmli

@myf5 Were you able to validate this workaround? Is this suggestion appropriate? Please share your findings.

Yes, I like this idea, and it is even simple to implement; the following should do it. It is up to the CIS team to fix it, though, I'm just making a suggestion here:

		mac, err = getVtepMac(pod, kubePods, kubeNodes)
		if nil != err {
			log.Infof("[VxLAN] %v", err)
			continue
		}

trinaths avatar Jan 26 '22 05:01 trinaths

Created CONTCNTR-3139 for internal tracking.

trinaths avatar Jan 28 '22 08:01 trinaths

Closing as no further discussion.

trinaths avatar Feb 13 '24 16:02 trinaths