
CIS startup takes long time

Open · kkfinkkfin opened this issue 2 years ago · 13 comments

Setup Details

CIS Version : 2.7.1
Build: f5networks/k8s-bigip-ctlr:latest
BIGIP Version: BIG-IP 15.1.4
Agent Mode: CCCL
Orchestration: K8S
Orchestration Version: kubernetes 1.18.20
Pool Mode: Cluster
Additional Setup details: Flannel CNI

Description

When the pod reports Running, the CIS container has not actually finished initializing, and this initialization takes quite a long time. With about 200 virtual servers in the cluster, it takes roughly 2 minutes.

Steps To Reproduce

  1. Create about 200 virtual-server ConfigMaps in Kubernetes (see the sketch after this list).
  2. Restart the CIS pod.
  3. After the CIS pod reports Running, apply a new virtual-server ConfigMap.
  4. Watch the logs for the "Creating ApiApplicationService" message for the new ConfigMap.
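
For step 1, a minimal client-go sketch (hypothetical, not part of the original report) can stamp out ~200 copies of a virtual-server ConfigMap to reproduce the load. The name prefix, count, and the shortened vsData payload are assumptions; in practice each copy would also need its own pool__addr.

// repro.go - hypothetical helper that creates ~200 "virtual-server"
// ConfigMaps so CIS has a large initial synchronization to perform.
// Assumes a working ~/.kube/config; vsData must hold a full payload like
// the iapp-http example shown later in this issue.
package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// Shortened placeholder; paste the full "data" JSON from the iapp-http
// ConfigMap here (each copy normally needs a unique pool__addr).
const vsData = `{"virtualServer": {}}`

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	for i := 0; i < 200; i++ {
		cm := &corev1.ConfigMap{
			ObjectMeta: metav1.ObjectMeta{
				Name:      fmt.Sprintf("iapp-http-%03d", i),
				Namespace: "default",
				Labels:    map[string]string{"f5type": "virtual-server"},
			},
			Data: map[string]string{
				"schema": "f5schemadb://bigip-virtual-server_v0.1.7.json",
				"data":   vsData,
			},
		}
		if _, err := cs.CoreV1().ConfigMaps("default").Create(context.TODO(), cm, metav1.CreateOptions{}); err != nil {
			log.Printf("create %s: %v", cm.Name, err)
		}
	}
}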

Expected Result

When the CIS pod reports Running, the initial synchronization should already be complete. Ideally, CIS startup should take less than 2 minutes.
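
A minimal Go sketch of the behavior being asked for here, not CIS's actual implementation: a readiness endpoint that stays unhealthy until the controller has written its first full configuration to the BIG-IP. A Kubernetes readinessProbe pointed at it would then keep the pod NotReady during startup. The /ready path and port 8080 are assumptions.

// Hypothetical readiness gate (not actual CIS code): the pod only reports
// Ready once the controller marks its initial synchronization as done.
package main

import (
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

var initialSyncDone atomic.Bool

// readyHandler returns 503 until the first full sync has completed, so a
// readinessProbe polling it keeps the pod NotReady while CIS is starting up.
func readyHandler(w http.ResponseWriter, _ *http.Request) {
	if !initialSyncDone.Load() {
		http.Error(w, "initial sync in progress", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/ready", readyHandler)

	go func() {
		// Stand-in for the real startup work: process every existing
		// virtual-server ConfigMap and push the first full config to the
		// BIG-IP, then flip the flag.
		time.Sleep(5 * time.Second)
		initialSyncDone.Store(true)
	}()

	log.Fatal(http.ListenAndServe(":8080", nil))
}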

Actual Result

When the CIS pod reports Running, the initial synchronization has not actually completed; the initial startup took more than 2 minutes. When the number of services in the cluster grows beyond 1,000, startup takes more than 30 minutes.

Diagnostic Information

CIS pod YAML (note the creationTimestamp and startedAt: "2022-03-04T07:23:28Z"):

[root@cluster1-m1 singlevs]# kubectl get pods -n kube-system | grep bigip
cc-k8s-to-bigip1-7bdc66cc75-j9qr8                 1/1     Running     0          2m
[root@cluster1-m1 singlevs]# kubectl get pods -n kube-system cc-k8s-to-bigip1-7bdc66cc75-j9qr8 -o yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2022-03-04T07:23:21Z"
  generateName: cc-k8s-to-bigip1-7bdc66cc75-
  labels:
    app: k8s-bigip-ctlr
    pod-template-hash: 7bdc66cc75
  name: cc-k8s-to-bigip1-7bdc66cc75-j9qr8
  namespace: kube-system
spec:
  containers:
  - args:
    - --bigip-username=admin
    - --bigip-password=admin.F5demo.com
    - --bigip-url=https://10.1.20.252
    - --bigip-partition=p1
    - --pool-member-type=cluster
    - --flannel-name=/Common/flannel_vxlan
    - --insecure
    - --log-level=INFO
    - --agent=cccl
    - --manage-ingress=false
    - --namespace=default
    - --disable-teems=true
    command:
    - /app/bin/k8s-bigip-ctlr
    image: f5networks/k8s-bigip-ctlr:2.7.1
    imagePullPolicy: IfNotPresent
    name: k8s-bigip-ctlr
    resources: {}
    securityContext:
      capabilities: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: bigip-ctlr-token-zdqv4
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: cluster1-m1
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: bigip-ctlr
  serviceAccountName: bigip-ctlr
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: bigip-ctlr-token-zdqv4
    secret:
      defaultMode: 420
      secretName: bigip-ctlr-token-zdqv4
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-03-04T07:23:21Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-03-04T07:23:28Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-03-04T07:23:28Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-03-04T07:23:21Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://382205344cfd4288a580220e458817ef9e33cb82936f808426d32ec06e0e869f
    image: f5networks/k8s-bigip-ctlr:2.7.1
    imageID: docker-pullable://f5networks/k8s-bigip-ctlr@sha256:bbeaaefd039f2cce9c6aa3fed9c3815c790d40d25ee1eb7a61cc815af82a2eb8
    lastState: {}
    name: k8s-bigip-ctlr
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2022-03-04T07:23:28Z"
  hostIP: 172.16.240.134
  phase: Running
  podIP: 10.42.0.106
  podIPs:
  - ip: 10.42.0.106
  qosClass: BestEffort
  startTime: "2022-03-04T07:23:21Z"

CIS pod logs (note the gap between 07:23:28 [INFO] [INIT] Starting and 07:25:34 [INFO] [CCCL] Wrote 0 Virtual Server and 201 IApp configs):

2022/03/04 07:23:28 [INFO] [INIT] Starting: Container Ingress Services - Version: 2.7.1, BuildInfo: azure-1817-92d998588e99e6ce7b06da08aacb3f92e6fb2ca1 
2022/03/04 07:23:28 [INFO] ConfigWriter started: 0xc0001db680 
2022/03/04 07:23:28 [INFO] Started config driver sub-process at pid: 17 
2022/03/04 07:23:28 [INFO] [INIT] Creating Agent for cccl 
2022/03/04 07:23:28 [INFO] [CCCL] Initializing CCCL Agent 
2022/03/04 07:23:28 [INFO] [CORE] NodePoller (0xc0002401b0) registering new listener: 0x17a7f00 
2022/03/04 07:23:28 [INFO] [CORE] NodePoller (0xc0002401b0) registering new listener: 0x1759260 
2022/03/04 07:23:28 [INFO] [CORE] NodePoller started: (0xc0002401b0) 
2022/03/04 07:23:28 [INFO] [CORE] Not watching Ingress resources. 
2022/03/04 07:23:28 [INFO] [CORE] Watching ConfigMap resources. 
2022/03/04 07:23:28 [INFO] [CORE] Handling ConfigMap resource events. 
2022/03/04 07:23:28 [INFO] [CORE] Not handling Ingress resource events. 
2022/03/04 07:23:28 [INFO] [CORE] Registered BigIP Metrics 
2022/03/04 07:23:29 [INFO] [2022-03-04 07:23:29,362 __main__ INFO] entering inotify loop to watch /tmp/k8s-bigip-ctlr.config614618021/config.json 
2022/03/04 07:25:34 [INFO] [CCCL] Wrote 0 Virtual Server and 201 IApp configs 
2022/03/04 07:25:37 [INFO] [2022-03-04 07:25:37,843 f5_cccl.resource.resource INFO] Creating ApiApplicationService: /p1/default_iapp-http 
2022/03/04 07:25:48 [INFO] [2022-03-04 07:25:48,713 f5_cccl.resource.resource INFO] Creating ApiArp: /Common/k8s-10.42.0.69 
2022/03/04 07:25:48 [INFO] [2022-03-04 07:25:48,777 f5_cccl.resource.resource INFO] Creating ApiArp: /Common/k8s-10.42.0.67 
2022/03/04 07:25:48 [INFO] [2022-03-04 07:25:48,822 f5_cccl.resource.resource INFO] Creating ApiArp: /Common/k8s-10.42.0.68 

The new virtual server's ConfigMap creationTimestamp (creationTimestamp: "2022-03-04T07:23:31Z"):

[root@cluster1-m1 singlevs]# kubectl get configmap iapp-http -o yaml
apiVersion: v1
data:
  data: |
    {
      "virtualServer": {
        "backend": {
          "serviceName": "tea-svc",
          "servicePort": 80
        },
        "frontend": {
          "partition": "p1",
          "iapp": "/Common/iapp_http",
          "iappPoolMemberTable": {
            "name": "pool__members",
            "columns": [
                {"name": "addr", "kind": "IPAddress"},
                {"name": "port", "kind": "Port"},
                {"name": "connection_limit", "value": "0"}
            ]
          },
          "iappOptions": {
            "description": "iapp_http"
          },
          "iappVariables": {
            "pool__pool_to_use": "/#create_new#",
            "pool__addr": "10.1.20.135",
            "pool__port": "80",
            "vs__SNATConfig": "automap",
            "vs__ProfileTCP": "tcp",
            "vs__ProfileHTTP": "http",
            "vs__ProfileDefaultPersist": "none",
            "pool_lb": "least-connections-member",
            "monitor__Monitors": "tcp_default",
            "pool__irules": "none"
           }
        }
      }
    }
  schema: f5schemadb://bigip-virtual-server_v0.1.7.json
kind: ConfigMap
metadata:
  creationTimestamp: "2022-03-04T07:23:31Z"
  labels:
    f5type: virtual-server
  name: iapp-http
  namespace: default
  resourceVersion: "1943031"
  selfLink: /api/v1/namespaces/default/configmaps/iapp-http
  uid: 5d7577a1-258d-4695-8bff-438136ebb9c3

kkfinkkfin avatar Mar 04 '22 07:03 kkfinkkfin

Created [CONTCNTR-3181] for internal tracking.

trinaths avatar Mar 08 '22 07:03 trinaths

PR https://github.com/F5Networks/k8s-bigip-ctlr/pull/2299 was created for this issue, but we need more testing and discussion before we have a proper fix.

olvandeng avatar Mar 17 '22 06:03 olvandeng

@kkfinkkfin / @olvandeng / @zongzw, is this issue raised for CCCL mode or AS3 mode? I see PR #2299 mainly targets a fix for AS3 mode.

vklohiya avatar Mar 24 '22 13:03 vklohiya

Exactly, this issue is related to CCCL. However, AS3 mode has the same problem, which #2299 is aiming at, so we may need your help/effort to analyze the pain points that cause the long wait in CCCL mode. From my understanding, CCCL mode has very similar logic in that it re-processes all existing ConfigMaps at startup, so it should suffer from the same long wait. Glad to show you the details of #2299. However, after more analysis it is still only a partial fix, because multiple goroutines will block at ConfigDeployer's for msgReq := range am.ReqChan.

zongzw avatar Mar 24 '22 13:03 zongzw
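
An illustrative sketch of the bottleneck described above (generic names, not the controller's actual types): even if many goroutines prepare requests in parallel, a single deployer loop draining the request channel serially, as in for msgReq := range am.ReqChan, bounds the total time at roughly the number of requests multiplied by the per-deploy cost.

// Illustrative only: parallel producers feeding a single serial consumer.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	reqChan := make(chan int, 16)
	var wg sync.WaitGroup

	// Many producers: analogous to parallel vsQueue workers preparing config.
	for p := 0; p < 8; p++ {
		wg.Add(1)
		go func(p int) {
			defer wg.Done()
			for i := 0; i < 25; i++ {
				reqChan <- p*25 + i
			}
		}(p)
	}
	go func() { wg.Wait(); close(reqChan) }()

	// Single consumer: analogous to the deployer's
	// "for msgReq := range am.ReqChan" loop, which handles one request
	// (one BIG-IP post) at a time no matter how many producers there are.
	start := time.Now()
	for req := range reqChan {
		_ = req
		time.Sleep(10 * time.Millisecond) // stand-in for one serial BIG-IP post
	}
	fmt.Printf("drained 200 requests serially in %v\n", time.Since(start))
}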

@zongzw, to solve this problem we don't want to parallelize the vsQueue, as that may result in a partial AS3 declaration in the backend, which could override configuration on the BIG-IP depending on which resources the parallel vsQueue workers happen to process. Instead, we want to process all the resources in the vsQueue together while making sure no duplicate resources are processed in the queue. We can also reduce the vsQueue size if we sync by namespace instead of by each resource type. We have done something similar for Ingress processing; check out PR: https://github.com/F5Networks/k8s-bigip-ctlr/pull/1848

vklohiya avatar Mar 28 '22 10:03 vklohiya
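
An illustrative sketch of the batching idea (the queue name and key are assumptions, not the controller's actual vsQueue keys): client-go's workqueue collapses an item that is re-added while it is still queued, so keying the queue by namespace turns a burst of ConfigMap events in the same namespace into a single combined sync.

// Illustrative only: 200 events keyed by namespace collapse to one sync.
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.NewNamed("vs-by-namespace")
	defer q.ShutDown()

	// 200 ConfigMap events arriving in the same namespace...
	for i := 0; i < 200; i++ {
		q.Add("default") // key = namespace, not the individual resource
	}

	// ...leave only a single pending item, i.e. one combined sync pass.
	fmt.Println("pending syncs:", q.Len()) // prints: pending syncs: 1
}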

There are two main difficulties to solve, in my mind:

  1. Long queuing in the vsQueue. You mentioned "reducing the vsQueue size", but every object enqueued in the vsQueue is a piece of information needed to build the tree-like AS3 body. Filtering by namespace does not help when they all belong to the same namespace (which is usually the case).
  2. Serial AS3 deployment. No matter how fast the vsQueue is processed, or how quickly info structures such as appMgr.resources are assembled, the final step of posting AS3 to the BIG-IP HAS TO BE serial. With filter-tenant=true, the whole bundle is divided into tenants and posted to the BIG-IP one by one, and each post costs at least 2~4 s; with 100 groups (vs/pool/member), that is about 300 s.

I'm open to any way to solve the user's performance issue. But please show me code as early as possible, whenever you have it.

zongzw avatar Mar 28 '22 13:03 zongzw

@zongzw - Perfect! I recommend sharing an updated PR with your findings so that the team can review it. Please make sure your fixes do not trigger a breaking change.

trinaths avatar Mar 30 '22 06:03 trinaths

I have posted PR #2299 to parallelize vsQueue handling and shorten startup time. However, as I mentioned above, it cannot solve the problem completely because of AS3's serial working model. What do you / your team think about this issue? Any PR? I assume you already have a solution after the past month.

zongzw avatar Mar 30 '22 06:03 zongzw

@zongzw Since CIS is a client of AS3, please raise any AS3 issues with the AS3 team. Based on the PR review comments and the comments above from the PD team, IMO we need to revisit the PRs with a better solution.

trinaths avatar Mar 30 '22 09:03 trinaths

Sure, I'm open to any better and quicker solution.

zongzw avatar Mar 30 '22 09:03 zongzw

@zongzw Perfect! Please resubmit PR #2299 with your findings.

trinaths avatar Mar 30 '22 09:03 trinaths

@trinaths The customer is experiencing the problem with CCCL, so we'd like to focus on CCCL first. We'll keep you posted on the refreshed PR #2299 and any necessary CCCL serialization changes.

olvandeng avatar Mar 30 '22 10:03 olvandeng

@olvandeng Sure. Thanks.

trinaths avatar Mar 30 '22 10:03 trinaths

Closing as there is no further discussion.

trinaths avatar Feb 13 '24 16:02 trinaths