k8s-bigip-ctlr
CIS startup takes a long time
Setup Details
CIS Version : 2.7.1
Build: f5networks/k8s-bigip-ctlr:latest
BIG-IP Version: BIG-IP 15.1.4
Agent Mode: CCCL
Orchestration: K8S
Orchestration Version: kubernetes 1.18.20
Pool Mode: Cluster
Additional Setup details: Flannel CNI
Description
When the pod reports that the container is already running, the CIS container has not actually finished initialization, and that initialization takes quite a long time. With about 200 virtual servers in the cluster, it takes roughly 2 minutes.
Steps To Reproduce
- Create about 200 virtual-server ConfigMaps in Kubernetes
- Restart the CIS pod
- After the CIS pod reports Running, apply a new virtual-server ConfigMap
- Watch the logs for the `Creating ApiApplicationService` message
Expected Result
When CIS reports Running, it should already have completed the initial synchronization. Ideally, the CIS startup time should also be under 2 minutes.
Actual Result
When CIS reports Running, the initial synchronization has not actually completed; it took more than 2 minutes to finish the initial startup. And when the number of services in the cluster grows beyond 1,000, startup takes more than 30 minutes.
Diagnostic Information
CIS pod YAML (note creationTimestamp: "2022-03-04T07:23:21Z" and startedAt: "2022-03-04T07:23:28Z"):
[root@cluster1-m1 singlevs]# kubectl get pods -n kube-system | grep bigip
cc-k8s-to-bigip1-7bdc66cc75-j9qr8 1/1 Running 0 2m
[root@cluster1-m1 singlevs]# kubectl get pods -n kube-system cc-k8s-to-bigip1-7bdc66cc75-j9qr8 -o yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2022-03-04T07:23:21Z"
  generateName: cc-k8s-to-bigip1-7bdc66cc75-
  labels:
    app: k8s-bigip-ctlr
    pod-template-hash: 7bdc66cc75
  name: cc-k8s-to-bigip1-7bdc66cc75-j9qr8
  namespace: kube-system
spec:
  containers:
  - args:
    - --bigip-username=admin
    - --bigip-password=admin.F5demo.com
    - --bigip-url=https://10.1.20.252
    - --bigip-partition=p1
    - --pool-member-type=cluster
    - --flannel-name=/Common/flannel_vxlan
    - --insecure
    - --log-level=INFO
    - --agent=cccl
    - --manage-ingress=false
    - --namespace=default
    - --disable-teems=true
    command:
    - /app/bin/k8s-bigip-ctlr
    image: f5networks/k8s-bigip-ctlr:2.7.1
    imagePullPolicy: IfNotPresent
    name: k8s-bigip-ctlr
    resources: {}
    securityContext:
      capabilities: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: bigip-ctlr-token-zdqv4
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: cluster1-m1
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: bigip-ctlr
  serviceAccountName: bigip-ctlr
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: bigip-ctlr-token-zdqv4
    secret:
      defaultMode: 420
      secretName: bigip-ctlr-token-zdqv4
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-03-04T07:23:21Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-03-04T07:23:28Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-03-04T07:23:28Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-03-04T07:23:21Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://382205344cfd4288a580220e458817ef9e33cb82936f808426d32ec06e0e869f
    image: f5networks/k8s-bigip-ctlr:2.7.1
    imageID: docker-pullable://f5networks/k8s-bigip-ctlr@sha256:bbeaaefd039f2cce9c6aa3fed9c3815c790d40d25ee1eb7a61cc815af82a2eb8
    lastState: {}
    name: k8s-bigip-ctlr
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2022-03-04T07:23:28Z"
  hostIP: 172.16.240.134
  phase: Running
  podIP: 10.42.0.106
  podIPs:
  - ip: 10.42.0.106
  qosClass: BestEffort
  startTime: "2022-03-04T07:23:21Z"
CIS pod logs (note the two-minute gap between 07:23:28 [INFO] [INIT] Starting and 07:25:34 [INFO] [CCCL] Wrote 0 Virtual Server and 201 IApp configs):
2022/03/04 07:23:28 [INFO] [INIT] Starting: Container Ingress Services - Version: 2.7.1, BuildInfo: azure-1817-92d998588e99e6ce7b06da08aacb3f92e6fb2ca1
2022/03/04 07:23:28 [INFO] ConfigWriter started: 0xc0001db680
2022/03/04 07:23:28 [INFO] Started config driver sub-process at pid: 17
2022/03/04 07:23:28 [INFO] [INIT] Creating Agent for cccl
2022/03/04 07:23:28 [INFO] [CCCL] Initializing CCCL Agent
2022/03/04 07:23:28 [INFO] [CORE] NodePoller (0xc0002401b0) registering new listener: 0x17a7f00
2022/03/04 07:23:28 [INFO] [CORE] NodePoller (0xc0002401b0) registering new listener: 0x1759260
2022/03/04 07:23:28 [INFO] [CORE] NodePoller started: (0xc0002401b0)
2022/03/04 07:23:28 [INFO] [CORE] Not watching Ingress resources.
2022/03/04 07:23:28 [INFO] [CORE] Watching ConfigMap resources.
2022/03/04 07:23:28 [INFO] [CORE] Handling ConfigMap resource events.
2022/03/04 07:23:28 [INFO] [CORE] Not handling Ingress resource events.
2022/03/04 07:23:28 [INFO] [CORE] Registered BigIP Metrics
2022/03/04 07:23:29 [INFO] [2022-03-04 07:23:29,362 __main__ INFO] entering inotify loop to watch /tmp/k8s-bigip-ctlr.config614618021/config.json
2022/03/04 07:25:34 [INFO] [CCCL] Wrote 0 Virtual Server and 201 IApp configs
2022/03/04 07:25:37 [INFO] [2022-03-04 07:25:37,843 f5_cccl.resource.resource INFO] Creating ApiApplicationService: /p1/default_iapp-http
2022/03/04 07:25:48 [INFO] [2022-03-04 07:25:48,713 f5_cccl.resource.resource INFO] Creating ApiArp: /Common/k8s-10.42.0.69
2022/03/04 07:25:48 [INFO] [2022-03-04 07:25:48,777 f5_cccl.resource.resource INFO] Creating ApiArp: /Common/k8s-10.42.0.67
2022/03/04 07:25:48 [INFO] [2022-03-04 07:25:48,822 f5_cccl.resource.resource INFO] Creating ApiArp: /Common/k8s-10.42.0.68
The new virtual-server ConfigMap's creationTimestamp (creationTimestamp: "2022-03-04T07:23:31Z"):
[root@cluster1-m1 singlevs]# kubectl get configmap iapp-http -o yaml
apiVersion: v1
data:
  data: |
    {
      "virtualServer": {
        "backend": {
          "serviceName": "tea-svc",
          "servicePort": 80
        },
        "frontend": {
          "partition": "p1",
          "iapp": "/Common/iapp_http",
          "iappPoolMemberTable": {
            "name": "pool__members",
            "columns": [
              {"name": "addr", "kind": "IPAddress"},
              {"name": "port", "kind": "Port"},
              {"name": "connection_limit", "value": "0"}
            ]
          },
          "iappOptions": {
            "description": "iapp_http"
          },
          "iappVariables": {
            "pool__pool_to_use": "/#create_new#",
            "pool__addr": "10.1.20.135",
            "pool__port": "80",
            "vs__SNATConfig": "automap",
            "vs__ProfileTCP": "tcp",
            "vs__ProfileHTTP": "http",
            "vs__ProfileDefaultPersist": "none",
            "pool_lb": "least-connections-member",
            "monitor__Monitors": "tcp_default",
            "pool__irules": "none"
          }
        }
      }
    }
  schema: f5schemadb://bigip-virtual-server_v0.1.7.json
kind: ConfigMap
metadata:
  creationTimestamp: "2022-03-04T07:23:31Z"
  labels:
    f5type: virtual-server
  name: iapp-http
  namespace: default
  resourceVersion: "1943031"
  selfLink: /api/v1/namespaces/default/configmaps/iapp-http
  uid: 5d7577a1-258d-4695-8bff-438136ebb9c3
Observations (if any)
Created [CONTCNTR-3181] for internal tracking.
PR https://github.com/F5Networks/k8s-bigip-ctlr/pull/2299 was created for this issue, but we need more testing and discussion before we have a proper fix.
@kkfinkkfin / @olvandeng / @zongzw, is this issue raised for CCCL mode or AS3 mode? I see that PR #2299 mainly targets a fix for AS3 mode.
Exactly, this issue is about CCCL. However, AS3 mode also has the very problem that #2299 is aiming at.
So we may need your help/effort to analyze the pain points that cause the long wait in CCCL mode.
From my understanding, because CCCL mode has very similar logic and re-handles all existing ConfigMaps at startup, CCCL should have the same long-wait problem.
Glad to show you the details of #2299. However, after more analysis it is still only a partial fix, because the multiple goroutines will still block at ConfigDeployer's `for msgReq := range am.ReqChan` loop.
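To make that blocking point concrete, here is a minimal, self-contained Go sketch. It is not the actual CIS code; the channel name, payloads, and 200ms delay are illustrative assumptions. It shows that no matter how many goroutines enqueue work, a single for-range consumer still deploys one message at a time.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Minimal sketch (not CIS code): many producers enqueue config messages
// concurrently, but a single consumer draining the channel in a for-range
// loop still deploys them one at a time.
func main() {
	reqChan := make(chan string, 100) // stands in for am.ReqChan
	var wg sync.WaitGroup

	// Parallel producers: analogous to parallel vsQueue workers.
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			reqChan <- fmt.Sprintf("config-%d", id) // enqueueing is fast
		}(i)
	}
	go func() { wg.Wait(); close(reqChan) }()

	start := time.Now()
	// Single consumer: analogous to ConfigDeployer's
	// `for msgReq := range am.ReqChan` loop.
	for msg := range reqChan {
		time.Sleep(200 * time.Millisecond) // stand-in for one serial deployment
		fmt.Println("deployed", msg)
	}
	// Total time is roughly 10 * 200ms regardless of how many producers ran.
	fmt.Println("elapsed:", time.Since(start))
}
```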
@zongzw, to solve this problem we don't want to parallelize the vsQueue, as it may result in a partial AS3 declaration on the backend, which could override config on the BIG-IP depending on which resources are being processed in the parallel vsQueue. Instead, we want to process all the resources in the vsQueue together and make sure that no duplicate resources are processed in the queue. We can also reduce the vsQueue size if we sync by namespace instead of by each resource type. We did something similar for Ingress processing; check out PR https://github.com/F5Networks/k8s-bigip-ctlr/pull/1848
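As a rough illustration of that batching idea, the sketch below is hypothetical code, not the project's actual vsQueue or sync logic: it drains whatever is currently queued, collapses duplicate keys, and hands the whole batch to a single sync pass, so a startup flood of ConfigMaps results in one deployment rather than one per item.

```go
package main

import "fmt"

// Minimal sketch (not the actual CIS vsQueue): drain everything that is
// currently queued, drop duplicate keys, and run a single sync over the
// batch instead of one sync (and one AS3/CCCL post) per item.
func drainAndDedup(queue chan string) []string {
	seen := make(map[string]bool)
	var batch []string
	for {
		select {
		case key := <-queue:
			if !seen[key] { // duplicates collapse into one entry
				seen[key] = true
				batch = append(batch, key)
			}
		default:
			return batch // queue momentarily empty: sync the whole batch
		}
	}
}

func main() {
	queue := make(chan string, 1000)
	// Startup floods the queue; the same ConfigMap may be enqueued twice.
	for _, key := range []string{"default/iapp-http", "default/iapp-http", "default/tea-cfg"} {
		queue <- key
	}
	batch := drainAndDedup(queue)
	fmt.Println("one sync for", len(batch), "unique resources:", batch)
}
```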
There are two main difficulties to solve, in my mind:
- Long queuing in the vsQueue. You mentioned "reducing the vsQueue size", but every object enqueued in the vsQueue is a piece of the information needed to form the tree-like AS3 body. Filtering by namespace does not help when they all belong to the same namespace (which is usually the case).
- Serial AS3 deployment. No matter how fast the vsQueue is processed, or how quickly an info structure like `appMgr.resources` is arranged, the final step, posting AS3 to BIG-IP, HAS TO BE serial. In the `filter-tenant=true` case, the whole bundle of info is divided into tenants and posted to BIG-IP one by one, each of which costs at least 2~4s; if there are 100 groups (vs/pool/member), that is about 300s (see the sketch after this list).
I'm open to any way to solve the user's performance issue. But please show me code whenever you have it, as early as possible.
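A quick sketch of that serial-deployment arithmetic, using the assumed numbers from the estimate above (~3s per post, 100 tenant-sized chunks), so they are illustrative rather than measured:

```go
package main

import (
	"fmt"
	"time"
)

// Back-of-the-envelope sketch of the serial-deployment floor described
// above: if each POST to BIG-IP takes perPost and the config is split
// into n tenant-sized chunks posted one by one, startup cannot be faster
// than n * perPost, no matter how fast vsQueue itself is drained.
func serialFloor(n int, perPost time.Duration) time.Duration {
	return time.Duration(n) * perPost
}

func main() {
	// Assumed numbers from the discussion: ~3s per post, 100 groups.
	fmt.Println(serialFloor(100, 3*time.Second)) // prints 5m0s, i.e. ~300s
}
```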
@zongzw - Perfect! I recommend sharing the updated PR with your findings so that the team can review it. Please make sure your fixes do not introduce a breaking change.
I have posted PR #2299 to parallelize the vsQueue handling and shorten the startup time. However, as I mentioned above, it cannot completely solve the problem because of the serial AS3 working model. What do you / your team think about this issue? Any PR? I suppose you already have a solution after the past month.
@zongzw Since CIS is a client of AS3, please raise issues with AS3 if any. Based on the PR review comments and the comments above from the PD team, IMO we need to revisit the PRs with a better solution.
Sure, I'm open to any better and quicker solution.
@zongzw Perfect! Please resubmit PR #2299 with your findings.
@trinaths The customer is experiencing the problem with CCCL, so we'd like to focus on CCCL first. We'll keep you posted on the refreshed PR #2299 and any necessary changes to CCCL's serial behavior.
@olvandeng Sure. Thanks.
Closing as no further discussion.