
ops-view causing kube-api to be unavailable from node

Open integrii opened this issue 5 years ago • 18 comments

Reproduce:

  • Deploy ops-view on a cluster with hundreds of nodes
  • Increase resource requests and limits for redis and ops-view so they do not crash
  • Start checking that your node can communicate with the kubernetes API
  • Wait 24 hours

Results: You will notice various 20-minute periods during which the node running ops-view cannot communicate with the cluster IP of the kube-api service. This affects CNI IP provisioning, pod creation, pod termination, and anything else that requires the API. The issue resolves after a period of time and everything begins working again. Other nodes are not affected by this connection blocking.

Assumptions: Something ops-view does is overwhelming the API server, causing it to stop accepting requests from the node running ops-view.

Notes: We have reproduced this on five different clusters. The error only came to light because we monitor our clusters with Kuberhealthy, which runs synthetic tests to ensure all pod creation and removal is working in a timely fashion.

integrii avatar Feb 25 '19 20:02 integrii

@integrii what resource limits do you give the API server? Either the API server or etcd might be the bottleneck. kube-ops-view currently polls all pods/nodes periodically (no watch).
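
For illustration, the difference between periodic polling and a watch, sketched here with the official kubernetes Python client (an assumption for the example; this is not kube-ops-view's actual code):

# Polling: a full LIST of every pod, repeated on a timer.
from kubernetes import client, config, watch
config.load_incluster_config()          # or config.load_kube_config() outside the cluster
v1 = client.CoreV1Api()
pods = v1.list_pod_for_all_namespaces().items   # one big response per cycle

# Watch: one long-lived connection, the server pushes only changes.
w = watch.Watch()
for event in w.stream(v1.list_pod_for_all_namespaces, timeout_seconds=300):
    print(event["type"], event["object"].metadata.name)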

hjacobs avatar Feb 25 '19 22:02 hjacobs

The API server is the default that kops provisions, which I think is the default of Kubernetes.

How much work is it to re-implement your dashboard with a watch? We absolutely love the display and were running it on the cafe screens for a couple days. I am sad that I can't run this on our large production clusters.

integrii avatar Feb 26 '19 17:02 integrii

We ran into a similar problem with ops-view. We noticed that it uses a huge amount of tcp_mem leading me to believe it somehow leaks either API connections or websocket connections for the UI. Once we uninstalled ops-view our cluster got back to normal operation.

geNAZt avatar Jun 10 '19 15:06 geNAZt

Another case here as well (the same as @geNAZt): after leaving kube-ops-view running over the weekend, I returned to the office to find the node hosting the ops-view application barely responsive, with TCP mem at ~249k.

zerkms avatar Aug 19 '19 03:08 zerkms

[Screenshot from 2019-08-20 09-31-09 attached]

That's a 20-hour overview of sockstat (the mem metric).

The step down at the beginning is a kube-ops-view app restart.

zerkms avatar Aug 19 '19 21:08 zerkms

@zerkms thanks for the insights, I'll try to reproduce the behavior.

hjacobs avatar Aug 20 '19 07:08 hjacobs

Another important detail: that leak happens regardless, but the rate of leaking is significantly higher if the app and redis run on different nodes.

So for myself I used pod affinity to run them together and created a cronjob that restarts the app every several hours :man_shrugging:

zerkms avatar Aug 21 '19 00:08 zerkms

@zerkms does it only happen when using Redis? (kube-ops-view also works without Redis)

hjacobs avatar Aug 21 '19 07:08 hjacobs

Oh, I did not know that. I'll check that tomorrow, but from my other observations the answer 99% likely would be "yes".

zerkms avatar Aug 21 '19 07:08 zerkms

I also have a demo setup deployed on https://kube-ops-view.demo.j-serv.de/ (it uses Redis), I did not observe the memory leak there so far, but I also don't have a browser window open permanently :smirk:

hjacobs avatar Aug 21 '19 07:08 hjacobs

I did not observe the memory leak

How did you check that? It's not an RSS leak, it's a kernel TCP socket memory leak.

Check cat /proc/net/sockstat: the mem field is the one of interest.

zerkms avatar Aug 21 '19 07:08 zerkms

@zerkms you are right, what I wanted to say is that I did not observe the symptom/impact of any memory leak (unresponsive node):

cat /proc/net/sockstat | grep TCP.*mem
TCP: inuse 49 orphan 0 tw 33 alloc 484 mem 29990

hjacobs avatar Aug 21 '19 12:08 hjacobs

I did not observe the symptom/impact of any memory leak (unresponsive node):

Check cat /proc/sys/net/ipv4/tcp_mem

When the mem value in sockstat reaches the number in the third column (the hard limit), the kernel starts having problems creating new network connections or using existing ones.

In our case it takes about 3 days to get an almost unresponsive node.
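
A minimal sketch that reads the two files mentioned above and reports how close the mem value is to that third-column hard limit (assuming the standard /proc layouts shown in this thread):

# Compare sockstat's TCP "mem" (in pages) against the tcp_mem hard limit.
def tcp_mem_usage():
    with open("/proc/net/sockstat") as f:
        for line in f:
            if line.startswith("TCP:"):
                fields = line.split()
                used = int(fields[fields.index("mem") + 1])
                break
    with open("/proc/sys/net/ipv4/tcp_mem") as f:
        low, pressure, high = (int(x) for x in f.read().split())
    return used, high

used, high = tcp_mem_usage()
print(f"TCP mem: {used}/{high} pages ({100 * used / high:.1f}% of the hard limit)")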

zerkms avatar Aug 21 '19 12:08 zerkms

@hjacobs breaking news: the leak only happens when there are browser clients with the application open.

And another important note: when there are clients showing the page, the rate of leaking is lower when redis and the app run on the same node, and higher when they run on different nodes. (It leaks either way, but it's a curious observation as well.)

zerkms avatar Aug 22 '19 00:08 zerkms

@zerkms thanks for all your investigations. kube-ops-view is definitely causing high TCP mem, as I can also see on my K3s demo:

root@k3s-demo:~# cat /proc/sys/net/ipv4/tcp_mem
22200	29602	44400
root@k3s-demo:~# cat /proc/net/sockstat | grep TCP.*mem
TCP: inuse 50 orphan 0 tw 34 alloc 492 mem 30335
root@k3s-demo:~# cat /proc/sys/net/ipv4/tcp_mem
22200	29602	44400
root@k3s-demo:~# kubectl get pod
NAME                                    READY   STATUS    RESTARTS   AGE
kube-ops-view-7b9dd46fd8-vfqgt          1/1     Running   0          12d
kube-ops-view-redis-577f846477-lrgk2    1/1     Running   0          25d
kube-resource-report-5f77c8f5d9-lwtnl   2/2     Running   0          10d
kube-web-view-6f8d9ff748-xcp5c          1/1     Running   0          35h
nginx-c9767ffdf-22tgk                   1/1     Running   0          25d
nginx-c9767ffdf-8qzqr                   1/1     Running   0          25d
root@k3s-demo:~# kubectl delete pod kube-ops-view-7b9dd46fd8-vfqgt
pod "kube-ops-view-7b9dd46fd8-vfqgt" deleted
root@k3s-demo:~# cat /proc/net/sockstat | grep TCP.*mem
TCP: inuse 48 orphan 1 tw 26 alloc 173 mem 3

=> TCP mem goes from 30335 kernel pages (?) to 3 kernel pages. Not sure what the unit is, but it appears to be kernel pages (4096 bytes).
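
If the unit really is 4 KiB kernel pages (an assumption), the numbers above translate to roughly:

print(30335 * 4096 / 2**20)  # ~118.5 MiB of kernel TCP buffer memory before the pod delete
print(3 * 4096 / 1024)       # ~12 KiB afterwards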

hjacobs avatar Aug 22 '19 07:08 hjacobs

Could reproduce it again:

cat /proc/net/sockstat | grep TCP.*mem
TCP: inuse 48 orphan 0 tw 28 alloc 697 mem 44386
...
kubectl delete pod kube-ops-view-7b9dd46fd8-g544t
pod "kube-ops-view-7b9dd46fd8-g544t" deleted
...
cat /proc/net/sockstat | grep TCP.*mem
TCP: inuse 53 orphan 3 tw 13 alloc 212 mem 38

hjacobs avatar Oct 16 '19 06:10 hjacobs

Did you have it open in a browser?

zerkms avatar Oct 16 '19 06:10 zerkms

Bump. I saw this software on a mailing list and remembered how cool it is. Any progress on reducing the client connection spam to the API server yet? I think refactoring to a reflector that watches the API could solve the problems here.
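
For what it's worth, a reflector in this sense would be one initial LIST followed by a WATCH that keeps a local cache in sync, so the dashboard can render from the cache instead of re-listing all nodes/pods on a timer. A minimal sketch with the official kubernetes Python client (hypothetical, not kube-ops-view's actual code):

from kubernetes import client, config, watch

config.load_incluster_config()  # assumes an in-cluster service account
v1 = client.CoreV1Api()

cache = {}
initial = v1.list_node()
for node in initial.items:
    cache[node.metadata.name] = node
resource_version = initial.metadata.resource_version

w = watch.Watch()
for event in w.stream(v1.list_node, resource_version=resource_version):
    obj = event["object"]
    if event["type"] == "DELETED":
        cache.pop(obj.metadata.name, None)
    else:
        cache[obj.metadata.name] = obj
    # render from `cache` here; on a 410 Gone, re-list and restart the watch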

integrii avatar Jul 17 '20 01:07 integrii