
Resource watch re-subscription uses wrong resource version and floods the k8s API server

Open cmurphy opened this issue 3 years ago • 14 comments

Setup

  • Rancher version: 0e2fdc8
  • Browser type & version:

Describe the bug

The Vue dashboard uses websockets to subscribe to kubernetes events in the downstream cluster. The watch has a default timeout of 30 minutes. If the dashboard is left open for longer than that, the watch expires and the dashboard attempts to resubscribe. However, the watch request requires a resource version to start the watch from, and the dashboard is providing an invalid version, causing the subscription to fail. The page must be refreshed in order to properly restart the watch. In extreme cases at large scale, this causes the kubernetes API server to become overloaded.
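
For illustration only, here is a minimal TypeScript sketch of the subscribe / re-subscribe cycle described above. The message shapes are taken from the websocket frames shown later in this issue; the function and variable names are hypothetical and this is not the dashboard's actual code.

type SubscribeMsg = { resourceType: string; resourceVersion: string };

function subscribe(ws: WebSocket, resourceType: string, resourceVersion: string): void {
  // Outgoing frame, e.g. {"resourceType":"apps.statefulset","resourceVersion":"400478"}
  const msg: SubscribeMsg = { resourceType, resourceVersion };
  ws.send(JSON.stringify(msg));
}

function onMessage(ws: WebSocket, raw: string, lastKnownVersion: string): void {
  const msg = JSON.parse(raw);
  if (msg.name === 'resource.stop') {
    // The watch timed out (default 30 minutes) and must be restarted.
    // The bug: the dashboard re-subscribes with an individual item's
    // metadata.resourceVersion instead of the collection revision, so the
    // new watch is rejected as "too old", another resource.stop arrives,
    // and the loop repeats.
    subscribe(ws, msg.resourceType, lastKnownVersion);
  }
}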

To Reproduce

Create a downstream cluster. Turn on debug logging for the agent by setting CATTLE_DEBUG to true in the environment variable configuration for the cluster. Turn on verbose logging for the kubernetes API server by editing the YAML, and under rancher_kubernetes_engine_config->services->kube-api add a section

...
kube-api:
  extra_args:
    v: '3'
...

When the cluster is running, navigate to "Deployments" under "Workloads". Create at least one workload, such as a Deployment or StatefulSet, or deploy an App, then come back to the Workloads tab. Open the Network tab.

EDIT: reproducing this depends on the List resource version being extremely skewed from the resource version of the item that is picked as a base. To reproduce this consistently, create AND delete several (i.e. tens to hundreds of) Deployments, THEN navigate to the Workloads tab and refresh. If the watch is started while Deployments are being created and deleted, the resource version used may come from one of the websocket updates and the problem will not be visible, but if there is no activity happening, it will be picked from the last list response.

Result

In the Network tab, you can see the initial requests for the list of objects relevant to Workloads, for example:

GET https://<server>/k8s/clusters/<cluster>/v1/apps.statefulsets

:arrow_right:

{
    "type": "collection",
    "links": {
        "self": "..."
    },
    "createTypes": {
        "apps.statefulset": "..."
    },
    "actions": {},
    "resourceType": "apps.statefulset",
    "revision": "400478",
    "data": [
        {
            "id": "cattle-prometheus/prometheus-cluster-monitoring",
            "type": "apps.statefulset",
            "links": {
               ...
            },
            "apiVersion": "apps/v1",
            "kind": "StatefulSet",
            "metadata": {
                "name": "prometheus-cluster-monitoring",
                "namespace": "cattle-prometheus",
                 ...
                "resourceVersion": "17159",
            ....
        },
        {
            .... another statefulset ....
        }
    ]
}

In one of the websocket requests (for me it's always wss://<server>/k8s/clusters/<cluster>/v1/subscribe?sockId=3), you can see an outgoing websocket message to start a watch for that resource type and version:

:arrow_up_small: {"resourceType":"apps.statefulset","resourceVersion":"400478"}

After 30 minutes, the watch is closed (which is expected, because of the default timeout):

:arrow_down_small: {name: "resource.stop", resourceType: "apps.statefulset", data: {type: "subscribe", links: {}}}

We then see a new outgoing watch request, but now for a completely different resource version:

:arrow_up_small: {"resourceType":"apps.statefulset","resourceVersion":"17159"}

The watch immediately fails and is closed:

:arrow_down_small: {name: "resource.stop", resourceType: "apps.statefulset", data: {type: "subscribe", links: {}}}

and the cycle continues.

In the cluster agent debug logs, you can see the watch being stopped and started on repeat:

time="2022-05-19T23:16:34Z" level=debug msg="opening watcher for batch.job"
time="2022-05-19T23:16:34Z" level=debug msg="opening watcher for apps.statefulset"
time="2022-05-19T23:16:34Z" level=debug msg="opening watcher for namespace"
time="2022-05-19T23:16:34Z" level=debug msg="closing watcher for namespace"
time="2022-05-19T23:16:34Z" level=debug msg="closing watcher for batch.job"
time="2022-05-19T23:16:34Z" level=debug msg="closing watcher for apps.statefulset"
time="2022-05-19T23:16:36Z" level=debug msg="Wrote ping"
time="2022-05-19T23:16:36Z" level=debug msg="Wrote ping"
time="2022-05-19T23:16:37Z" level=debug msg="Wrote ping"
time="2022-05-19T23:16:37Z" level=debug msg="Wrote ping"
time="2022-05-19T23:16:40Z" level=debug msg="opening watcher for apps.statefulset"
time="2022-05-19T23:16:40Z" level=debug msg="opening watcher for batch.job"
time="2022-05-19T23:16:40Z" level=debug msg="closing watcher for batch.job"
time="2022-05-19T23:16:40Z" level=debug msg="closing watcher for apps.statefulset"
time="2022-05-19T23:16:40Z" level=debug msg="opening watcher for namespace"

On the kubernetes API server, you will see it being hammered by repeated GET requests whose URLs include watch=true.

It turns out this resourceVersion comes from one individual item in the list of resources (see the JSON above), not from the revision number that steve returns for the collection. Steve fails to initialize the watch because that resource version is "too old".

Note that kubernetes uses the field resourceVersion both for the version of the List and for the versions of the individual resources, whereas steve renames the List's resourceVersion to revision.

When re-subscribing after a timeout, the revision number for the whole collection needs to be used, not the resourceVersion of any one item in the collection.
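
As a minimal sketch of that distinction, assuming a list response shaped like the JSON above (the interface and function here are hypothetical, not dashboard code):

interface SteveCollection {
  resourceType: string;
  revision: string;                                        // collection-level version; steve's rename of the List resourceVersion
  data: Array<{ metadata: { resourceVersion: string } }>;  // per-item versions
}

function versionToWatchFrom(collection: SteveCollection): string {
  // Correct: the collection revision ("400478" in the example above).
  return collection.revision;
  // Incorrect (the bug): collection.data[0].metadata.resourceVersion
  // ("17159" above), which belongs to one item and can be far older
  // than the collection revision.
}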

Expected Result

After a timeout, a new watch is started successfully and is ongoing until the next timeout.

Screenshots

Additional context

  • This may not be 100% reproducible, because sometimes the item resource version is close enough to the collection revision that the watch still works, but that is only by accident. If you deploy additional workloads and apps without making updates to any of them, the collection revision number will eventually get significantly bigger while the original item version number stays the same, and you should see the issue occur.

  • To help with debugging, I've proposed a change to steve which will:

  1. return a resource.error event through the websocket if the connection is closed irregularly
  2. log the error in the cluster agent
  3. allow you to decrease the timeout from 30 minutes by setting CATTLE_WATCH_TIMEOUT_SECONDS to, e.g., 300 (5 minutes) in the cluster agent environment

The agent image with this change is published as ~~cmurpheus/rancher-agent:v2.6-head-0e2fdc-steve-debug.~~ EDIT: new image with latest changes at cmurpheus/rancher-agent:v2.6-head-f4e1de7-steve-debug

cmurphy avatar May 20 '22 00:05 cmurphy

Connects to internal reference SURE-3208 amongst others potentially (SURE-4484)

gaktive avatar May 20 '22 15:05 gaktive

Whoever takes this on, @cmurphy is happy to help.

gaktive avatar May 20 '22 20:05 gaktive

As noted in the linked PR (https://github.com/rancher/dashboard/pull/6024), we have a fix for this via Harvester but this needs to be reviewed first.

gaktive avatar May 24 '22 16:05 gaktive

@rak-phillip to look this over and sync with @cmurphy to see if the existing PR helps out here or if other work is required. Then we'd need to test.

gaktive avatar May 24 '22 16:05 gaktive

Can we also get an estimate of when this would have started happening? Has it existed since the beginning of the dashboard, or was it introduced at some point in time?

cbron avatar May 25 '22 17:05 cbron

I spoke with @cmurphy offline and I think we should merge rancher/steve#49 because it looks like it fixes the issue that we're experiencing today. We can still make some changes to Dashboard to improve on the behavior described above, but the change to Steve should resolve this issue:

  1. Update our caching mechanism to distinguish between list and item resources. I'm not seeing a way to improve on this with what's immediately available to us in Dashboard. We make use of the type (e.g. catalog.cattle.io.app) and attempt to look up the latest resourceVersion by making use of Math.max().
  2. I think we would be able to make use of resyncWatch if Steve was able to relay to Dashboard that the revision is too old in the resource.stop event. Dashboard currently doesn't have any information available to discern whether the cache is invalidated unless we receive a message that contains too old in the resource.error event. I think we'd be able to appropriately make use of the local cache and resyncWatch if we could make use of something like 410 Gone on the resource.stop event (see the sketch below). https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes
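
A rough sketch of point 2, assuming Steve relayed a 410 Gone-style reason when the revision is too old; the handler signature and the relist/watch helpers are hypothetical, and the dashboard's actual resyncWatch may differ.

// Hypothetical handler: when the backend signals that the cached revision is
// gone, invalidate the local cache, re-list to obtain a fresh collection
// revision, then re-watch from that revision.
async function onResourceError(
  msg: { resourceType: string; reason?: string },
  relist: (type: string) => Promise<{ revision: string }>,
  watch: (type: string, revision: string) => void
): Promise<void> {
  if (msg.reason === 'too old' || msg.reason === '410 Gone') {
    const fresh = await relist(msg.resourceType);  // replaces the invalidated cache
    watch(msg.resourceType, fresh.revision);       // resume watching from the new revision
  }
}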

@gaktive

rak-phillip avatar Jun 13 '22 17:06 rak-phillip

@cmurphy Was able to reproduce after creating/deleting around 30 deployments and waiting roughly 45 minutes.

Observed websockets with sockId1 and sockId3 starting and stopping various watches.

https://user-images.githubusercontent.com/85186645/173461659-67031962-df65-4c20-a896-b155b596c65b.mp4

Observed cattle-agent debug logs confirming that watchers were being opened and closed repeatedly. Screen Shot 2022-06-13 at 6 13 41 PM

Observed very high rate of WATCH requests on jobs resources, matching the observed watcher starts/stops. Screen Shot 2022-06-13 at 6 16 08 PM

git-ival avatar Jun 13 '22 22:06 git-ival

I tried to repro this recently and couldn't - I'm assuming that's because Colleen's PR was merged. I forced an incorrect resource version to be used - a random number - and confirmed that the socket has self-corrected and is still able to watch the resource. Or at least I THINK that is what this screenshot is showing: Screen Shot 2022-06-20 at 12 22 02 PM

@rak-phillip That's a great doc that you linked. In response to your comment:

We make use of the type (e.g. catalog.cattle.io.app) and attempt to look up the latest resourceVersion by making use of Math.max()

This part of the linked doc makes me think the dashboard shouldn't be comparing resource versions with Math.max() at all:

You must not assume resource versions are numeric or collatable. API clients may only compare two resource versions for equality (this means that you must not compare resource versions for greater-than or less-than relationships).

I think there should probably be a new tech debt issue opened about making some minimum enhancements. I agree that we should accept the 410 Gone response. And I think we should completely remove the logic (the Math.max) for assigning a version number if there is none. If we don't have a resourceVersion, instead of trying to find one based on comparisons, we just shouldn't provide one at all.
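
To illustrate treating resource versions as opaque, here is a sketch of tracking the version to resume from without any numeric comparison. The names are hypothetical, and whether steve's change events carry a revision field is an assumption.

// Resource versions are opaque strings: only equality checks are valid, so
// instead of Math.max() over item versions, simply remember the last version
// the server handed us.
let resumeVersion: string | undefined;

function onListResponse(collection: { revision: string }): void {
  resumeVersion = collection.revision;   // start (or restart) the watch from here
}

function onWatchEvent(event: { revision?: string }): void {
  if (event.revision) {
    resumeVersion = event.revision;      // always overwrite, never compare numerically
  }
}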

Side note on tools

I was also curious about this part of those upstream docs that mention tooling for watching K8s resources:

For subscribing to collections, Kubernetes client libraries typically offer some form of standard tool for this list-then-watch logic. (In the Go client library, this is called a Reflector and is located in the k8s.io/client-go/tools/cache package.)

I wonder why we have an in-house implementation of findAll, or whether there is a better tool that can be used with Vue. I found this example of watching a K8s collection with a JavaScript client, and it looks simpler than what we have: https://github.com/kubernetes-client/javascript/blob/master/examples/cache-example.js#L13 If we used that client, it could help us differentiate between lists and individual resources.
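
For comparison, a sketch of the list-then-watch pattern using that client's informer helper, based on its documented examples; exact method signatures vary between client versions, and in the dashboard's case requests would need to go through the Rancher proxy rather than straight to the cluster.

import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();

const coreApi = kc.makeApiClient(k8s.CoreV1Api);
const listFn = () => coreApi.listNamespacedPod('default');

// The informer performs the initial list, starts the watch from the returned
// resourceVersion, and handles re-listing when that version expires.
const informer = k8s.makeInformer(kc, '/api/v1/namespaces/default/pods', listFn);

informer.on('add', (pod: k8s.V1Pod) => console.log(`added: ${pod.metadata?.name}`));
informer.on('update', (pod: k8s.V1Pod) => console.log(`updated: ${pod.metadata?.name}`));
informer.on('delete', (pod: k8s.V1Pod) => console.log(`deleted: ${pod.metadata?.name}`));
informer.on('error', (err: unknown) => {
  console.error(err);
  setTimeout(() => informer.start(), 5000);  // restart the informer after a delay
});

informer.start();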

Side note on performance

Before reading the linked docs I thought a big reason the UI takes a long time to load was that etcd doesn't support pagination. But in fact, the K8s docs say the K8s API server does support pagination to make UIs faster and reduce server load:

On large clusters, retrieving the collection of some resource types may result in very large responses that can impact the server and client. For instance, a cluster may have tens of thousands of Pods, each of which is equivalent to roughly 2 KiB of encoded JSON. Retrieving all pods across all namespaces may result in a very large response (10-20MB) and consume a large amount of server resources.

Provided that you don't explicitly disable the APIListChunking feature gate, the Kubernetes API server supports the ability to break a single large collection request into many smaller chunks while preserving the consistency of the total request. Each chunk can be returned sequentially which reduces both the total size of the request and allows user-oriented clients to display results incrementally to improve responsiveness.

Since the K8s API supports pagination, there should probably be another performance issue about looking into pagination for the list views.
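
As a sketch of what paginated list calls look like, using the Kubernetes API's documented limit and continue query parameters; whether steve forwards these parameters unchanged through the /k8s/clusters/<cluster> proxy path is an assumption to verify.

// Fetch a large collection in chunks instead of one giant response.
async function listAllPods(baseUrl: string, token: string): Promise<unknown[]> {
  const items: unknown[] = [];
  let continueToken: string | undefined;

  do {
    const params = new URLSearchParams({ limit: '500' });
    if (continueToken) {
      params.set('continue', continueToken);
    }
    const res = await fetch(`${baseUrl}/api/v1/pods?${params}`, {
      headers: { Authorization: `Bearer ${token}` },
    });
    const page = await res.json();
    items.push(...page.items);
    continueToken = page.metadata?.continue;  // set by the server while more chunks remain
  } while (continueToken);

  return items;
}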

catherineluse avatar Jun 20 '22 16:06 catherineluse

@catherineluse

I tried to repro this recently and couldn't - I'm assuming that's because Colleen's PR was merged. I forced an incorrect resource version to be used - a random number - and confirmed that the socket has self-corrected and is still able to watch the resource. Or at least I THINK that is what this screenshot is showing:

The flood of requests was corrected, but the dashboard still exhibits incorrect behavior, which can be reproduced following the instructions above. This is what it looks like:

image

  1. The watch starts with the correct resource version.
  2. After a timeout, the watch stops.
  3. The watch is restarted with a seemingly out-of-nowhere revision number.
  4. With the steve fix in place, Rancher now returns an error.
  5. Dashboard responds to the error by restarting the watch again, now with a sensible version number.

Even though this now isn't directly impacting the user, I maintain that step 3 is not the right behavior: the dashboard should use a valid version number from the start instead of having to correct itself after feedback from the backend.

cmurphy avatar Jun 22 '22 00:06 cmurphy

Need to revisit in 2.7.0 with the backend team

nwmac avatar Jul 06 '22 17:07 nwmac

@catherineluse to sync with @Sean-McQ about this, @rak-phillip available to help as needed. We then need to sync with Team 1.

gaktive avatar Sep 13 '22 16:09 gaktive

Internal reference: SURE-5317

Question that floated up:

would tuning CATTLE_WATCH_TIMEOUT_SECONDS provide any relief?

gaktive avatar Oct 03 '22 15:10 gaktive

That setting is for debugging only. The default is 30 minutes. It does not matter what you change the setting to, because the problem is not how often it occurs, it is that it occurs at all. To reiterate what I said in https://github.com/rancher/dashboard/issues/5997#issuecomment-1162478045

The flood of requests was corrected, but the dashboard still exhibits incorrect behavior, which can be reproduced following the instructions above.

The flooding issue was worked around in steve, so users will mostly not notice it and no "relief" is needed. This issue is still open because the behavior is still incorrect.

cmurphy avatar Oct 03 '22 17:10 cmurphy

This also appears to affect rancher/rancher#39033. It would be best to have this fixed before we attempt to handle the ghost cluster issue from the backend side any further.

MbolotSuse avatar Oct 12 '22 19:10 MbolotSuse

Manually reopening as github bot has failed to do so

richard-cox avatar Dec 13 '22 11:12 richard-cox

2.6.x backport: https://github.com/rancher/dashboard/issues/7779

gaktive avatar Dec 21 '22 16:12 gaktive

Actual backport ticket: https://github.com/rancher/dashboard/issues/7707 (7779 closed as duplicate)

gaktive avatar Dec 21 '22 20:12 gaktive

I have retested this and, from my understanding of the issue and the merged PRs, it should be closed. I am not seeing repetitions of the same websocket IDs, and it seems that the stop -> start watch behavior now uses sensible version numbers.

Content of websocket messages before and after a stop: image

Websocket IDs after multiple timeouts: image

git-ival avatar Feb 07 '23 08:02 git-ival

I'm running Rancher 2.6.11 and I'm not convinced that the backported fix in #7719 has actually resolved the problem. We are still seeing surges of cattle-cluster-agent watch start/stop events: image

These surges are correlated with high CPU and memory utilization on kube-apiserver as well as high latency in the Rancher UI and kubectl commands.

skaven81 avatar Apr 18 '23 14:04 skaven81

More data collected -- the correlation looks too close to ignore. Here are the userAgent=agent/v0.0.0 API calls. The different colored lines are different Rancher users (so the purple line is a browser session that has gone haywire with the watch resubscriptions): image And here's the same timespan, but looking at the resource watch start/stop events in the cattle-cluster-agent debug log: image

This is all in Rancher 2.6.11

skaven81 avatar Apr 19 '23 16:04 skaven81

@cmurphy should I open a new issue about this, or can/should we reopen this one? Or should we abandon 2.6.x and try to upgrade to 2.7.2+?

skaven81 avatar Apr 19 '23 16:04 skaven81

@skaven81 apologies for the delayed response. It would be a good idea to upgrade to 2.7.3 but I don't know if it would help with this issue. The main fix to the watch subscription spam issue was actually in https://github.com/rancher/steve/pull/49 which went into 2.6.7, so this should have already been addressed.

In your graphs, it looks like there are predictable surges which resolve themselves in a regular pattern. In this original issue, the re-subscription attempts were increasing at a constant rate after the initial timeout, so I would expect the graphs to continue on a straight path until the pod crashes and the line would fall off a cliff to zero. I think the surges you are seeing are probably normal and the performance issues you are seeing on those intervals could be alleviated by adding more memory and CPU. If you disagree, I would recommend opening a new issue because the symptoms you've described don't fully match the original issue.

cmurphy avatar May 12 '23 22:05 cmurphy

In your graphs, it looks like there are predictable surges which resolve themselves in a regular pattern.

That's not "predictable surges" shown in the graph. The entire width of that zigzag line is a single event, not a series of short ones. The behavior we see is that a Rancher UI browser session will hit some sort of corner-case condition (probably something similar to what was fixed in rancher/steve#49 with watch subscriptions timing out), and at that point we see a huge spike of kube-apiserver calls from the cattle-cluster-agent using the agent/v0.0.0 userAgent string. The kube-apiserver calls have the "acting as" metadata associated, so we can see that the calls are being executed on behalf of the user whose browser has gone haywire. Interestingly, there is no additional traffic coming from the web browser -- the traffic is entirely (or mostly) between cattle-cluster-agent and kube-apiserver in the downstream cluster, not between the browser and Rancher, and not between Rancher and the cattle-cluster-agent.

The zigzags you see in the charts are zoomed in on one of these events. They show how the cattle-cluster-agent begins logging a series of watch stop/start events while simultaneously hammering the kube-apiserver in the same pattern.

In this original issue, the re-subscription attempts were increasing at a constant rate after the initial timeout, so I would expect the graphs to continue on a straight path until the pod crashes and the line would fall off a cliff to zero

Yeah that's definitely not what we see. It has always been this sudden increase of kube-apiserver traffic for a single user (associated with one browser session) which then holds steady until the user closes their browser or refreshes the tab, at which point I guess the watches get re-established and kube-apiserver traffic for that user drops back to a normal level.

I'll open a new case for this.

skaven81 avatar May 22 '23 22:05 skaven81

Thanks!

cmurphy avatar May 23 '23 16:05 cmurphy

@cmurphy I have created https://github.com/rancher/rancher/issues/41663 to track the other related issue.

skaven81 avatar May 25 '23 23:05 skaven81