Restic upgrade causes huge kube-apiserver memory usage on K8S controller nodes

Open fabiorauber opened this issue 3 years ago • 2 comments

What steps did you take and what happened: I upgraded Velero from 1.8 to 1.9 using the Helm chart, in a 71-node cluster running Kubernetes 1.20.9. The chart schedules a Restic pod on every node in the cluster. When the Restic pods started, I saw massive RAM usage on the controller nodes (32 GB RAM each). The culprit was kube-apiserver, which was OOM-killed by the OS. The only way to stop this behavior was to remove all Restic pods. Increasing RAM on these nodes to 128 GB only made the OOM kills less frequent.

What did you expect to happen: A smooth upgrade and everything working nicely.

The following information will help us better understand what's going on:

Digging into the problem, I found that my cluster has 60K+ PodVolumeBackup objects and that Restic tries to list them all as soon as it starts, as the following extract from my kube-apiserver log shows:

I0728 13:36:07.043675       1 trace.go:205] Trace[1738354076]: "List etcd3" key:/velero.io/podvolumebackups,resourceVersion:,resourceVersionMatch:,limit:0,continue: (28-Jul-2022 13:35:23.819) (total time: 43224ms):
Trace[1738354076]: [43.224615306s] [43.224615306s] END
I0728 13:36:07.066385       1 trace.go:205] Trace[1022876977]: "List etcd3" key:/velero.io/podvolumebackups,resourceVersion:,resourceVersionMatch:,limit:0,continue: (28-Jul-2022 13:35:22.817) (total time: 44248ms):
Trace[1022876977]: [44.248811952s] [44.248811952s] END
I0728 13:36:07.155752       1 trace.go:205] Trace[639859312]: "List etcd3" key:/velero.io/podvolumebackups,resourceVersion:,resourceVersionMatch:,limit:0,continue: (28-Jul-2022 13:35:23.979) (total time: 43176ms):
E0728 13:36:23.706313       1 wrap.go:54] timeout or abort while handling: GET "/apis/velero.io/v1/podvolumebackups"
E0728 13:36:23.709573       1 writers.go:107] apiserver was unable to write a JSON response: http: Handler timeout
E0728 13:36:23.709607       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E0728 13:36:23.779892       1 writers.go:120] apiserver was unable to write a fallback JSON response: http: Handler timeout
I0728 13:36:23.781103       1 trace.go:205] Trace[2035573292]: "List" url:/apis/velero.io/v1/podvolumebackups,user-agent:restic-server/v1.9.0 (linux/amd64) 6021f148c4d7721285e815a3e1af761262bff029-dirty,client:172.31.21.23 (28-Jul-2022 13:35:22.807) (total time: 60973ms):

Eventually, these long-running requests (over 40 seconds each in the extract above) turn into timeouts, and the problem escalates until all memory on the controller node is exhausted. The controller nodes are backed by SSD drives.

Anything else you would like to add:

To solve this problem, the restic pod could request PodVolumeBackup objects in a paginated fashion (if it is not doing it already), or wait a random number of seconds before requesting them.
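To make the two suggestions concrete, here is a minimal sketch of both ideas. It is not Velero or client-go code. The names `list_in_chunks`, `startup_jitter`, and `make_fake_server` are hypothetical, and the fake server is an in-memory stand-in for the API server. It only mimics the general `limit`/`continue` style of chunked listing that the Kubernetes API exposes; whether the Restic pod's initial list can actually use it is a separate question.

```python
import random
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical page-fetching signature: (limit, continue_token) -> (items, next_token).
ListPage = Callable[[int, Optional[str]], Tuple[List[str], Optional[str]]]


def list_in_chunks(list_page: ListPage, limit: int = 500) -> Iterator[str]:
    """Yield every item while requesting at most `limit` items per call."""
    token: Optional[str] = None
    while True:
        items, token = list_page(limit, token)
        yield from items
        if not token:  # a missing token marks the last page
            return


def startup_jitter(max_seconds: float, rng: random.Random) -> float:
    """Pick a random delay in [0, max_seconds] so pods do not all list at once."""
    return rng.uniform(0.0, max_seconds)


def make_fake_server(total: int):
    """Build an in-memory stand-in for the API server (illustration only)."""
    objects = [f"pvb-{i}" for i in range(total)]
    calls: List[int] = []  # records how many items each request returned

    def list_page(limit: int, token: Optional[str]):
        start = int(token) if token else 0
        end = min(start + limit, total)
        calls.append(end - start)
        next_token = str(end) if end < total else None
        return objects[start:end], next_token

    return list_page, calls
```

With this shape, no single request has to return more than `limit` objects, and a caller could sleep for `startup_jitter(...)` seconds before the first request so that 71 pods do not hit the API server at the same instant.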

Environment:

  • Velero version (use velero version): v1.9.0
  • Kubernetes version (use kubectl version): 1.20.9
  • Kubernetes installer & version: Rancher 2.5 RKE 1
  • Cloud provider or hardware configuration: VMware vSphere
  • OS (e.g. from /etc/os-release): Ubuntu 20.04.2 LTS

fabiorauber avatar Jul 28 '22 14:07 fabiorauber

After investigating, it seems that pagination of the initial listing isn't supported by the Kubernetes API server yet, and that the timeout is caused by the compression, according to the comment.

There is no easy fix on the Velero side at this moment. Although we could disable the compression on the Velero side, this isn't a reasonable solution because it might increase network bandwidth consumption. I'm going to remove this issue from the 1.10 scope.

@fabiorauber, in your case, is it possible to clean up some unneeded Backups in your environment to avoid this issue? After the Backups are removed, the related PodVolumeBackups (PVBs) will also be removed.

By the way, why does your cluster have so many PVBs? How do you use Velero to do backups?

ywk253100 avatar Sep 09 '22 02:09 ywk253100

The related upstream issue: https://github.com/kubernetes/kubernetes/issues/108003

ywk253100 avatar Sep 09 '22 02:09 ywk253100

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Nov 12 '22 01:11 stale[bot]

Closing the stale issue.

stale[bot] avatar Nov 26 '22 17:11 stale[bot]