
[v2 API] Unexplained OOM issues observed on selective ETCD clusters running 3.4.x

Open harshanarayana opened this issue 2 years ago • 7 comments

Bug report criteria

What happened?

We have a single-node Kubernetes cluster where we also schedule workloads on the control plane, since these clusters run on on-prem appliances. One such single-node Kubernetes deployment recently started running into ETCD OOM kills out of nowhere.

Our clusters run Kubernetes 1.18.x and ETCD 3.4.x and are not heavy on custom resources/operators. Keeping this in mind, we had restricted the memory capacity of the ETCD static pod to 3G. On one such cluster, we have been seeing an odd increase in memory and consistent OOM kills over the past few weeks/months.

Here are a few more details about the clusters we have.

  1. We are still using ETCD V2 endpoints to store non-Kubernetes data/configuration on the cluster
  2. We are using a python ETCD client to interact with ETCD outside of Kubernetes
  3. The usual data size for each key we store under ETCD V2 is < 10k (average)
    1. There are certain outliers where a few keys are nearly 1M in size.
  4. Endpoint status shows the DB size to be around 55M on average
    1. This was later reduced to roughly 15M on average after a defrag was performed on the cluster (see the sketch after this list)
  5. The cluster is backed by an appliance with an SSD-backed disk providing nearly 750Mbps of disk IO
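
For reference, this is roughly how the defrag mentioned in point 4.1 was run (a hedged sketch, not the exact invocation; the endpoint and certificate paths are taken from our static pod manifest further below):

ETCDCTL_API=3 etcdctl --endpoints=https://1.2.3.4:4001 \
  --cacert=/ephemeral-volume-mount/ca.crt \
  --cert=/ephemeral-volume-mount/etcd-client.pem \
  --key=/ephemeral-volume-mount/etcd-client-key.pem \
  defrag
# Confirm the reclaimed DB size afterwards
ETCDCTL_API=3 etcdctl --endpoints=https://1.2.3.4:4001 \
  --cacert=/ephemeral-volume-mount/ca.crt \
  --cert=/ephemeral-volume-mount/etcd-client.pem \
  --key=/ephemeral-volume-mount/etcd-client-key.pem \
  endpoint status -w table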

Once we started seeing OOM kills on the cluster, we stopped the script responsible for triggering them and bumped the memory allocated to ETCD up to around 8G. This did not help much: once the pod came back up and was primed to serve requests, usage quickly shot up to around 7.7G and it was OOM killed soon after. We then bumped this up to 16G to get some stability in the system until we sorted out what the issue was and how to fix it. The cluster then stabilized somewhat, with utilization capped near the upper threshold of the memory limit but without further OOMs on the cluster.

Container names are altered to keep some data private in this bug report.

CONTAINER ID        NAME                                                                          CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
0516010fa85b        k8s_etcd_etcd-1.2.3.4_kube-system_9f57657c7a717d603d0a8b5af88208c3_0   32.11%              6.467GiB / 8GiB     80.84%              0B / 0B             460MB / 35.3GB      53

--- After restart
CONTAINER ID        NAME                                                                        CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
2bf0708be8ae        k8s_etcd_etcd-1.2.3.4_kube-system_66007cabb951ff846a5d56bf87ce9566_0   0.82%               7.767GiB / 8GiB     97.09%              0B / 0B             1.28GB / 433MB      52
CONTAINER ID        NAME                                                                        CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
2bf0708be8ae        k8s_etcd_etcd-1.2.3.4_kube-system_66007cabb951ff846a5d56bf87ce9566_0   2.28%               7.767GiB / 8GiB     97.09%              0B / 0B             1.28GB / 433MB      52
CONTAINER ID        NAME                                                                        CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
2bf0708be8ae        k8s_etcd_etcd-1.2.3.4_kube-system_66007cabb951ff846a5d56bf87ce9566_0   2.28%               7.767GiB / 8GiB     97.09%              0B / 0B             1.28GB / 433MB      52
CONTAINER ID        NAME                                                                        CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
2bf0708be8ae        k8s_etcd_etcd-1.2.3.4_kube-system_66007cabb951ff846a5d56bf87ce9566_0   0.65%               7.767GiB / 8GiB     97.09%              0B / 0B             1.28GB / 433MB      52 

We performed a few pprof checks, which showed the following:

(pprof) top
Showing nodes accounting for 3932.81MB, 99.50% of 3952.39MB total
Dropped 48 nodes (cum <= 19.76MB)
Showing top 10 nodes out of 35
      flat  flat%   sum%        cum   cum%
 2256.76MB 57.10% 57.10%  2256.76MB 57.10%  bytes.makeSlice
 1199.99MB 30.36% 87.46%  1199.99MB 30.36%  go.etcd.io/etcd/raft/raftpb.(*Entry).Unmarshal
  476.06MB 12.04% 99.50%   476.06MB 12.04%  go.etcd.io/etcd/etcdserver/etcdserverpb.(*Request).Unmarshal
         0     0% 99.50%  2256.76MB 57.10%  bytes.(*Buffer).WriteString
         0     0% 99.50%  2256.76MB 57.10%  bytes.(*Buffer).grow
         0     0% 99.50%  2256.76MB 57.10%  encoding/json.(*encodeState).marshal
         0     0% 99.50%  2256.76MB 57.10%  encoding/json.(*encodeState).reflectValue
         0     0% 99.50%  2256.76MB 57.10%  encoding/json.(*encodeState).string
         0     0% 99.50%  2256.76MB 57.10%  encoding/json.Marshal
         0     0% 99.50%  2256.76MB 57.10%  encoding/json.arrayEncoder.encode
(pprof)
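
This is roughly how we capture such a heap profile (a hedged sketch: it assumes the member is started with --enable-pprof, which exposes /debug/pprof/ on the client URL, and reuses the client certificates from the manifest below; heap.out is just an arbitrary local filename):

curl -sk \
  --cert /ephemeral-volume-mount/etcd-client.pem \
  --key /ephemeral-volume-mount/etcd-client-key.pem \
  https://1.2.3.4:4001/debug/pprof/heap -o heap.out
go tool pprof -top heap.out   # prints a table like the one above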

While looking through GitHub issues under ETCD, we ran into the following:

https://github.com/etcd-io/etcd/issues/13918

#13918 claims that some of the OOM issues were fixed in ETCD 3.4.21+, so we went about updating an internal cluster to 3.4.27, the latest available version on the 3.4.x release train. There we ran into something slightly more severe (we think it is much more severe).

In order to evaluate the OOM behavior with the new version of ETCD, we started with the same upper limit on memory (2G) and manually ran a command to simulate an OOM by writing an ETCD V2 key of 1M in size in a loop (see the reproduction steps below). This OOMed ETCD in almost no time. Once ETCD started getting OOM killed, we bumped the resource up to 16G and noticed that it was still getting OOM killed. Since we had a really large cluster to experiment with, we then removed all limits on the static pod, which gave it 275G of memory at its disposal.

root@maglev-master-100-0-0-100:/tmp/build# docker stats 2800fabc7a79 --no-stream
CONTAINER ID        NAME                                                                       CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
2800fabc7a79        k8s_etcd_etcd-1.2.3.4_kube-system_9f423fc93a7785eaf94df64a8b295f0c_2   6.63%               61.28GiB / 273.5GiB   22.40%              0B / 0B             3.31MB / 10.3GB     60
root@maglev-master-100-0-0-100:/tmp/build# 

After a while (I did not track exactly how long, but it might have been around 15-20 minutes), without us having to do anything more, the usage started reducing and came down to nearly 9G.

The following was the state of affairs on the disk while the above behavior was being observed on the cluster.

-rw------- 1 root root 62M Aug 31 08:29 /data/etcd/member/wal/0000000000000213-00000000034b5124.wal
-rw------- 1 root root 62M Aug 31 08:27 /data/etcd/member/wal/0000000000000212-00000000034acc7d.wal
-rw------- 1 root root 62M Aug 31 08:30 /data/etcd/member/wal/0000000000000218-00000000034b5532.wal
-rw------- 1 root root 62M Aug 31 08:30 /data/etcd/member/wal/0000000000000216-00000000034b5413.wal
-rw------- 1 root root 62M Aug 31 09:55 /data/etcd/member/wal/0.tmp
-rw------- 1 root root 62M Aug 31 08:31 /data/etcd/member/wal/000000000000021a-00000000034b5645.wal
-rw------- 1 root root 62M Aug 31 08:31 /data/etcd/member/wal/0000000000000219-00000000034b55bb.wal
-rw------- 1 root root 62M Aug 31 08:31 /data/etcd/member/wal/000000000000021b-00000000034b56d3.wal
-rw------- 1 root root 62M Aug 31 08:30 /data/etcd/member/wal/0000000000000215-00000000034b5386.wal
-rw------- 1 root root 62M Aug 31 08:32 /data/etcd/member/wal/1.tmp
-rw------- 1 root root 62M Aug 31 08:30 /data/etcd/member/wal/0000000000000217-00000000034b54a3.wal
-rw------- 1 root root 62M Aug 31 08:29 /data/etcd/member/wal/0000000000000214-00000000034b52c9.wal
-rw------- 1 root root 62M Aug 31 08:32 /data/etcd/member/wal/000000000000021c-00000000034b5762.wal
-rw------- 1 root root 62M Aug 31 10:06 /data/etcd/member/wal/000000000000021d-00000000034b57f1.wal
-rw-r--r-- 1 root root 3.1M Aug 31 09:55 /data/etcd/member/snap/0000000000000015-00000000034ac901.snap
-rw-r--r-- 1 root root 3.1M Aug 31 09:55 /data/etcd/member/snap/0000000000000016-00000000034b500f.snap
-rw-r--r-- 1 root root 3.1M Aug 31 09:55 /data/etcd/member/snap/0000000000000015-00000000034b2615.snap
-rw-r--r-- 1 root root 3.1M Aug 31 09:55 /data/etcd/member/snap/0000000000000015-00000000034af76a.snap
-rw------- 1 root root 5.7M Aug 31 10:06 /data/etcd/member/snap/db
-rw-r--r-- 1 root root 2.6G Aug 31 10:06 /data/etcd/member/snap/0000000000000017-00000000034b7e3a.snap
-rw-r--r-- 1 root root 3.1M Aug 31 05:36 /data/etcd/member/snap/0000000000000015-00000000034a9a61.snap
-rw-r--r-- 1 root root 1.3G Aug 31 10:06 /data/etcd/member/snap/0000000000000017-00000000034babff.snap
-rw------- 1 root root 62M Aug 31 09:55 /data/etcd/member/wal/0.tmp
-rw------- 1 root root 62M Aug 31 08:31 /data/etcd/member/wal/000000000000021a-00000000034b5645.wal
-rw------- 1 root root 62M Aug 31 08:31 /data/etcd/member/wal/0000000000000219-00000000034b55bb.wal
-rw------- 1 root root 62M Aug 31 08:31 /data/etcd/member/wal/000000000000021b-00000000034b56d3.wal
-rw------- 1 root root 62M Aug 31 08:32 /data/etcd/member/wal/1.tmp
-rw------- 1 root root 62M Aug 31 08:32 /data/etcd/member/wal/000000000000021c-00000000034b5762.wal
-rw------- 1 root root 62M Aug 31 10:13 /data/etcd/member/wal/000000000000021d-00000000034b57f1.wal
-rw-r--r-- 1 root root 3.1M Aug 31 09:55 /data/etcd/member/snap/0000000000000016-00000000034b500f.snap
-rw-r--r-- 1 root root 3.1M Aug 31 09:55 /data/etcd/member/snap/0000000000000015-00000000034b2615.snap
-rw-r--r-- 1 root root 3.1M Aug 31 09:55 /data/etcd/member/snap/0000000000000015-00000000034af76a.snap
-rw------- 1 root root 5.7M Aug 31 10:13 /data/etcd/member/snap/db
-rw-r--r-- 1 root root 5.2G Aug 31 10:06 /data/etcd/member/snap/0000000000000017-00000000034b7e3a.snap
-rw-r--r-- 1 root root 5.2G Aug 31 10:06 /data/etcd/member/snap/0000000000000017-00000000034babff.snap
root@maglev-master-100-0-0-100:/tmp/build# export ETCDCTL_API=3
root@maglev-master-100-0-0-100:/tmp/build# etcdctl endpoint status -w table --cluster
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://1.2.3.4:4001 | c72edf3479ac4a35 |  3.4.27 |  6.0 MB |      true |      false |        39 |   55294523 |           55294523 |        |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
root@maglev-master-100-0-0-100:/tmp/build# 

We tried tuning the following components, without much luck in terms of better behavior.

  1. Set the --snapshot-count flag to 10000
  2. Enabled auto compaction in periodic mode with a 30m retention.

This gave us slightly better behavior but no real improvement. The flags as we applied them are sketched below.
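
This is how we understand those settings map onto the etcd command line (a sketch of the flags appended to the static pod command; values as described above, "30m periodic mode" read as periodic compaction with a 30m retention):

/usr/local/bin/etcd <existing flags from the manifest below> \
  --snapshot-count=10000 \
  --auto-compaction-mode=periodic \
  --auto-compaction-retention=30m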

Then we stumbled upon https://github.com/etcd-io/etcd/issues/12548, which explains the implication of DefaultSnapshotCatchUpEntries being set to 5000 and how it impacts a cluster when the data stored in ETCD keys is large and revisioned. However, in our cluster, other than the one outlier key mentioned above, there is nothing of considerable size.

We went and built a custom version of the ETCD binary with the catch-up entries changed from 5k to 1k to see if the behavior would be any different, but we did not get any better results. A sketch of how the patched binary was built follows.
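
Roughly, the patched binary was produced like this (a hedged sketch: it assumes DefaultSnapshotCatchUpEntries still lives in etcdserver/server.go on the 3.4.x branch, which is where we found it; exact paths may differ):

git clone -b v3.4.27 https://github.com/etcd-io/etcd.git && cd etcd
# Drop the hard-coded catch-up entries from 5000 to 1000
sed -i 's/DefaultSnapshotCatchUpEntries uint64 = 5000/DefaultSnapshotCatchUpEntries uint64 = 1000/' etcdserver/server.go
./build   # produces bin/etcd, which we swapped into our etcd image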

I have attached a few pprof captures to this issue, and we can share more details/information as required by the community.

Aug_11-14-48-42_heap.tar(1).gz Aug_11-14-48-48_profile.tar(1).gz heap.tar(1).gz profile.tar(1).gz profile001.pdf profile002.pdf

What did you expect to happen?

  1. ETCD should have predictable OOM behavior that can be characterized using certain criteria, so that we can allocate resources for cluster workloads better. (We just want to understand whether there is a way to identify this, and what exactly that mechanism is.)
  2. Once OOM'ed, ETCD should be able to recover gracefully on restart of the container without running into further OOMs, provided the trigger that caused the initial OOM is no longer running.
  3. It would be really nice to make the catch-up entries configurable so that we can tune them based on our requirements and cluster performance metrics.

How can we reproduce it (as minimally and precisely as possible)?

while [ 1 ]; do dd if=/dev/urandom bs=1M count=1 | ETCDCTL_API=2 etcdctl set /tmp/config/keyTest > /dev/null || break; done

while [ 1 ]; do dd if=/dev/urandom bs=1M count=1 | ETCDCTL_API=3 etcdctl put key || break; done

Anything else we need to know?

  1. When constantly running a loop of writes to the same key with a smaller value, the memory usage eventually comes down, at a slower rate (see the monitoring sketch after this list): while [ 1 ]; do dd if=/dev/urandom bs=1 count=1 | ETCDCTL_API=2 etcdctl set /tmp/config/keyTest > /dev/null || break; done
  2. We have, on average, around 200 ETCD V2 keys.
  3. We were able to see similar behavior with ETCD 3.5 as well, using the reproduction steps mentioned above.
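
While the loops above run, memory growth can also be tracked without pprof (a hedged sketch; the metrics port matches the --listen-metrics-urls setting in the manifest below, and the metric names are the standard Go/Prometheus ones etcd exposes):

watch -n 5 'curl -s http://localhost:2323/metrics | grep -E "process_resident_memory_bytes|go_memstats_heap_inuse_bytes"'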

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.4.27
Git SHA: c92fb80f3
Go Version: go1.19.10
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.4.27
API version: 2

Etcd configuration (command line flags or environment variables)

apiVersion: v1
kind: Pod
metadata:
  name: etcd
  labels:
    k8s-app: etcd
  namespace: kube-system
spec:
  shareProcessNamespace: true
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
  containers:
  - name: etcd
    image: registry.images.svc.cluster.local:5000/etcd:3.4.27
    livenessProbe:
      httpGet:
        path: /health
        port: 2323
        scheme: HTTP
      initialDelaySeconds: 15
      failureThreshold: 40
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    command:
    - "/usr/local/bin/etcd"
    - "--data-dir=/var/lib/etcd"
    - "--advertise-client-urls=https://1.2.3.4:4001"
    - "--listen-client-urls=https://1.2.3.4:4001,https://localhost:4001"
    - "--listen-peer-urls=https://1.2.3.4:4002,https://localhost:4002"
    - "--initial-advertise-peer-urls=https://1.2.3.4:4002"
    - "--initial-cluster=etcd-1.2.3.4=https://1.2.3.4:4002"
    - "--initial-cluster-state=new"
    - "--name=etcd-1.2.3.4"
    - "--election-timeout=5000"
    - "--enable-v2"
    - "--peer-cert-allowed-cn=etcd-peer"
    - "--cert-file=/ephemeral-volume-mount/etcd-client.pem"
    - "--key-file=/ephemeral-volume-mount/etcd-client-key.pem"
    - "--trusted-ca-file=/ephemeral-volume-mount/ca.crt"
    - "--peer-trusted-ca-file=/ephemeral-volume-mount/ca.crt"
    - "--peer-cert-file=/ephemeral-volume-mount/etcd-peer.pem"
    - "--peer-key-file=/ephemeral-volume-mount/etcd-peer-key.pem"
    - "--peer-client-cert-auth=true"
    - "--client-cert-auth=true"
    - "--listen-metrics-urls=http://localhost:2323,http://1.2.3.4:2323"
    - "--log-level=debug"
    resources:
      requests:
        cpu: 1500m
        memory: 3Gi
    securityContext:
      privileged: true
    ports:
    - name: client
      containerPort: 4001
      hostPort: 4001
    - name: metrics
      containerPort: 2323
      hostPort: 2323
    volumeMounts:
    - mountPath: /var/lib/etcd
      name: varetcd
    - mountPath: /etc/certs/.pki
      name: certs
      readOnly: true
    - mountPath: /ephemeral-volume-mount
      name: ephemeral-volume-mount
  volumes:
  - name: "varetcd"
    hostPath:
      path: "/data/etcd"
  - name: certs
    hostPath:
      path: "/etc/certs/.pki"
  - name: ephemeral-volume-mount
    emptyDir:
      medium: Memory
  - name: key-manager-socket
    hostPath:
      path: /var/run

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
+------------------+---------+-------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |       NAME        |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-------------------+---------------------------+---------------------------+------------+
| c6b0f07addff2d04 | started | etcd-1.2.3.4 | https://1.2.3.4:4002 | https://1.2.3.4:4001 |      false |
+------------------+---------+-------------------+---------------------------+---------------------------+------------+



$ etcdctl --endpoints=<member list> endpoint status -w table
root@maglev-master-100-0-0-100:/tmp/build# etcdctl endpoint status -w table --cluster
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://1.2.3.4:4001 | c6b0f07addff2d04 |  3.4.27 |  6.0 MB |      true |      false |        39 |   55294523 |           55294523 |        |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Relevant log output

No response

harshanarayana avatar Sep 15 '23 13:09 harshanarayana

cc @rajivginotra

harshanarayana avatar Sep 15 '23 13:09 harshanarayana

v2 store is deprecated, could you try the new v3 backend storage by using gRPC API in the etcd client SDK?

ahrtr avatar Sep 15 '23 13:09 ahrtr

We see such things very often when operators misbehave:

      flat  flat%   sum%        cum   cum%
 2256.76MB 57.10% 57.10%  2256.76MB 57.10%  bytes.makeSlice
 1199.99MB 30.36% 87.46%  1199.99MB 30.36%  go.etcd.io/etcd/raft/raftpb.(*Entry).Unmarshal
  476.06MB 12.04% 99.50%   476.06MB 12.04%  go.etcd.io/etcd/etcdserver/etcdserverpb.(*Request).Unmarshal

I think the most common case is something similar to kubectl get pods --all-namespaces, i.e. listing many resources at once. This doesn't necessarily explain your leakage, but it does explain why such a large amount of memory is consumed.

tjungblu avatar Sep 15 '23 13:09 tjungblu

v2 store is deprecated, could you try the new v3 backend storage by using gRPC API in the etcd client SDK?

Unfortunately, we can't switch to that. There are workflows we have built and dependencies that make us stick to the V2 APIs. That said, we see similar behavior with V3 keys as well @ahrtr

harshanarayana avatar Sep 15 '23 14:09 harshanarayana

We see such things very often when operators misbehave:

On the cluster where we see this, we do not have major operators deployed. There are a couple of them (3 to be specific), but most of them are dormant until they have actual work to process. @tjungblu

harshanarayana avatar Sep 15 '23 14:09 harshanarayana

I've hit this problem on 3.5.6: the DB size is 50MB, no keys have large values, the watcher count is small, disk sync is fast, CPU usage is low, and the etcd member that keeps OOMing is a follower. Inspecting the pprof heap, most of the memory is consumed by JSON encoding; the top node is json ptrEncoder encode. The following is one of the calling stacks (attached screenshot): 869501801c2ccf16bb86f15c97a129f

EchoUtopia avatar Feb 08 '24 07:02 EchoUtopia

We solved this problem on etcd v3.5.6 by setting the following environment variable on the container:

  - name: "GOGC"
    value: "30"
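    # GOGC=30 makes the Go runtime garbage-collect more aggressively than the default of 100, trading some CPU for a smaller heap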

Fish-pro avatar Feb 18 '24 08:02 Fish-pro