etcd
etcd copied to clipboard
etcd memory usage spikes to unsustainable levels and OOMs
What happened?
Whenever I run my etcd container, its memory usage slowly goes up to meet the docker limit on my machine (~7.6 GB) and then it OOMs. I've investigated and believe this is because it has created a large amount of wal files, above the limit (5), and when it has to process these it breaks down.
I've attached etcd docker logs.
What did you expect to happen?
etcd would run normally.
How can we reproduce it (as minimally and precisely as possible)?
I had inserted lots of data and now when I launch etcd it seems it can simply not handle the saved .wal files.
Anything else we need to know?
ls -has on the etcd wal directory:
total 4.1G
4.0K . 62M 000000000000007b-0000000000050618.wal 62M 0000000000000092-0000000000056f14.wal
4.0K .. 62M 000000000000007c-0000000000050b41.wal 62M 0000000000000093-00000000000573a7.wal
62M 0.tmp 62M 000000000000007d-0000000000051024.wal 62M 0000000000000094-0000000000057988.wal
62M 0000000000000067-0000000000049181.wal 62M 000000000000007e-000000000005160d.wal 62M 0000000000000095-0000000000057d93.wal
62M 0000000000000068-00000000000497f2.wal 62M 000000000000007f-0000000000051baf.wal 62M 0000000000000096-0000000000058198.wal
62M 0000000000000069-0000000000049deb.wal 62M 0000000000000080-00000000000520fa.wal 62M 0000000000000097-00000000000585b8.wal
62M 000000000000006a-000000000004a37d.wal 62M 0000000000000081-0000000000052873.wal 62M 0000000000000098-00000000000589c5.wal
62M 000000000000006b-000000000004aa3e.wal 62M 0000000000000082-0000000000052e7d.wal 62M 0000000000000099-0000000000058d89.wal
62M 000000000000006c-000000000004af7c.wal 62M 0000000000000083-000000000005333f.wal 62M 000000000000009a-000000000005913b.wal
62M 000000000000006d-000000000004b58c.wal 62M 0000000000000084-000000000005380b.wal 62M 000000000000009b-000000000005948f.wal
62M 000000000000006e-000000000004bae9.wal 62M 0000000000000085-0000000000053c20.wal 62M 000000000000009c-0000000000059815.wal
62M 000000000000006f-000000000004c0d3.wal 62M 0000000000000086-0000000000054004.wal 62M 000000000000009d-0000000000059c02.wal
62M 0000000000000070-000000000004c6af.wal 62M 0000000000000087-00000000000543ef.wal 62M 000000000000009e-000000000005a010.wal
62M 0000000000000071-000000000004ccb6.wal 62M 0000000000000088-00000000000547fe.wal 62M 000000000000009f-000000000005a3c1.wal
62M 0000000000000072-000000000004d2e7.wal 62M 0000000000000089-0000000000054b58.wal 62M 00000000000000a0-000000000005a7a5.wal
62M 0000000000000073-000000000004d82a.wal 62M 000000000000008a-0000000000054f52.wal 62M 00000000000000a1-000000000005ab92.wal
62M 0000000000000074-000000000004de82.wal 62M 000000000000008b-00000000000552a1.wal 62M 00000000000000a2-000000000005b000.wal
62M 0000000000000075-000000000004e429.wal 62M 000000000000008c-00000000000555ed.wal 62M 00000000000000a3-000000000005b37c.wal
62M 0000000000000076-000000000004e9d9.wal 62M 000000000000008d-00000000000559a6.wal 62M 00000000000000a4-000000000005b744.wal
62M 0000000000000077-000000000004ef8e.wal 62M 000000000000008e-0000000000055d8f.wal 62M 00000000000000a5-000000000005bb88.wal
62M 0000000000000078-000000000004f5db.wal 62M 000000000000008f-0000000000056252.wal 62M 00000000000000a6-000000000005bf4b.wal
62M 0000000000000079-000000000004fb26.wal 62M 0000000000000090-000000000005666c.wal 62M 00000000000000a7-000000000005c32d.wal
62M 000000000000007a-00000000000500bc.wal 62M 0000000000000091-0000000000056a5b.wal 62M 00000000000000a8-000000000005c6d5.wal
Etcd version (please run commands below)
etcd v3.5.0 from docker
Etcd configuration (command line flags or environment variables)
- ETCD_AUTO_COMPACTION_MODE=revision
- ETCD_AUTO_COMPACTION_RETENTION=1000
- ETCD_QUOTA_BACKEND_BYTES=4294967296
- ETCD_ENABLE_PPROF=true
- ETCD_SNAPSHOT_COUNT=50000
- ETCD_MAX_SNAPSHOTS=2
- ETCD_MAX_WALS=5
command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
Etcd debug information (please run commands blow, feel free to obfuscate the IP address or FQDN in the output)
I cannot run these commands because etcd hangs and crashes.
Relevant log output
No response
The OOM is a known issue, the workaroud AFAIK are:
- perform compaction & defragmentation to reclaim the db file space;
- avoid big range operation on the client side.
With respect to the WAL files, it looks like an issue because the max WAL file is 5, but there are lots of WAL files. It means that the old WAL files failed to be purged. I checked the log file you attached, but did not see anything useful, the reason should be you attachd isn't the complete log. Please try to reproduce the issue and attached the complete log if possible.
Can I just delete the WAL files?
Can I just delete the WAL files?
It isn't recommended to manually delete the WAL files, otherwise the WAL files may not be matching the snap files. Please try to reproduce the issue and provide complete log. If you are interested, please try to figure out why etcd failed to purge the WAL files automatically.
When I try getting debug info:
halcyon@espial:~/milvus$ sudo docker exec milvus-etcd etcdctl member list -w table
{"level":"warn","ts":"2022-08-27T05:34:10.842Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002f2000/#initially=[127.0.0.1:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
When I try to get the logs:
{"level":"info","ts":1661578407.7117872,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_AUTO_COMPACTION_MODE","variable-value":"revision"}
{"level":"info","ts":1661578407.722684,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_AUTO_COMPACTION_RETENTION","variable-value":"1000"}
{"level":"info","ts":1661578407.7227707,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_ENABLE_PPROF","variable-value":"true"}
{"level":"info","ts":1661578407.7228498,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_MAX_SNAPSHOTS","variable-value":"2"}
{"level":"info","ts":1661578407.7228956,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_MAX_WALS","variable-value":"5"}
{"level":"info","ts":1661578407.7229593,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_QUOTA_BACKEND_BYTES","variable-value":"4294967296"}
{"level":"info","ts":1661578407.7230117,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_SNAPSHOT_COUNT","variable-value":"50000"}
{"level":"info","ts":"2022-08-27T05:33:27.723Z","caller":"etcdmain/etcd.go:72","msg":"Running: ","args":["etcd","-advertise-client-urls=http://127.0.0.1:2379","-listen-client-urls","http://0.0.0.0:2379","--data-dir","/etcd"]}
{"level":"info","ts":"2022-08-27T05:33:27.727Z","caller":"etcdmain/etcd.go:115","msg":"server has been already initialized","data-dir":"/etcd","dir-type":"member"}
{"level":"info","ts":"2022-08-27T05:33:27.727Z","caller":"embed/etcd.go:131","msg":"configuring peer listeners","listen-peer-urls":["http://localhost:2380"]}
{"level":"info","ts":"2022-08-27T05:33:27.738Z","caller":"embed/etcd.go:139","msg":"configuring client listeners","listen-client-urls":["http://0.0.0.0:2379"]}
{"level":"info","ts":"2022-08-27T05:33:27.738Z","caller":"embed/etcd.go:598","msg":"pprof is enabled","path":"/debug/pprof"}
{"level":"info","ts":"2022-08-27T05:33:27.740Z","caller":"embed/etcd.go:307","msg":"starting an etcd server","etcd-version":"3.5.0","git-sha":"946a5a6f2","go-version":"go1.16.3","go-os":"linux","go-arch":"amd64","max-cpu-set":2,"max-cpu-available":2,"member-initialized":true,"name":"default","data-dir":"/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":50000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://localhost:2380"],"listen-peer-urls":["http://localhost:2380"],"advertise-client-urls":["http://127.0.0.1:2379"],"listen-client-urls":["http://0.0.0.0:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":4294967296,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"revision","auto-compaction-retention":"1µs","auto-compaction-interval":"1µs","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
{"level":"warn","ts":1661578407.740217,"caller":"fileutil/fileutil.go:57","msg":"check file permission","error":"directory \"/etcd\" exist, but the permission is \"drwxr-xr-x\". The recommended permission is \"-rwx------\" to prevent possible unprivileged access to the data"}
{"level":"info","ts":"2022-08-27T05:33:30.958Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/etcd/member/snap/db","took":"3.207592575s"}
{"level":"info","ts":"2022-08-27T05:34:47.258Z","caller":"etcdserver/server.go:508","msg":"recovered v2 store from snapshot","snapshot-index":300003,"snapshot-size":"7.1 kB"}
{"level":"info","ts":"2022-08-27T05:34:47.258Z","caller":"etcdserver/server.go:518","msg":"recovered v3 backend from snapshot","backend-size-bytes":123691008,"backend-size":"124 MB","backend-size-in-use-bytes":68882432,"backend-size-in-use":"69 MB"}
I'm trying to get more info but etcd hogs memory and I can't even use my system properly. Does this help at all? Is there any way I can just reset it so it keeps existing data and ignores the problematic stuff?
The huge memory usage might be caused by the db file size.
What's the size of the db file, which locates in ${DATA_DIR}/member/snap/db
? Please try to perform compaction + defragmentation per guide https://etcd.io/docs/v3.5/op-guide/maintenance/#space-quota
I just added more log for debugging the reason why etcd fails to purge WAL file.
https://github.com/etcd-io/etcd/pull/14576
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.