Memory usage in concurrency package Mutex
Hi! I'm running an etcd 3.3.9 cluster in Google Kubernetes Engine, managed by etcd-operator. The cluster is used solely for locking; at any time there are roughly five locks, and they tend to be short-lived. There is no long-term data.
We run auto-compaction every 10 minutes; I can see it happening in the logs.
Memory graphs of the cluster show memory use increasing in a stair-step pattern.
I'm not really sure what I can best provide to help diagnose whether this is a configuration error on my side or an etcd bug.
I've started putting info in https://github.com/directionless/etcd-locking-memory-leak. My test server logs are in the logs directory there.
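For reference, each lock is taken with the clientv3 concurrency package. The pattern is roughly this (a minimal sketch, not the exact code; the endpoint, lock key, and session TTL are placeholders, and the 3.3-era import paths are used):

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
	"github.com/coreos/etcd/clientv3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Each lock lives on its own session (lease); closing the session
	// releases the lock even if Unlock is never reached.
	sess, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()

	m := concurrency.NewMutex(sess, "/locks/example")
	if err := m.Lock(context.TODO()); err != nil {
		log.Fatal(err)
	}
	// ... short-lived critical section ...
	if err := m.Unlock(context.TODO()); err != nil {
		log.Fatal(err)
	}
}
```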
Production graphs:
Test case graphs:
Did the memory usage keep increasing or level off?
My test case was hammering my laptop, so I stopped it. I assume it would have kept growing. I'll try to get that into k8s and let it run for a day.
It's a bit harder for me to say what production is doing. It has sort of leveled off, but I don't know what the next day will bring. And, TBH, 380 MB per node to hold 4 locks seems rather excessive.
I got my test code running in Kubernetes about 1.5 hours ago, and since then the test cluster's memory usage has been increasing. The nature of that code is that there should only ever be 10 active locks and no stored data, but memory usage keeps growing:
@directionless If you can reproduce this using just etcd without etcd-operator, and provide a way to reproduce, it will be very helpful.
@gyuho I'll try to put together a full reproduction on my laptop.
etcd-operator was the easy way to get it running in Kubernetes. Is there anything you want me to pull from that cluster? Any good way to get a dump for inspection?
Output from client-url/metrics would be helpful. It would be best if it's reproducible on our side (including how you create the 10 locks).
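If it helps, something along these lines will capture that dump (assuming the default local client URL; adjust the endpoint for your cluster):

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Scrape the Prometheus metrics from the etcd client URL and save them to a file.
	resp, err := http.Get("http://127.0.0.1:2379/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	f, err := os.Create("metrics.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if _, err := io.Copy(f, resp.Body); err != nil {
		log.Fatal(err)
	}
}
```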
My test code is https://github.com/directionless/etcd-locking-memory-leak/blob/master/main.go. It's probably simplest just to read that.
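In short, it runs ten workers that each repeatedly create a session, take a mutex, hold it briefly, and release it, so at most ten locks are live at any time and nothing is stored long-term. Condensed, the shape is roughly this (not the exact code; timings and key names are illustrative):

```go
// Condensed sketch of the lock-churn test: ten workers repeatedly take and
// release a mutex, so at most ten locks are live at any time.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
	"github.com/coreos/etcd/clientv3/concurrency"
)

func churn(cli *clientv3.Client, key string) {
	for {
		// A fresh session (lease) per acquisition, mirroring short-lived locks.
		sess, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
		if err != nil {
			log.Print(err)
			time.Sleep(time.Second)
			continue
		}
		m := concurrency.NewMutex(sess, key)
		if err := m.Lock(context.TODO()); err != nil {
			log.Print(err)
			sess.Close()
			continue
		}
		time.Sleep(100 * time.Millisecond) // pretend to hold the lock briefly
		if err := m.Unlock(context.TODO()); err != nil {
			log.Print(err)
		}
		sess.Close()
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	for i := 0; i < 10; i++ {
		go churn(cli, fmt.Sprintf("/locks/worker-%d", i))
	}
	select {} // run until interrupted
}
```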
I've attached the dump of client-url/metrics
-- test-cluster-metrics.txt
I'm generating a local test case: I started etcd via run-local.sh and am running the aforementioned test code against it. So far I'm seeing similar memory growth; it started at 30 MB and is now at 100 MB, despite a compaction having run in between.
You have 171161 puts in total. Whenever you create and acquire a lock, the key is written to storage, so memory will spike as etcd holds those entries in Raft, which is expected. The memory usage will eventually come down once they are discarded. I would try --snapshot-count 100 to see if the memory comes down.
Anyway, I don't think there's a memory leak here. We just need to find out what the expected memory usage for the lock API is.
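For completeness, you can also request a compaction manually from the client to rule that side out; a minimal sketch (default local endpoint assumed). Keep in mind that compaction only discards superseded MVCC revisions in the key-value store, while the Raft entries mentioned above stay in memory until a snapshot is taken, which is what --snapshot-count controls:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Any read returns the current store revision in its response header.
	resp, err := cli.Get(ctx, "probe")
	if err != nil {
		log.Fatal(err)
	}

	// Compact up to the current revision; WithCompactPhysical waits until
	// the compaction is applied to the backend.
	if _, err := cli.Compact(ctx, resp.Header.Revision, clientv3.WithCompactPhysical()); err != nil {
		log.Fatal(err)
	}
	log.Printf("compacted through revision %d", resp.Header.Revision)
}
```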
I don't really understand why the key is held across the compaction events. The lock and session are released, so once the compaction occurs shouldn't they be cleared?
I've restarted both my GKE and my local test cases with the suggested --snapshot-count 100. I'll report back when it's been running for more than a couple of minutes.
It's been running about 1.5 hours with those settings.
My local instance started at 12 MB of RAM and grew to 50 MB. It's been hovering around there.
My GKE cluster started similarly, but it's now up to 200 MB per node and seems to still be growing. I don't know why the two differ; maybe it's cluster vs. singleton.
I feel very confused about why compaction isn't enough here. Is that something that I can read about?
I've left these tests running overnight. --snapshot-count 100 changes the sawtooth, but does not seem to dramatically change things.
Both of my 3-node GKE clusters are using 350 MB per node. The test cluster with the lower snapshot count has a shorter sawtooth period (the pink lines are the lower snapshot setting). 350 MB of memory seems quite excessive.
My local single node has been hovering right around 54 MB. It might have some slow growth, but it's hard to tell. (Forgive the less pretty graph; the x axis is time and the y axis is MB.)
I'm hitting the same problem with etcd version 3.3.1 and client version 3.1.3.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.