Memory usage in concurrency package Mutex
Hi! I'm running an etcd 3.3.9 cluster in Google Kubernetes Engine, managed by etcd-operator. The cluster is used solely for locking; at any time there are roughly five locks, and they tend to be short-lived. There is no long-term data.
We run auto-compaction every 10 minutes; I can see it happening in the logs.
Memory graphs of the cluster show memory use increasing in a stair-step pattern.
I'm not really sure what I can best provide to help diagnose whether this is a configuration error on my side or an etcd bug.
I've started putting info in https://github.com/directionless/etcd-locking-memory-leak. My test server logs are in the logs directory there.
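For reference, each lock is taken with the clientv3 concurrency package. The pattern is roughly this (a minimal sketch, not the exact code; the endpoint, lock key, and session TTL are placeholders, and the 3.3-era import paths are used):

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
	"github.com/coreos/etcd/clientv3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Each lock lives on its own session (lease); closing the session
	// releases the lock even if Unlock is never reached.
	sess, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()

	m := concurrency.NewMutex(sess, "/locks/example")
	if err := m.Lock(context.TODO()); err != nil {
		log.Fatal(err)
	}
	// ... short-lived critical section ...
	if err := m.Unlock(context.TODO()); err != nil {
		log.Fatal(err)
	}
}
```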
Production graphs:
Test case graphs:
Did the memory usage keep increasing or level off?
My test case was hammering my laptop, so I stopped it. I assume it would have kept growing. I'll try to get that into k8s and let it run for a day.
It's a bit harder for me to say what production is doing. It has sort of leveled off, but I don't know what the next day will bring. And, TBH, 380 MB per node to hold 4 locks seems rather excessive.
I got my test code running in Kubernetes about 1.5 hours ago, and since then the test cluster's memory usage has been increasing. The nature of that code is that there should only ever be 10 active locks and no stored data, but memory usage keeps growing:
@directionless If you can reproduce this using just etcd without etcd-operator, and provide a way to reproduce, it will be very helpful.
@gyuho I'll try to put together a full reproduction on my laptop.
etcd-operator was the easy way to get it running in Kubernetes. Is there anything you want me to pull from that cluster? Any good way to get a dump for inspection?
Output from client-url/metrics would be helpful. It would be best if it's reproducible on our side (including how you create the 10 locks).
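If it helps, something along these lines will capture that dump (assuming the default local client URL; adjust the endpoint for your cluster):

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Scrape the Prometheus metrics from the etcd client URL and save them to a file.
	resp, err := http.Get("http://127.0.0.1:2379/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	f, err := os.Create("metrics.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if _, err := io.Copy(f, resp.Body); err != nil {
		log.Fatal(err)
	}
}
```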
My test code is https://github.com/directionless/etcd-locking-memory-leak/blob/master/main.go. It's probably simplest just to read that.
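In short, it runs ten workers that each repeatedly create a session, take a mutex, hold it briefly, and release it, so at most ten locks are live at any time and nothing is stored long-term. Condensed, the shape is roughly this (not the exact code; timings and key names are illustrative):

```go
// Condensed sketch of the lock-churn test: ten workers repeatedly take and
// release a mutex, so at most ten locks are live at any time.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
	"github.com/coreos/etcd/clientv3/concurrency"
)

func churn(cli *clientv3.Client, key string) {
	for {
		// A fresh session (lease) per acquisition, mirroring short-lived locks.
		sess, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
		if err != nil {
			log.Print(err)
			time.Sleep(time.Second)
			continue
		}
		m := concurrency.NewMutex(sess, key)
		if err := m.Lock(context.TODO()); err != nil {
			log.Print(err)
			sess.Close()
			continue
		}
		time.Sleep(100 * time.Millisecond) // pretend to hold the lock briefly
		if err := m.Unlock(context.TODO()); err != nil {
			log.Print(err)
		}
		sess.Close()
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	for i := 0; i < 10; i++ {
		go churn(cli, fmt.Sprintf("/locks/worker-%d", i))
	}
	select {} // run until interrupted
}
```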
I've attached the dump of client-url/metrics
-- test-cluster-metrics.txt
I'm generating a local test case: I started etcd via run-local.sh and am running the aforementioned test code against it. So far I'm seeing similar memory growth; it started at 30 MB and is now at 100 MB, despite a compaction having run in between.
You have 171161 puts in total. Whenever you create and acquire a lock, the key is written to storage, so memory will spike as etcd holds those entries in Raft, which is expected. The memory usage will eventually come down once they are discarded. I would try --snapshot-count 100 to see if the memory comes down.
Anyway, I don't think there's a memory leak here. We just need to find out what the expected memory usage for the lock API is.
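For completeness, you can also request a compaction manually from the client to rule that side out; a minimal sketch (default local endpoint assumed). Keep in mind that compaction only discards superseded MVCC revisions in the key-value store, while the Raft entries mentioned above stay in memory until a snapshot is taken, which is what --snapshot-count controls:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Any read returns the current store revision in its response header.
	resp, err := cli.Get(ctx, "probe")
	if err != nil {
		log.Fatal(err)
	}

	// Compact up to the current revision; WithCompactPhysical waits until
	// the compaction is applied to the backend.
	if _, err := cli.Compact(ctx, resp.Header.Revision, clientv3.WithCompactPhysical()); err != nil {
		log.Fatal(err)
	}
	log.Printf("compacted through revision %d", resp.Header.Revision)
}
```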
I don't really understand why the key is held across the compaction events. The lock and session are released, so once the compaction occurs shouldn't they be cleared?
I've restarted both my GKE and my local test cases with the suggested --snapshot-count 100. I'll report back when it's been running for more than a couple of minutes.
It's been running about 1.5 hours with those settings.
My local instance started at 12 MB of RAM and grew to 50 MB. It's been hovering around there.
My GKE cluster started similarly, but it's now up to 200 MB per node and seems to still be growing. I don't know why the two differ; maybe it's cluster vs. singleton.
I feel very confused about why compaction isn't enough here. Is that something that I can read about?
I've left these tests running overnight. --snapshot-count 100 changes the sawtooth, but does not seem to dramatically change things.
Both of my 3-node GKE clusters are using 350 MB per node. The test cluster with the lower snapshot count has a shorter sawtooth period (the pink lines are the lower snapshot setting). 350 MB of memory seems quite excessive.
My local single node has been hovering right around 54 MB. It might have some slow growth, but it's hard to tell. (Forgive the less pretty graph; the x axis is time and the y axis is MB.)
I'm hitting the same problem with etcd version 3.3.1 and client version 3.1.3.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.