Out of memory
I've been having issues hitting the hard RAM limit on my box (swap is disabled). I noticed this message in the output of journalctl:
Jan 15 21:14:07 bris01 kernel: Out of memory: Kill process 940 (vulcand) score 393 or sacrifice child
Jan 15 21:14:07 bris01 kernel: Killed process 940 (vulcand) total-vm:415440kB, anon-rss:401040kB, file-rss:0kB
RSS of ~400 MB seems excessive. This instance only receives very light traffic, so I'm guessing there may be a memory leak? Sorry, I don't have any other data to back that up at this stage.
I've just rebuilt vulcand from the latest commit, just to be sure I'm not hitting an old bug. I'll continue to monitor.
Wow, that's a serious one. I'll spend some time digging into it, thanks for reporting!
I think I've found it. Anyway, try v0.8.0-beta.2 out, and if you still experience it, feel free to reopen.
@klizhentas I haven't tried v0.8.0-beta.2 yet, but my server has been up for the last 16 hours running pprof. I just ran a heap analysis:
(pprof) top 10
Total: 350.0 MB
324.0 92.6% 92.6% 324.0 92.6% github.com/mailgun/vulcand/Godeps/_workspace/src/github.com/codahale/hdrhistogram.New
4.0 1.1% 93.7% 328.0 93.7% github.com/mailgun/vulcand/Godeps/_workspace/src/github.com/mailgun/oxy/memmetrics.NewHDRHistogram
3.5 1.0% 94.7% 3.5 1.0% encoding/json.(*decodeState).literalStore
2.0 0.6% 95.3% 2.0 0.6% allocg
2.0 0.6% 95.9% 2.0 0.6% bufio.NewWriter
1.5 0.4% 96.3% 1.5 0.4% bufio.NewReaderSize
1.5 0.4% 96.7% 1.5 0.4% github.com/mailgun/vulcand/Godeps/_workspace/src/github.com/mailgun/route.(*trieNode).parseExpression
1.0 0.3% 97.0% 1.0 0.3% encoding/pem.Decode
1.0 0.3% 97.3% 1.0 0.3% github.com/mailgun/vulcand/Godeps/_workspace/src/github.com/mailgun/oxy/memmetrics.NewCounter
1.0 0.3% 97.6% 1.0 0.3% github.com/mailgun/vulcand/Godeps/_workspace/src/github.com/mailgun/oxy/roundrobin.NewRebalancer
From that, it looks like hdrhistogram is the culprit!
Yep. That happened because vulcand did not clean up the histograms on reload: a goroutine kept a reference to them, so they could never be garbage collected. Try the new version out to see if it's any better.
Ah, ok - I'll give it a try! Thanks
Just FYI, my server has been running for 10 hours and memory usage looks much better since the fix. The heap analysis is as follows:
(pprof) top 10
Total: 9.0 MB
3.5 38.9% 38.9% 3.5 38.9% bufio.NewReaderSize
3.0 33.3% 72.2% 3.0 33.3% github.com/mailgun/vulcand/Godeps/_workspace/src/github.com/codahale/hdrhistogram.New
1.0 11.1% 83.3% 1.0 11.1% makemap_c
0.5 5.6% 88.9% 0.5 5.6% allocg
0.5 5.6% 94.4% 0.5 5.6% net/http.(*Transport).getIdleConnCh
0.5 5.6% 100.0% 0.5 5.6% regexp/syntax.(*compiler).rune
0.0 0.0% 100.0% 0.5 5.6% _rt0_go
Thanks for the update!
I've been running the Docker containers mailgun/vulcand:v0.8.0-beta.2 and, most recently, mailgun/vulcand:v0.8.0-beta.3, and I'm still experiencing really bad memory leaks. After 24 hours, it's back up to about 4.5 GiB of RAM.
CONTAINER CPU % MEM USAGE/LIMIT MEM % NET I/O
central_vulcand 0.70% 4.55 GiB/62.96 GiB 7.23% 1.675 GiB/1.986 GiB
Worse yet, when I killed the container and restarted it yesterday, all the frontends were unavailable. I had to re-announce all the services for them to take effect. I think vulcand somehow corrupted the etcd entries.
Erik,
I can release a patched version that takes goroutine dumps periodically; would you be able to install it and run it for a couple of hours?
In addition to that, is there anything notable in the logs that vulcand produces?
Thanks, Sasha
@klizhentas sure! Any chance you could push it as a container to Docker Hub?
FWIW, I'm using Registrator + SkyDNS.
Here's how I announce the backend:
ExecStart=/usr/bin/docker exec central_vulcand /go/bin/vctl --vulcan 'http://localhost:8192' backend upsert -id ib_web
Here's how I set presence:
ExecStart=/usr/bin/docker exec central_vulcand /go/bin/vctl --vulcan 'http://localhost:8192' server upsert -b ib_web -id ib_web_server -url http://web.ib.cloudposse.local:80
Here's how I set the vHost:
ExecStart=/usr/bin/docker exec central_vulcand /go/bin/vctl --vulcan 'http://localhost:8192' frontend upsert -id %i -b ib_web -route 'Host(`%i`) && PathRegexp(`.*`)'
Here's how I run vulcand:
ExecStart=/usr/bin/docker run \
--dns ${COREOS_PRIVATE_IPV4} \
--rm \
--name central_vulcand \
--memory=10g \
--cpu-shares=200 \
-e "SERVICE_NAME=central" \
-e "SERVICE_ID=vulcand" \
-p 80:80 \
mailgun/vulcand:v0.8.0-beta.3 \
/go/bin/vulcand -apiInterface=0.0.0.0 -interface=0.0.0.0 -etcd=http://${COREOS_PRIVATE_IPV4}:4001 -port=80 -apiPort=8192
I do see some errors like these... no idea if they are related.
Apr 17 00:12:36 core-01 docker[1226]: Apr 17 00:12:36.461: ERROR PID:1 [fwd.go:89] Error forwarding to http://web.ib.cloudposse.local:80, err: net/http: timeout awaiting response headers
Apr 17 00:13:27 core-01 docker[1226]: Apr 17 00:13:27.200: ERROR PID:1 [etcd.go:417] unexpected error: unexpected end of JSON input, stop watching
Apr 17 00:13:36 core-01 docker[1226]: Apr 17 00:13:36.669: ERROR PID:1 [fwd.go:89] Error forwarding to http://web.ib.cloudposse.local:80, err: net/http: timeout awaiting response headers
Apr 17 00:14:36 core-01 docker[1226]: Apr 17 00:14:36.671: ERROR PID:1 [fwd.go:89] Error forwarding to http://web.ib.cloudposse.local:80, err: net/http: timeout awaiting response headers
Apr 17 00:15:36 core-01 docker[1226]: Apr 17 00:15:36.832: ERROR PID:1 [fwd.go:89] Error forwarding to http://web.ib.cloudposse.local:80, err: net/http: timeout awaiting response headers
Apr 17 00:16:36 core-01 docker[1226]: Apr 17 00:16:36.833: ERROR PID:1 [fwd.go:89] Error forwarding to http://web.ib.cloudposse.local:80, err: net/http: timeout awaiting response headers
Apr 17 00:17:36 core-01 docker[1226]: Apr 17 00:17:36.928: ERROR PID:1 [fwd.go:89] Error forwarding to http://web.ib.cloudposse.local:80, err: net/http: timeout awaiting response headers
Apr 17 00:18:32 core-01 docker[1226]: Apr 17 00:18:32.256: ERROR PID:1 [etcd.go:417] unexpected error: unexpected end of JSON input, stop watching
Apr 17 00:18:36 core-01 docker[1226]: Apr 17 00:18:36.929: ERROR PID:1 [fwd.go:89] Error forwarding to http://web.ib.cloudposse.local:80, err: net/http: timeout awaiting response headers
Apr 17 00:19:37 core-01 docker[1226]: Apr 17 00:19:37.024: ERROR PID:1 [fwd.go:89] Error forwarding to http://web.ib.cloudposse.local:80, err: net/http: timeout awaiting response headers
Sure, I will prepare the build on Docker Hub and let you know; meanwhile, I'll reopen the bug.
What's your etcd version?
I'm running CoreOS 647.0.0, which is pinned at etcd 0.4.9. Unfortunately, I had to cut bait on vulcand a few weeks ago since I didn't see any updates. I'm now running Nginx + confd, so I won't be able to help diagnose this memory leak any longer.
I just found I'm running into this as well, using v0.8.0-beta.3. I really like vulcand and don't want to switch to confd. I'm running CoreOS 681.2.0, etcd version 0.4.9. Interestingly, my logs show that fleetd was having trouble connecting to etcd immediately before vulcand was killed. Could that be a symptom of the high RAM usage, or a cause of something going wrong in vulcand?
Jun 24 15:20:26 coreos-01 fleetd[587]: ERROR engine.go:149: Unable to determine cluster engine version
Jun 24 15:20:27 coreos-01 fleetd[587]: INFO client.go:292: Failed getting response from http://localhost:4001/: cancelled
Jun 24 15:20:29 coreos-01 kernel: etcd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
...
Jun 24 15:20:29 coreos-01 kernel: Out of memory: Kill process 27077 (vulcand) score 777 or sacrifice child
Jun 24 15:20:29 coreos-01 kernel: Killed process 27077 (vulcand) total-vm:589868kB, anon-rss:392300kB, file-rss:0kB
Since upgrading to etcd 2.1 on my CoreOS install, I no longer see a memory leak in vulcand.
Hmm, I might see if upgrading that helps.