
out of memory

porjo opened this issue on Jan 15 '15 · 17 comments

I've been having some issues hitting hard RAM limit on my box (swap is disabled). I noticed this message in the output of journalctl:

Jan 15 21:14:07 bris01 kernel: Out of memory: Kill process 940 (vulcand) score 393 or sacrifice child
Jan 15 21:14:07 bris01 kernel: Killed process 940 (vulcand) total-vm:415440kB, anon-rss:401040kB, file-rss:0kB

RSS of ~400MB seems excessive. This instance only receives very light traffic, so I suspect there may be a memory leak. Sorry, I don't have any other data to back that up at this stage.

porjo avatar Jan 15 '15 21:01 porjo

I've just re-built vulcand from latest commit, just to be sure I'm not hitting an old bug. I'll continue to monitor.

porjo avatar Jan 15 '15 21:01 porjo

Wow, that's a serious one. I'll spend some time digging into it, thanks for reporting!

klizhentas avatar Jan 15 '15 23:01 klizhentas

I think I've found it. Anyway, try v0.8.0-beta.2 out, and if you still experience it, feel free to reopen.

klizhentas avatar Jan 16 '15 22:01 klizhentas

@klizhentas I haven't tried v0.8.0-beta.2 yet, however my server has been up for the last 16hrs running pprof. I just ran a heap analysis:

(pprof) top 10
Total: 350.0 MB
   324.0  92.6%  92.6%    324.0  92.6% github.com/mailgun/vulcand/Godeps/_workspace/src/github.com/codahale/hdrhistogram.New
     4.0   1.1%  93.7%    328.0  93.7% github.com/mailgun/vulcand/Godeps/_workspace/src/github.com/mailgun/oxy/memmetrics.NewHDRHistogram
     3.5   1.0%  94.7%      3.5   1.0% encoding/json.(*decodeState).literalStore
     2.0   0.6%  95.3%      2.0   0.6% allocg
     2.0   0.6%  95.9%      2.0   0.6% bufio.NewWriter
     1.5   0.4%  96.3%      1.5   0.4% bufio.NewReaderSize
     1.5   0.4%  96.7%      1.5   0.4% github.com/mailgun/vulcand/Godeps/_workspace/src/github.com/mailgun/route.(*trieNode).parseExpression
     1.0   0.3%  97.0%      1.0   0.3% encoding/pem.Decode
     1.0   0.3%  97.3%      1.0   0.3% github.com/mailgun/vulcand/Godeps/_workspace/src/github.com/mailgun/oxy/memmetrics.NewCounter
     1.0   0.3%  97.6%      1.0   0.3% github.com/mailgun/vulcand/Godeps/_workspace/src/github.com/mailgun/oxy/roundrobin.NewRebalancer

From that, it looks like hdrhistogram is the culprit!

porjo avatar Jan 17 '15 03:01 porjo

Yep, that happened because vulcand did not clean up the histograms on each reload: a goroutine kept a reference to them, so they could never be garbage collected. Try the new version out and see if it's any better.

klizhentas avatar Jan 17 '15 04:01 klizhentas

Ah, ok - I'll give it a try! Thanks

porjo avatar Jan 17 '15 04:01 porjo

Just FYI, my server has been running for 10hrs and the memory usage looks much better since the fix. Heap analysis is as follows:

(pprof) top 10
Total: 9.0 MB
     3.5  38.9%  38.9%      3.5  38.9% bufio.NewReaderSize
     3.0  33.3%  72.2%      3.0  33.3% github.com/mailgun/vulcand/Godeps/_workspace/src/github.com/codahale/hdrhistogram.New
     1.0  11.1%  83.3%      1.0  11.1% makemap_c
     0.5   5.6%  88.9%      0.5   5.6% allocg
     0.5   5.6%  94.4%      0.5   5.6% net/http.(*Transport).getIdleConnCh
     0.5   5.6% 100.0%      0.5   5.6% regexp/syntax.(*compiler).rune
     0.0   0.0% 100.0%      0.5   5.6% _rt0_go

porjo avatar Jan 18 '15 23:01 porjo

Thanks for the update!

klizhentas avatar Jan 19 '15 01:01 klizhentas

I've been running the Docker containers mailgun/vulcand:v0.8.0-beta.2 and most recently mailgun/vulcand:v0.8.0-beta.3, and I'm still seeing really bad memory leaks. After 24 hours, usage is back up to about 4.5 GiB of RAM.

CONTAINER           CPU %               MEM USAGE/LIMIT      MEM %               NET I/O
central_vulcand     0.70%               4.55 GiB/62.96 GiB   7.23%               1.675 GiB/1.986 GiB

Worse yet, yesterday when I killed the container and restarted it, all the frontends were unavailable. I had to re-announce all the services for them to take effect. I think Vulcand somehow corrupted the etcd entries.

osterman avatar Apr 16 '15 23:04 osterman

Erik,

I can release a patched version that dumps goroutines periodically. Would you be able to install it and run it for a couple of hours?

In addition to that, are there any outstanding logs that Vulcand produces?

Thanks, Sasha

klizhentas avatar Apr 17 '15 00:04 klizhentas

@klizhentas sure! Any chance you could push it as a container to Docker Hub?

FWIW, I'm using Registrator+SkyDNS.

Here's how I announce

ExecStart=/usr/bin/docker exec central_vulcand /go/bin/vctl --vulcan 'http://localhost:8192' backend upsert -id ib_web

Here's how I set presence

ExecStart=/usr/bin/docker exec central_vulcand /go/bin/vctl --vulcan 'http://localhost:8192' server upsert -b ib_web -id ib_web_server -url http://web.ib.cloudposse.local:80

Here's how I set the vHost:

ExecStart=/usr/bin/docker exec central_vulcand /go/bin/vctl --vulcan 'http://localhost:8192' frontend upsert -id %i -b ib_web -route 'Host(`%i`) && PathRegexp(`.*`)'

Here's how I run vulcand:

ExecStart=/usr/bin/docker run \
                      --dns ${COREOS_PRIVATE_IPV4} \
                      --rm \
                      --name central_vulcand \
                      --memory=10g \
                      --cpu-shares=200 \
                      -e "SERVICE_NAME=central" \
                      -e "SERVICE_ID=vulcand" \
                      -p 80:80 \
                      mailgun/vulcand:v0.8.0-beta.3 \
                      /go/bin/vulcand -apiInterface=0.0.0.0 -interface=0.0.0.0 -etcd=http://${COREOS_PRIVATE_IPV4}:4001 -port=80 -apiPort=8192

I do see some errors like these... no idea if they are related.

Apr 17 00:12:36 core-01 docker[1226]: Apr 17 00:12:36.461: ERROR PID:1 [fwd.go:89] Error forwarding to http://web.ib.cloudposse.local:80, err: net/http: timeout awaiting response headers
Apr 17 00:13:27 core-01 docker[1226]: Apr 17 00:13:27.200: ERROR PID:1 [etcd.go:417] unexpected error: unexpected end of JSON input, stop watching
Apr 17 00:13:36 core-01 docker[1226]: Apr 17 00:13:36.669: ERROR PID:1 [fwd.go:89] Error forwarding to http://web.ib.cloudposse.local:80, err: net/http: timeout awaiting response headers
Apr 17 00:14:36 core-01 docker[1226]: Apr 17 00:14:36.671: ERROR PID:1 [fwd.go:89] Error forwarding to http://web.ib.cloudposse.local:80, err: net/http: timeout awaiting response headers
Apr 17 00:15:36 core-01 docker[1226]: Apr 17 00:15:36.832: ERROR PID:1 [fwd.go:89] Error forwarding to http://web.ib.cloudposse.local:80, err: net/http: timeout awaiting response headers
Apr 17 00:16:36 core-01 docker[1226]: Apr 17 00:16:36.833: ERROR PID:1 [fwd.go:89] Error forwarding to http://web.ib.cloudposse.local:80, err: net/http: timeout awaiting response headers
Apr 17 00:17:36 core-01 docker[1226]: Apr 17 00:17:36.928: ERROR PID:1 [fwd.go:89] Error forwarding to http://web.ib.cloudposse.local:80, err: net/http: timeout awaiting response headers
Apr 17 00:18:32 core-01 docker[1226]: Apr 17 00:18:32.256: ERROR PID:1 [etcd.go:417] unexpected error: unexpected end of JSON input, stop watching
Apr 17 00:18:36 core-01 docker[1226]: Apr 17 00:18:36.929: ERROR PID:1 [fwd.go:89] Error forwarding to http://web.ib.cloudposse.local:80, err: net/http: timeout awaiting response headers
Apr 17 00:19:37 core-01 docker[1226]: Apr 17 00:19:37.024: ERROR PID:1 [fwd.go:89] Error forwarding to http://web.ib.cloudposse.local:80, err: net/http: timeout awaiting response headers

osterman avatar Apr 17 '15 00:04 osterman

Sure, I will prepare the build on Docker Hub and let you know; meanwhile, I'll reopen the bug.

klizhentas avatar Apr 17 '15 00:04 klizhentas

What's your etcd version?

psi-4ward avatar Jun 08 '15 21:06 psi-4ward

I'm running CoreOS 647.0.0, which is pinned at etcd 0.4.9. Unfortunately, I had to cut bait on vulcand a few weeks ago since I wasn't seeing any updates. I'm now running Nginx+confd, so I won't be able to help diagnose these memory leak issues any longer.

osterman avatar Jun 09 '15 15:06 osterman

I just found I'm running into this as well, using v0.8.0-beta.3. I really like vulcand and don't want to switch to confd. I'm running CoreOS 681.2.0 with etcd 0.4.9. Interestingly, my logs show that fleetd was having trouble connecting to etcd immediately before vulcand was killed. Could that be a symptom of the high RAM usage, or a cause of something going wrong in vulcand?

Jun 24 15:20:26 coreos-01 fleetd[587]: ERROR engine.go:149: Unable to determine cluster engine version
Jun 24 15:20:27 coreos-01 fleetd[587]: INFO client.go:292: Failed getting response from http://localhost:4001/: cancelled
Jun 24 15:20:29 coreos-01 kernel: etcd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
...
Jun 24 15:20:29 coreos-01 kernel: Out of memory: Kill process 27077 (vulcand) score 777 or sacrifice child
Jun 24 15:20:29 coreos-01 kernel: Killed process 27077 (vulcand) total-vm:589868kB, anon-rss:392300kB, file-rss:0kB

crabmusket avatar Jun 25 '15 11:06 crabmusket

Since upgrading to etcd 2.1 on my CoreOS install, I no longer see the memory leak in vulcand.

tboerger avatar Jun 25 '15 12:06 tboerger

Hmm, I might see if upgrading that helps.

crabmusket avatar Jun 26 '15 02:06 crabmusket