
out of memory

porjo opened this issue on Jan 15 '15 · 17 comments

I've been having some issues hitting hard RAM limit on my box (swap is disabled). I noticed this message in the output of journalctl:

Jan 15 21:14:07 bris01 kernel: Out of memory: Kill process 940 (vulcand) score 393 or sacrifice child
Jan 15 21:14:07 bris01 kernel: Killed process 940 (vulcand) total-vm:415440kB, anon-rss:401040kB, file-rss:0kB

RSS of ~400MB seems excessive. This instance only receives very light traffic, so I suspect there may be a memory leak. Sorry, I don't have any other data to back that up at this stage.

porjo avatar Jan 15 '15 21:01 porjo

I've just re-built vulcand from latest commit, just to be sure I'm not hitting an old bug. I'll continue to monitor.

porjo avatar Jan 15 '15 21:01 porjo

Wow, that's a serious one. I'll spend some time digging into it, thanks for reporting!

klizhentas avatar Jan 15 '15 23:01 klizhentas

I think I've found it. Anyway, try v0.8.0-beta.2 out, and if you still experience it, feel free to reopen.

klizhentas avatar Jan 16 '15 22:01 klizhentas

@klizhentas I haven't tried v0.8.0-beta.2 yet, however my server has been up for the last 16hrs running pprof. I just ran a heap analysis:

(pprof) top 10
Total: 350.0 MB
   324.0  92.6%  92.6%    324.0  92.6% github.com/mailgun/vulcand/Godeps/_workspace/src/github.com/codahale/hdrhistogram.New
     4.0   1.1%  93.7%    328.0  93.7% github.com/mailgun/vulcand/Godeps/_workspace/src/github.com/mailgun/oxy/memmetrics.NewHDRHistogram
     3.5   1.0%  94.7%      3.5   1.0% encoding/json.(*decodeState).literalStore
     2.0   0.6%  95.3%      2.0   0.6% allocg
     2.0   0.6%  95.9%      2.0   0.6% bufio.NewWriter
     1.5   0.4%  96.3%      1.5   0.4% bufio.NewReaderSize
     1.5   0.4%  96.7%      1.5   0.4% github.com/mailgun/vulcand/Godeps/_workspace/src/github.com/mailgun/route.(*trieNode).parseExpression
     1.0   0.3%  97.0%      1.0   0.3% encoding/pem.Decode
     1.0   0.3%  97.3%      1.0   0.3% github.com/mailgun/vulcand/Godeps/_workspace/src/github.com/mailgun/oxy/memmetrics.NewCounter
     1.0   0.3%  97.6%      1.0   0.3% github.com/mailgun/vulcand/Godeps/_workspace/src/github.com/mailgun/oxy/roundrobin.NewRebalancer

From that, it looks like hdrhistogram is the culprit!

porjo avatar Jan 17 '15 03:01 porjo

Yep, that happened because vulcand did not clean up the histograms on each reload: a goroutine kept a reference to them, so they could never be garbage collected. Try the new version out and see if it's any better.

klizhentas avatar Jan 17 '15 04:01 klizhentas

Ah, ok - I'll give it a try! Thanks

porjo avatar Jan 17 '15 04:01 porjo

Just FYI, my server has been running for 10hrs and the memory usage looks much better since the fix. Heap analysis is as follows:

(pprof) top 10
Total: 9.0 MB
     3.5  38.9%  38.9%      3.5  38.9% bufio.NewReaderSize
     3.0  33.3%  72.2%      3.0  33.3% github.com/mailgun/vulcand/Godeps/_workspace/src/github.com/codahale/hdrhistogram.New
     1.0  11.1%  83.3%      1.0  11.1% makemap_c
     0.5   5.6%  88.9%      0.5   5.6% allocg
     0.5   5.6%  94.4%      0.5   5.6% net/http.(*Transport).getIdleConnCh
     0.5   5.6% 100.0%      0.5   5.6% regexp/syntax.(*compiler).rune
     0.0   0.0% 100.0%      0.5   5.6% _rt0_go

porjo avatar Jan 18 '15 23:01 porjo

Thanks for the update!

klizhentas avatar Jan 19 '15 01:01 klizhentas

I've been running the Docker containers mailgun/vulcand:v0.8.0-beta.2 and most recently mailgun/vulcand:v0.8.0-beta.3, and I'm still seeing really bad memory leaks. After 24 hours, usage is back up to about 4.5 GiB of RAM.

CONTAINER           CPU %               MEM USAGE/LIMIT      MEM %               NET I/O
central_vulcand     0.70%               4.55 GiB/62.96 GiB   7.23%               1.675 GiB/1.986 GiB

Worse yet, yesterday when I killed the container and restarted it, all the frontends were unavailable. I had to re-announce all the services for them to take effect. I think Vulcand somehow corrupted the etcd entries.

osterman avatar Apr 16 '15 23:04 osterman

Erik,

I can release a patched version that dumps goroutines periodically. Would you be able to install it and run it for a couple of hours?

In addition to that, are there any outstanding logs that Vulcand produces?

Thanks, Sasha

klizhentas avatar Apr 17 '15 00:04 klizhentas

@klizhentas sure! Any chance you could push it as a container to Docker Hub?

FWIW, I'm using Registrator+SkyDNS.

Here's how I announce

ExecStart=/usr/bin/docker exec central_vulcand /go/bin/vctl --vulcan 'http://localhost:8192' backend upsert -id ib_web

Here's how I set presence

ExecStart=/usr/bin/docker exec central_vulcand /go/bin/vctl --vulcan 'http://localhost:8192' server upsert -b ib_web -id ib_web_server -url http://web.ib.cloudposse.local:80

Here's how I set the vHost:

ExecStart=/usr/bin/docker exec central_vulcand /go/bin/vctl --vulcan 'http://localhost:8192' frontend upsert -id %i -b ib_web -route 'Host(`%i`) && PathRegexp(`.*`)'

Here's how I run vulcand:

ExecStart=/usr/bin/docker run \
                      --dns ${COREOS_PRIVATE_IPV4} \
                      --rm \
                      --name central_vulcand \
                      --memory=10g \
                      --cpu-shares=200 \
                      -e "SERVICE_NAME=central" \
                      -e "SERVICE_ID=vulcand" \
                      -p 80:80 \
                      mailgun/vulcand:v0.8.0-beta.3 \
                      /go/bin/vulcand -apiInterface=0.0.0.0 -interface=0.0.0.0 -etcd=http://${COREOS_PRIVATE_IPV4}:4001 -port=80 -apiPort=8192

I do see some errors like these... no idea if they are related.

Apr 17 00:12:36 core-01 docker[1226]: Apr 17 00:12:36.461: ERROR PID:1 [fwd.go:89] Error forwarding to http://web.ib.cloudposse.local:80, err: net/http: timeout awaiting response headers
Apr 17 00:13:27 core-01 docker[1226]: Apr 17 00:13:27.200: ERROR PID:1 [etcd.go:417] unexpected error: unexpected end of JSON input, stop watching
Apr 17 00:13:36 core-01 docker[1226]: Apr 17 00:13:36.669: ERROR PID:1 [fwd.go:89] Error forwarding to http://web.ib.cloudposse.local:80, err: net/http: timeout awaiting response headers
Apr 17 00:14:36 core-01 docker[1226]: Apr 17 00:14:36.671: ERROR PID:1 [fwd.go:89] Error forwarding to http://web.ib.cloudposse.local:80, err: net/http: timeout awaiting response headers
Apr 17 00:15:36 core-01 docker[1226]: Apr 17 00:15:36.832: ERROR PID:1 [fwd.go:89] Error forwarding to http://web.ib.cloudposse.local:80, err: net/http: timeout awaiting response headers
Apr 17 00:16:36 core-01 docker[1226]: Apr 17 00:16:36.833: ERROR PID:1 [fwd.go:89] Error forwarding to http://web.ib.cloudposse.local:80, err: net/http: timeout awaiting response headers
Apr 17 00:17:36 core-01 docker[1226]: Apr 17 00:17:36.928: ERROR PID:1 [fwd.go:89] Error forwarding to http://web.ib.cloudposse.local:80, err: net/http: timeout awaiting response headers
Apr 17 00:18:32 core-01 docker[1226]: Apr 17 00:18:32.256: ERROR PID:1 [etcd.go:417] unexpected error: unexpected end of JSON input, stop watching
Apr 17 00:18:36 core-01 docker[1226]: Apr 17 00:18:36.929: ERROR PID:1 [fwd.go:89] Error forwarding to http://web.ib.cloudposse.local:80, err: net/http: timeout awaiting response headers
Apr 17 00:19:37 core-01 docker[1226]: Apr 17 00:19:37.024: ERROR PID:1 [fwd.go:89] Error forwarding to http://web.ib.cloudposse.local:80, err: net/http: timeout awaiting response headers

osterman avatar Apr 17 '15 00:04 osterman

Sure, I will prepare the build on Docker Hub and let you know; meanwhile, I'll reopen the bug.

klizhentas avatar Apr 17 '15 00:04 klizhentas

What's your etcd version?

psi-4ward avatar Jun 08 '15 21:06 psi-4ward

I'm running CoreOS 647.0.0, which is pinned at etcd 0.4.9. Unfortunately, I had to cut bait on vulcand a few weeks ago since I wasn't seeing any updates. I'm now running Nginx+confd, so I won't be able to help diagnose these memory leak issues any longer.

osterman avatar Jun 09 '15 15:06 osterman

I just found I'm running into this as well, using v0.8.0-beta.3. I really like vulcand and don't want to switch to confd. I'm running CoreOS 681.2.0 with etcd 0.4.9. Interestingly, my logs show that fleetd was having trouble connecting to etcd immediately before vulcand was killed. Could that be a symptom of the high RAM usage, or a cause of something going wrong in vulcand?

Jun 24 15:20:26 coreos-01 fleetd[587]: ERROR engine.go:149: Unable to determine cluster engine version
Jun 24 15:20:27 coreos-01 fleetd[587]: INFO client.go:292: Failed getting response from http://localhost:4001/: cancelled
Jun 24 15:20:29 coreos-01 kernel: etcd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
...
Jun 24 15:20:29 coreos-01 kernel: Out of memory: Kill process 27077 (vulcand) score 777 or sacrifice child
Jun 24 15:20:29 coreos-01 kernel: Killed process 27077 (vulcand) total-vm:589868kB, anon-rss:392300kB, file-rss:0kB

crabmusket avatar Jun 25 '15 11:06 crabmusket

Since upgrading to etcd 2.1 on my CoreOS install, I no longer see the memory leak in vulcand.

tboerger avatar Jun 25 '15 12:06 tboerger

Hmm, I might see if upgrading that helps.

crabmusket avatar Jun 26 '15 02:06 crabmusket