
Carbonserver panic and Out of Memory


Hello,

we're running a production Graphite cluster based on a carbon-c-relay frontend and go-carbon as a backend.

The cluster is composed of 3 go-carbon nodes running on well-equipped servers (128GB of memory and several 8-core CPUs).

We've recently upgraded to go-carbon version 0.10.0. Furthermore, we're currently running a nightly cron task to clean up stale .wsp files.

The problem we experienced today was an out-of-memory condition on all 3 nodes, presumably caused by the carbonserver component, which ended with the kernel OOM killer stepping in:

[20944265.175516] Out of memory: Kill process 30346 (go-carbon) score 726 or sacrifice child
[20944265.175782] Killed process 30346 (go-carbon) total-vm:115783724kB, anon-rss:99334204kB, file-rss:0kB
[20946896.158156] perf samples too long (10010 > 10000), lowering kernel.perf_event_max_sample_rate to 12500
[21361096.704491] TCP: TCP: Possible SYN flooding on port 2003. Sending cookies.  Check SNMP counters.

associated with the following error stack trace in the go-carbon logs:

[2017-07-06T10:45:33.416Z] ERROR [carbonserver] panic encountered {"handler": "fileListUpdated", "stack": "github.com/lomik/go-carbon/vendor/go.uber.org/zap.Stack\n\t/home/travis/gopath/src/github.com/lomik/go-carbon/_vendor/src/github.com/lomik/go-carbon/vendor/go.uber.org/zap/field.go:209\ngithub.com/lomik/go-carbon/carbonserver.(*CarbonserverListener).updateFileList.func1\n\t/home/travis/gopath/src/github.com/lomik/go-carbon/_vendor/src/github.com/lomik/go-carbon/carbonserver/carbonserver.go:340\nruntime.call32\n\t/home/travis/.gimme/versions/go1.8.1.linux.amd64/src/runtime/asm_amd64.s:514\nruntime.gopanic\n\t/home/travis/.gimme/versions/go1.8.1.linux.amd64/src/runtime/panic.go:489\nruntime.panicmem\n\t/home/travis/.gimme/versions/go1.8.1.linux.amd64/src/runtime/panic.go:63\nruntime.sigpanic\n\t/home/travis/.gimme/versions/go1.8.1.linux.amd64/src/runtime/signal_unix.go:290\ngithub.com/lomik/go-carbon/carbonserver.(*CarbonserverListener).updateFileList\n\t/home/travis/gopath/src/github.com/lomik/go-carbon/_vendor/src/github.com/lomik/go-carbon/carbonserver/carbonserver.go:418\ngithub.com/lomik/go-carbon/carbonserver.(*CarbonserverListener).fileListUpdater\n\t/home/travis/gopath/src/github.com/lomik/go-carbon/_vendor/src/github.com/lomik/go-carbon/carbonserver/carbonserver.go:331", "error": "runtime error: invalid memory address or nil pointer dereference"}
[2017-07-06T10:46:02.420Z] ERROR [persister] failed to open whisper file {"path": "/data/graphite/whisper/sprcom/servers/qa/rabbitmq_mothership-qa/queues-MOCacheInvalidationTopic_sgw_SI2_/return.wsp", "error": "open /data/graphite/whisper/sprcom/servers/qa/rabbitmq_mothership-qa/queues-MOCacheInvalidationTopic_sgw_SI2_/return.wsp: too many open files"}
[2017-07-06T10:46:02.420Z] ERROR [persister] failed to open whisper file {"path": "/data/graphite/whisper/servers/sl-live-aim-03/irq/irq-72.wsp", "error": "open /data/graphite/whisper/servers/sl-live-aim-03/irq/irq-72.wsp: too many open files"}
[2017-07-06T10:46:02.420Z] ERROR [persister] failed to open whisper file {"path": "/data/graphite/whisper/servers/sl-live-rone-ml-master-11_springer-sbm_com/irq/irq-108.wsp", "error": "open /data/graphite/whisper/servers/sl-live-rone-ml-master-11_springer-sbm_com/irq/irq-108.wsp: too many open files"}
[2017-07-06T10:46:02.420Z] ERROR [persister] failed to open whisper file {"path": "/data/graphite/whisper/servers/sl-live-static-content-02/df-a05/df_complex-reserved.wsp", "error": "open /data/graphite/whisper/servers/sl-live-static-content-02/df-a05/df_complex-reserved.wsp: too many open files"}

From our metrics dashboards (which include an external Datadog) we could not see the memory usage spike, so we think it happened extremely suddenly.

The only evident oddity is, maybe, a visible spike in the number of goroutines just before the processes crashed (see the attached carbon_go_routines graph).

Looking at the source along the (seemingly) faulty path, we also noticed that the Unlock call on the mutex here is not deferred.
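
For illustration, here is a minimal, self-contained sketch (made-up names, not the actual carbonserver code) of why that matters: a panic between Lock and a non-deferred Unlock leaves the mutex held forever, even when the panic itself is recovered.

package main

import (
	"fmt"
	"sync"
)

// update mimics the pattern in question: Lock, work that can panic,
// then a plain (non-deferred) Unlock. If work panics, the Unlock is
// skipped and the mutex stays held, even when a caller recovers.
func update(mu *sync.Mutex, work func()) {
	mu.Lock()
	work()      // a panic here jumps over the Unlock below
	mu.Unlock() // never reached on the panic path
}

func main() {
	var mu sync.Mutex
	func() {
		defer func() { recover() }() // the handler wrapper recovers, as in our logs
		update(&mu, func() { panic("runtime error: invalid memory address") })
	}()
	fmt.Println("mutex leaked; any further Lock blocks forever")
	mu.Lock() // deadlocks: the runtime aborts with "all goroutines are asleep"
}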

Any help or support on this would be really appreciated :)!

Many thanks, Claudio Benfatto

claudio-benfatto avatar Jul 06 '17 16:07 claudio-benfatto

I tried and managed to reproduce the bug:

  • I initially set the carbonserver scan-frequency parameter to a relatively small value:

scan-frequency = "10s"
  • then I sent 5 metrics to graphite:
for i in $(seq 1 5);do echo "test.metric.$i 1 `date +%s`" | nc -w 1 localhost 20030; done;
  • accessed and then deleted them:
curl -X GET "http://localhost:10000/render?format=json&target=test.metric.*"
rm -rf /data/graphite/whisper/test/metric/*
  • sent them again:
for i in $(seq 1 5);do echo "test.metric.$i 1 `date +%s`" | nc -w 1 localhost 20030; done;
  • At this point I see a panic stack trace in the go-carbon log file:
[2017-07-07T09:44:30.779Z] ERROR [carbonserver] panic encountered {"handler": "fileListUpdated", "stack": "github.com/lomik/go-carbon/vendor/go.uber.org/zap.Stack\n\t/home/travis/gopath/src/github.com/lomik/go-carbon/_vendor/src/github.com/lomik/go-carbon/vendor/go.uber.org/zap/field.go:209\ngithub.com/lomik/go-carbon/carbonserver.(*CarbonserverListener).updateFileList.func1\n\t/home/travis/gopath/src/github.com/lomik/go-carbon/_vendor/src/github.com/lomik/go-carbon/carbonserver/carbonserver.go:340\nruntime.call32\n\t/home/travis/.gimme/versions/go1.8.1.linux.amd64/src/runtime/asm_amd64.s:514\nruntime.gopanic\n\t/home/travis/.gimme/versions/go1.8.1.linux.amd64/src/runtime/panic.go:489\nruntime.panicmem\n\t/home/travis/.gimme/versions/go1.8.1.linux.amd64/src/runtime/panic.go:63\nruntime.sigpanic\n\t/home/travis/.gimme/versions/go1.8.1.linux.amd64/src/runtime/signal_unix.go:290\ngithub.com/lomik/go-carbon/carbonserver.(*CarbonserverListener).updateFileList\n\t/home/travis/gopath/src/github.com/lomik/go-carbon/_vendor/src/github.com/lomik/go-carbon/carbonserver/carbonserver.go:418\ngithub.com/lomik/go-carbon/carbonserver.(*CarbonserverListener).fileListUpdater\n\t/home/travis/gopath/src/github.com/lomik/go-carbon/_vendor/src/github.com/lomik/go-carbon/carbonserver/carbonserver.go:331", "error": "runtime error: invalid memory address or nil pointer dereference"}
  • and if I render the metrics again:
curl -X GET "http://localhost:10000/render?format=json&target=test.metric.*"
  • this is what I get:
root@graphite2:~# lsof -p 31809 | grep CLOSE_WAIT
go-carbon 31809 root   11u  IPv6 23789142        0t0      TCP graphite2.domain.com:http-alt->graphite2.domain.com:43314 (CLOSE_WAIT)
go-carbon 31809 root   14u  IPv6 23789147        0t0      TCP graphite2.domain.com:http-alt->graphite2.domain.com:43315 (CLOSE_WAIT)
go-carbon 31809 root   15u  IPv6 23789152        0t0      TCP graphite2.domain.com:http-alt->graphite2.domain.com:43316 (CLOSE_WAIT)
go-carbon 31809 root   16u  IPv6 23789157        0t0      TCP graphite2.domain.com:http-alt->graphite2.domain.com:43317 (CLOSE_WAIT)
go-carbon 31809 root   17u  IPv6 23789162        0t0      TCP graphite2.domain.com:http-alt->graphite2.domain.com:43318 (CLOSE_WAIT)
  • did it 2 more times:
root@graphite2:~# lsof -p 31809 | grep CLOSE_WAIT | wc -l
10

root@graphite2:~# lsof -p 31809 | grep CLOSE_WAIT | wc -l
15
  • and finally, after enabling pprof, I looked at the active goroutines:
goroutine profile: total 52
15 @ 0x42e25a 0x42e33e 0x43ebb1 0x43e894 0x46e2ad 0x46e4a4 0x6f94e8 0x6f83c4 0x6f85c6 0x814654 0x7e8bad 0x7eb926 0x6e5c70 0x45b381
#   0x43e893    sync.runtime_SemacquireMutex+0x33                           /home/travis/.gimme/versions/go1.8.1.linux.amd64/src/runtime/sema.go:62
#   0x46e2ac    sync.(*Mutex).Lock+0x9c                                 /home/travis/.gimme/versions/go1.8.1.linux.amd64/src/sync/mutex.go:87
#   0x46e4a3    sync.(*Once).Do+0x43                                    /home/travis/.gimme/versions/go1.8.1.linux.amd64/src/sync/once.go:40
#   0x6f94e7    github.com/lomik/go-carbon/cache.(*WriteoutQueue).makeRebuildCallback.func1+0xa7    /home/travis/gopath/src/github.com/lomik/go-carbon/_vendor/src/github.com/lomik/go-carbon/cache/writeout_queue.go:62
#   0x6f83c3    github.com/lomik/go-carbon/cache.(*WriteoutQueue).get+0x1a3             /home/travis/gopath/src/github.com/lomik/go-carbon/_vendor/src/github.com/lomik/go-carbon/cache/writeout_queue.go:101
#   0x6f85c5    github.com/lomik/go-carbon/cache.(*WriteoutQueue).GetNotConfirmed+0x65          /home/travis/gopath/src/github.com/lomik/go-carbon/_vendor/src/github.com/lomik/go-carbon/cache/writeout_queue.go:117
#   0x814653    github.com/lomik/go-carbon/cache.(*WriteoutQueue).GetNotConfirmed-fm+0x33       /home/travis/gopath/src/github.com/lomik/go-carbon/_vendor/src/github.com/lomik/go-carbon/carbon/app.go:220
#   0x7e8bac    github.com/lomik/go-carbon/persister.(*Whisper).worker+0xec             /home/travis/gopath/src/github.com/lomik/go-carbon/_vendor/src/github.com/lomik/go-carbon/persister/whisper.go:218
#   0x7eb925    github.com/lomik/go-carbon/persister.(*Whisper).Start.func1.1+0x45          /home/travis/gopath/src/github.com/lomik/go-carbon/_vendor/src/github.com/lomik/go-carbon/persister/whisper.go:272
#   0x6e5c6f    github.com/lomik/go-carbon/helper.(*Stoppable).StartFunc.func1.1+0x2f           /home/travis/gopath/src/github.com/lomik/go-carbon/_vendor/src/github.com/lomik/go-carbon/helper/stoppable.go:42
15 @ 0x42e25a 0x42e33e 0x43ebb1 0x43e894 0x46e2ad 0x7c9fbb 0x7d6a00 0x7e1948 0x66e834 0x7e00e2 0x783a8d 0x783bd0 0x66e834 0x66fc70 0x77dadd 0x66e834 0x6710f2 0x66d3a2 0x45b381
#   0x43e893    sync.runtime_SemacquireMutex+0x33                                               /home/travis/.gimme/versions/go1.8.1.linux.amd64/src/runtime/sema.go:62
#   0x46e2ac    sync.(*Mutex).Lock+0x9c                                                     /home/travis/.gimme/versions/go1.8.1.linux.amd64/src/sync/mutex.go:87
#   0x7c9fba    github.com/lomik/go-carbon/carbonserver.(*CarbonserverListener).UpdateMetricsAccessTimes+0x7a                   /home/travis/gopath/src/github.com/lomik/go-carbon/_vendor/src/github.com/lomik/go-carbon/carbonserver/carbonserver.go:309
#   0x7d69ff    github.com/lomik/go-carbon/carbonserver.(*CarbonserverListener).renderHandler+0x1d7f                        /home/travis/gopath/src/github.com/lomik/go-carbon/_vendor/src/github.com/lomik/go-carbon/carbonserver/carbonserver.go:1041
#   0x7e1947    github.com/lomik/go-carbon/carbonserver.(*CarbonserverListener).(github.com/lomik/go-carbon/carbonserver.renderHandler)-fm+0x47 /home/travis/gopath/src/github.com/lomik/go-carbon/_vendor/src/github.com/lomik/go-carbon/carbonserver/carbonserver.go:1547
#   0x66e833    net/http.HandlerFunc.ServeHTTP+0x43                                             /home/travis/.gimme/versions/go1.8.1.linux.amd64/src/net/http/server.go:1942
#   0x7e00e1    github.com/lomik/go-carbon/carbonserver.TraceHandler.func1+0x251                                /home/travis/gopath/src/github.com/lomik/go-carbon/_vendor/src/github.com/lomik/go-carbon/carbonserver/carbonserver.go:140
#   0x783a8c    github.com/lomik/go-carbon/vendor/github.com/dgryski/httputil.TimeHandler.func1+0x7c                        /home/travis/gopath/src/github.com/lomik/go-carbon/_vendor/src/github.com/lomik/go-carbon/vendor/github.com/dgryski/httputil/times.go:26
#   0x783bcf    github.com/lomik/go-carbon/vendor/github.com/dgryski/httputil.TrackConnections.func1+0xef                   /home/travis/gopath/src/github.com/lomik/go-carbon/_vendor/src/github.com/lomik/go-carbon/vendor/github.com/dgryski/httputil/track.go:40
#   0x66e833    net/http.HandlerFunc.ServeHTTP+0x43                                             /home/travis/.gimme/versions/go1.8.1.linux.amd64/src/net/http/server.go:1942
#   0x66fc6f    net/http.(*ServeMux).ServeHTTP+0x12f                                                /home/travis/.gimme/versions/go1.8.1.linux.amd64/src/net/http/server.go:2238
#   0x77dadc    github.com/lomik/go-carbon/vendor/github.com/NYTimes/gziphandler.NewGzipLevelHandler.func1.1+0x15c              /home/travis/gopath/src/github.com/lomik/go-carbon/_vendor/src/github.com/lomik/go-carbon/vendor/github.com/NYTimes/gziphandler/gzip.go:167
#   0x66e833    net/http.HandlerFunc.ServeHTTP+0x43                                             /home/travis/.gimme/versions/go1.8.1.linux.amd64/src/net/http/server.go:1942
#   0x6710f1    net/http.serverHandler.ServeHTTP+0x91                                               /home/travis/.gimme/versions/go1.8.1.linux.amd64/src/net/http/server.go:2568
#   0x66d3a1    net/http.(*conn).serve+0x611                                                    /home/travis/.gimme/versions/go1.8.1.linux.amd64/src/net/http/server.go:1825

I hope this helps in fixing the issue ;)

claudio-benfatto avatar Jul 07 '17 13:07 claudio-benfatto

Managed to reproduce this with 0.10.1 too (same procedure)

claudio-benfatto avatar Jul 07 '17 13:07 claudio-benfatto

I tried this change: https://gist.github.com/claudio-benfatto/a443ae2452b97ca6d661fe8c81936248 and it helped fix the bug caused by the panic in combination with the unreleased lock.

While this mitigates the effects of the panic itself, it does not solve the root cause, which I guess is attributable to some unsynchronized memory references. The references seem to be updated only by a subsequent render operation.
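
For reference, the essence of the change is a deferred unlock around the critical section (a sketch of the pattern with simplified stand-in types, not the exact gist contents):

package main

import "sync"

// Hypothetical, simplified types standing in for the carbonserver file index.
type metricDetails struct{ RdTime int64 }

type fileIndex struct {
	sync.Mutex
	accessedMetrics map[string]struct{}
	details         map[string]*metricDetails
}

// Release the lock via defer, so a panic inside the critical section
// can no longer leave other goroutines blocked on Lock forever.
func copyAccessTimes(fidx *fileIndex, details map[string]*metricDetails) {
	if fidx == nil {
		return
	}
	fidx.Lock()
	defer fidx.Unlock() // runs even if the loop below panics

	for m := range fidx.accessedMetrics {
		if d, ok := details[m]; ok {
			d.RdTime = fidx.details[m].RdTime // can still panic, but the lock is released
		}
	}
}

func main() {
	fidx := &fileIndex{
		accessedMetrics: map[string]struct{}{"test.metric.1": {}},
		details:         map[string]*metricDetails{"test.metric.1": {RdTime: 1}},
	}
	copyAccessTimes(fidx, map[string]*metricDetails{"test.metric.1": {}})
}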

claudio-benfatto avatar Jul 07 '17 14:07 claudio-benfatto

Thanks for the detailed report! @Civil, FYI

lomik avatar Jul 07 '17 18:07 lomik

I'll have a look at that on Monday or so.

Civil avatar Jul 07 '17 19:07 Civil

Can you please check (and post the trace) whether you can reproduce this with the master branch?

https://github.com/lomik/go-carbon/commit/c539e3466bf3124b56ce548aeb7bd5509b25cffa - this commit changed the logic quite a bit, and honestly I can't reproduce this even on 0.10.1.

Civil avatar Jul 10 '17 15:07 Civil

Hello @Civil, I confirm that it doesn't seem to happen when using the master branch.

It definitely happens in v0.10.1, and I think this is expected, given that the logic didn't really change much with respect to v0.10.0.

The reason why this happens is that at this point: https://github.com/lomik/go-carbon/blob/v0.10.1/carbonserver/carbonserver.go#L410-L421

	if fidx != nil {
		fidx.Lock()
		for m := range fidx.accessedMetrics {
			accessedMetrics[m] = struct{}{}
		}

		for m := range fidx.accessedMetrics {
			if d, ok := details[m]; ok {
				d.RdTime = fidx.details[m].RdTime
			}
		}
		fidx.Unlock()
	}

if you follow the steps I posted, the details map will contain the re-added metrics, while fidx won't.

And this is because the fidx index was updated during the previous iteration (https://github.com/lomik/go-carbon/blob/v0.10.1/carbonserver/carbonserver.go#L425-L432), when the files were still absent because they had been deleted.

So a panic is generated here: https://github.com/lomik/go-carbon/blob/v0.10.1/carbonserver/carbonserver.go#L418 when you access fidx.details[m].
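
A standalone illustration of that failure mode (simplified stand-in types): a Go map with pointer values returns nil for a missing key, and dereferencing it produces exactly the "invalid memory address or nil pointer dereference" from the stack trace.

package main

type metricDetails struct{ RdTime int64 }

func main() {
	// fidx.details as rebuilt while the files were deleted: the metric is gone.
	details := map[string]*metricDetails{}
	// accessedMetrics still remembers the deleted-and-re-added metric.
	accessed := map[string]struct{}{"test.metric.1": {}}

	for m := range accessed {
		_ = details[m].RdTime // details[m] is nil -> runtime panic, as reported
	}
}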

Anyhow, IMHO a defer for that lock would be good, defensively speaking, as a bug introduced in that area of the code would most likely crash the servers (and the whole cluster).

claudio-benfatto avatar Jul 10 '17 21:07 claudio-benfatto

In case of a panic, carbonserver should actually stop working; that's saner than trying to keep running in spite of it.

And again, if the panic happens there, then the bug is somewhere in how accessedMetrics is populated, because it should contain only keys that are present in fidx.details.

As for this lock: in the current code it shouldn't be deferred, because that might lead to a double unlock in case of failure in some other part of this function. It could be deferred if I move that loop into its own function, but I'm not sure it's worth it.
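
Roughly what I mean, as a hypothetical sketch of the current function shape:

package main

import "sync"

// Why naively adding a defer to the current function shape is risky
// (hypothetical code): the function already unlocks part-way through,
// before the long index rebuild, so a defer on top of that existing
// Unlock releases the mutex twice.
func updateFileList(mu *sync.Mutex) {
	mu.Lock()
	defer mu.Unlock() // naively added defer

	// ... copy access times under the lock ...
	mu.Unlock() // existing early unlock before the slow rebuild

	// ... long rebuild without holding the lock ...
} // the deferred Unlock fires here: "sync: unlock of unlocked mutex"

func main() {
	var mu sync.Mutex
	updateFileList(&mu)
}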

Civil avatar Jul 10 '17 22:07 Civil

I have only superficial knowledge of the carbonserver code, so I won't discuss the points you highlighted any further.

However, I want to remark that a panic in that section, given that the goroutines are not bounded, brings the entire server to resource starvation. This means that both go-carbon and carbonserver (which are co-located) stop working, breaking not only the carbonapi API side but also metric ingestion.
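
That mechanism is easy to demonstrate in isolation (a toy sketch, not go-carbon code): once the mutex is leaked, every subsequent request goroutine blocks on Lock forever, just like the renderHandler goroutines stuck in UpdateMetricsAccessTimes in the profile I posted earlier.

package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex
	mu.Lock() // simulate the lock leaked by the panic: never released

	// Each "request" blocks forever on Lock and holds its resources.
	for i := 0; i < 1000; i++ {
		go func() {
			mu.Lock() // blocks forever
			defer mu.Unlock()
		}()
	}

	time.Sleep(time.Second)
	fmt.Println("goroutines:", runtime.NumGoroutine()) // ~1001, and it never shrinks
}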

claudio-benfatto avatar Jul 10 '17 22:07 claudio-benfatto

Hi @Civil, I am a colleague of Claudio. About this issue: one of the collateral effects we saw was that the persister threads of go-carbon stop working because the process runs out of file descriptors (our hard limit is 4096, which was more than enough until we hit this issue):

[2017-07-06T10:46:02.421Z] ERROR [persister] failed to open whisper file {"path": "/data/graphite/whisper/sprcom/servers/qa/rabbitmq_mothership-qa/queues-EBooksListenerMDB/rabbitmq_details-messagesrate.wsp", "error": "open /data/graphite/whisper/sprcom/servers/qa/rabbitmq_mothership-qa/queues-EBooksListenerMDB/rabbitmq_details-messages/rate.wsp: too many open files"}

The second effect is that the go-carbon process does not die and keeps throwing those messages into the logs each time it cannot open a file (also filling up /var).

The point is, even if carbonserver panics and stops working cleanly, the persister part is still running, so the process is not working properly: it is not doing what it should, and the only way to recover is to restart the process. In my opinion, your approach of letting carbonserver panic is fine if it runs in its own process, which then crashes and exits (with an error code). But as the persister part (go-carbon) runs in the same process, both parts of the service (persister and API) are affected.
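
For completeness, the limit in play can be inspected from Go itself; this is an illustration only (Linux), not something go-carbon currently logs as far as I know:

package main

import (
	"fmt"
	"syscall"
)

// Read the per-process open-file limit: the budget that the blocked
// render connections plus whisper files exhausted here, producing
// "too many open files".
func main() {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		panic(err)
	}
	fmt.Printf("soft fd limit: %d, hard fd limit: %d\n", lim.Cur, lim.Max)
}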

jriguera avatar Jul 11 '17 08:07 jriguera

We have the panic issue with carbonserver enabled as well.

maggnus avatar Jul 11 '17 10:07 maggnus

@maggnus the same questions:

  1. Is it still happening with go-carbon from the master branch?
  2. If it happens, what's the error message and stack trace for the panic?

@jriguera well, I think that's a very good question: how to behave in such situations. My opinion is that we should shut down go-carbon in that case; it is impossible to reliably and correctly handle all the cases.

For example, in this case we end up with fidx in some unknown state, which I don't think it is possible to reliably recover from.
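
In code, that policy would look roughly like this (a hypothetical sketch, not current go-carbon behaviour): log the panic with its stack, then exit non-zero so a supervisor restarts the process with a clean state.

package main

import (
	"log"
	"os"
	"runtime/debug"
)

func fileListUpdaterLoop() {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("carbonserver panic: %v\n%s", r, debug.Stack())
			os.Exit(1) // fail fast: better than limping along with fidx in an unknown state
		}
	}()
	panic("simulated: index left in an unknown state")
}

func main() { fileListUpdaterLoop() }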

Civil avatar Jul 11 '17 11:07 Civil

@Civil

  1. yes
# /opt/go-carbon/bin/go-carbon --version
0.10.1
  2. I did not see any stack trace messages, just the OOM killer notification:
Out of memory: Kill process 18913 (go-carbon) score 266 or sacrifice child
  18.18%  go-carbon  go-carbon          [.] runtime.greyobject
  11.47%  go-carbon  go-carbon          [.] runtime.heapBitsForObject
   7.95%  go-carbon  go-carbon          [.] strings.genSplit
   5.59%  go-carbon  go-carbon          [.] runtime.mallocgc
   4.99%  go-carbon  [vdso]             [.] __vdso_clock_gettime
   4.27%  go-carbon  go-carbon          [.] runtime.scanobject
   3.67%  go-carbon  go-carbon          [.] runtime.mapaccess2_faststr
   2.81%  go-carbon  go-carbon          [.] github.com/lomik/go-carbon/cache.fnv32
   1.87%  go-carbon  go-carbon          [.] sync.(*RWMutex).Unlock
   1.81%  go-carbon  go-carbon          [.] runtime.memeqbody
   1.51%  go-carbon  go-carbon          [.] runtime.memmove
   1.48%  go-carbon  go-carbon          [.] runtime.memclrNoHeapPointers
   1.43%  go-carbon  go-carbon          [.] strconv.readFloat
   1.36%  go-carbon  go-carbon          [.] runtime.heapBitsSetType
   1.20%  go-carbon  go-carbon          [.] runtime.unlock
   1.15%  go-carbon  go-carbon          [.] github.com/lomik/go-carbon/cache.(*Cache).Add
   0.92%  go-carbon  go-carbon          [.] runtime.indexbytebody
   0.91%  go-carbon  go-carbon          [.] runtime.lock

maggnus avatar Jul 11 '17 16:07 maggnus

@Civil Looks like trigram-index = false helped us.
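
For reference, we set it in the [carbonserver] section of the go-carbon config (minimal excerpt; my assumption is that the trigram index trades extra memory for faster glob queries):

[carbonserver]
# assumption: the trigram index speeds up glob queries at the cost of memory
trigram-index = false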

maggnus avatar Jul 13 '17 03:07 maggnus

@maggnus first of all, that still seems to be a separate issue, and secondly, that's still not enough info. What kind of HW/VM do you have? How many metrics? What's the config file? etc.

Civil avatar Jul 13 '17 07:07 Civil