Leaks memory

Open mmimica opened this issue 10 years ago • 28 comments

Relay seems to be leaking memory. It consumes all 2 GB of RAM we gave it until the OOM killer kicks in. The routing rules are pretty simple:

init = [
     "addBlack .net",
     "addRoute sendAllMatch core prefix=org.company  12.56.176.41:2014 spool=false pickle=true  12.56.176.42:2014 spool=false pickle=true",
     "addRoute sendAllMatch other prefix=stats  12.56.176.51:2014 spool=false pickle=true  12.56.176.52:2014 spool=false pickle=true"
]

mmimica avatar Dec 19 '14 09:12 mmimica

I'm also seeing this issue.

instance = "default"
listen_addr = "0.0.0.0:2103"
admin_addr = "0.0.0.0:2104"
http_addr = "0.0.0.0:8089"

spool_dir = "/log/graphite/spool"
log_level = "error"

init = [ "addRoute sendAllMatch carbonrelay-cluster 10.198.34.168:2013 spool=false pickle=false 10.198.34.169:2013 spool=false pickle=false 10.198.35.228:2013 spool=false pickle=false 10.199.223.31:10005 spool=false pickle=false" ]

[instrumentation]
graphite_addr = "localhost:2103"
graphite_interval = 15000

Slowly eats memory over time.

ghost avatar Jan 22 '15 17:01 ghost

It appears that my configuration was producing lots of destination metric drops. I tracked this down to my last destination (10005), an InfluxDB destination that was having issues. The carbon-relay-ng metrics showed lots of errors and drops. Once I removed the InfluxDB destination, we've been stable. I don't think this is an issue with InfluxDB being a destination as such; rather, it's an issue with the performance of my InfluxDB backend. The back-up in processing seems to have been negatively impacting carbon-relay-ng.
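
To illustrate the kind of back-up I mean, here's a generic Go sketch of drop-on-full behaviour in front of a slow destination (this is not the relay's actual code; all names are made up):

    package main

    import "fmt"

    // enqueue buffers a metric for a destination writer, dropping it
    // (and counting the drop) when the bounded buffer is already full,
    // instead of blocking the whole relay behind a slow backend.
    func enqueue(buf chan []byte, metric []byte, drops *int) {
        select {
        case buf <- metric:
            // buffered; the destination writer will drain this
        default:
            *drops++ // destination is backed up: drop and count
        }
    }

    func main() {
        buf := make(chan []byte, 2) // tiny buffer to force drops
        drops := 0
        for i := 0; i < 5; i++ {
            enqueue(buf, []byte(fmt.Sprintf("metric %d", i)), &drops)
        }
        fmt.Println("dropped:", drops) // prints: dropped: 3
    }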

ghost avatar Feb 10 '15 17:02 ghost

Any news on this?

willejs avatar Feb 24 '15 23:02 willejs

weird. i have a 2GB machine and never even needed to look at memory. never been an issue for me. any of you interested in doing some memory profiling? this can tell you where a lot of memory is being allocated. see http://blog.golang.org/profiling-go-programs
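
if it helps, the usual way in go is to expose the net/http/pprof handlers on an http listener and pull a live heap profile from it. generic example (the port is arbitrary, and this is not necessarily how carbon-relay-ng wires it up):

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* on the default mux
    )

    func main() {
        // then inspect with: go tool pprof http://localhost:6060/debug/pprof/heap
        log.Fatal(http.ListenAndServe("localhost:6060", nil))
    }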

Dieterbe avatar Mar 30 '15 20:03 Dieterbe

I'm also getting OOM kills. I'm also pushing to InfluxDB. Here's a Grafana graph of memory usage (sorry for the missing units; the graph is ~2 weeks wide and 7.5 GB tall). I can try to do some memory profiling soon.

[graph: memory usage]

shanemhansen avatar May 04 '15 16:05 shanemhansen

what's the green vs the yellow line? (which is memory and what's the other?)

Dieterbe avatar May 04 '15 17:05 Dieterbe

Sorry. They are both memory. The yellow line is memory usage on a different server running carbon-relay-ng at a lower throughput.

The green line does represent total memory usage on a server running carbon-relay-ng and some other processes, but that memory usage is normally pretty constant; I think the graph shows memory growth up to the point where the kernel OOM-kills the process. I also have OOM messages in dmesg.

I have a core file I'm analyzing too.

shanemhansen avatar May 04 '15 17:05 shanemhansen

My core file doesn't show anything particularly interesting. Approximately 90% of the core file is NULLs, which would make sense if we had lots of structures initialized to their zero value and never used. I'll make a build with some memory allocation debugging next.

shanemhansen avatar May 04 '15 20:05 shanemhansen

I'm seeing the same issue. Here's a memory graph of two servers, both running carbon-relay-ng with an identical config. carbon-relay-ng was installed between May 28-29. Decreases in cached memory are server reboots.

[graph: carbon-relay-ng memory leak]

The difference between both servers is that the first one is getting all the incoming carbon-relay-ng traffic right now.

The config is pretty trivial. This is running on 3 servers, each also running carbon-cache. Clients can talk to carbon-relay-ng on any of the 3 servers, which then forwards the metrics to all 3.

instance = "default"
listen_addr = "192.168.80.11:2103"
admin_addr = "127.0.0.1:2004"
http_addr = "127.0.0.1:8081"
spool_dir = "/var/spool/carbon-relay-ng"
log_level = "notice"
bad_metrics_max_age = "24h"

init = [
     'addBlack collectd.localhost',
     'addRoute sendAllMatch mon-carbons  192.168.80.11:2004 spool=true pickle=true  192.168.80.12:2004 spool=true pickle=true  192.168.80.13:2004 spool=true pickle=true',
]

[instrumentation]
graphite_addr = ""
graphite_interval = 1000

It hasn't OOMed yet, though it looks like it will in a few days at this rate. I'll take a shot at profiling it.

marcan avatar Jun 02 '15 06:06 marcan

Well, looks like carbon-relay-ng already imports pprof, so that was easy.

(pprof) top10
5.41GB of 5.45GB total (99.16%)
Dropped 109 nodes (cum <= 0.03GB)
      flat  flat%   sum%        cum   cum%
    5.41GB 99.16% 99.16%     5.41GB 99.16%  github.com/graphite-ng/carbon-relay-ng/_third_party/github.com/Dieterbe/go-metrics.(*WindowSample).Update
         0     0% 99.16%     5.41GB 99.16%  github.com/graphite-ng/carbon-relay-ng/_third_party/github.com/Dieterbe/go-metrics.(*StandardHistogram).Update
         0     0% 99.16%     5.40GB 99.01%  github.com/graphite-ng/carbon-relay-ng/_third_party/github.com/Dieterbe/go-metrics.(*StandardTimer).Update
         0     0% 99.16%     5.40GB 99.09%  main.(*Conn).HandleData
         0     0% 99.16%     5.45GB   100%  runtime.goexit

That would be where all the memory is going. Sounds like this is broken in metrics_wrapper.go:

    //histogram: NewHistogram(NewExpDecaySample(1028, 0.015)),
    histogram := metrics.NewHistogram(metrics.NewWindowSample())

WindowSample just accumulates values forever, never freeing them, until it is explicitly cleared. ExpDecaySample is limited to 1028 samples, in this case. I'm changing both instances of WindowSample to ExpDecaySample in metrics_wrapper.go and seeing how that goes over the next few days.
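
Concretely, the change I'm trying in metrics_wrapper.go is roughly this (sketch; the 1028/0.015 parameters come from the commented-out line above, and metrics is the vendored go-metrics package):

    // before: an unbounded window that only shrinks when explicitly cleared
    // histogram := metrics.NewHistogram(metrics.NewWindowSample())

    // after: a reservoir capped at 1028 samples with exponential decay,
    // so memory stays bounded even if the metric is never flushed
    histogram := metrics.NewHistogram(metrics.NewExpDecaySample(1028, 0.015))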

marcan avatar Jun 02 '15 06:06 marcan

Hi marcan,

I haven't dug through the code but just one thing.... you aren't sending internal metrics from carbon-relay-ng anywhere....

[instrumentation]
graphite_addr = ""
graphite_interval = 1000

Maybe if you were it would be flushing the WindowSamples?

TransactCharlie avatar Jun 05 '15 08:06 TransactCharlie

That's possible, though I'm not currently interested in the instrumentation and just left that section untouched from the sample config file :-)

marcan avatar Jun 11 '15 12:06 marcan

hmm i have to check the go-metrics code again. it is possible that certain metric types need to be collected for them to be aggregated or for their datastructures to be trimmed/reset.

Dieterbe avatar Jun 18 '15 22:06 Dieterbe

btw i created a little lib that automatically creates a memory profile when a certain memory usage is reached. see https://github.com/Dieterbe/profiletrigger anyone feel like building this in and PR'ing? :) it should help all of you in tracking this down.

Dieterbe avatar Jun 25 '15 19:06 Dieterbe

Any news about this issue? I am running into it right now.

Is it linked to a bad 'instrumentation' section?

olivierHa avatar Feb 01 '16 15:02 olivierHa

@Dieterbe I did the same as @marcan (changing metrics.NewWindowSample() to metrics.NewExpDecaySample(1028, 0.015)) and carbon-relay-ng stopped leaking memory. I can make a PR with that, but I lack an understanding of how the metrics histogram should work in this case.

@olivierHa nope, it has nothing to do with the instrumentation section.

ptqa avatar Feb 08 '16 09:02 ptqa

On my side, it was the instrumentation :( If you put a "wrong" or empty address for flushing the internal metrics, I guess the metrics just keep stacking up in memory until the OOM occurs.

olivierHa avatar Feb 08 '16 09:02 olivierHa

I see. Well, in my case there is no instrumentation section at all and it's leaking.

ptqa avatar Feb 08 '16 12:02 ptqa

In my case, the OOMs stopped once the instrumentation section was configured.

pkoraca avatar Feb 08 '16 13:02 pkoraca

So I've got it set up to flush instrumentation, and I also tweaked it as suggested above to use NewExpDecaySample, but neither made any difference. It constantly receives a low level of incoming metrics (5-15/s) in my test configuration, and it eventually gets killed by the kernel.

I tried to get it to use the referenced profiletrigger, but I haven't had any luck (nothing ever gets written). Basically I did:

    if *memprofile != "" {
        log.Warning("Triggering memory profiling at 1GB to '%s'\n", *memprofile)
        errors := make(chan error)
        trigger, _ := heap.New(*memprofile, 1000000000, 60, time.Duration(1)*time.Second, errors)
        go trigger.Run()
        go func() {
            for e := range errors {
                log.Fatal("profiletrigger heap saw error:", e)
            }
        }()
        log.Warning("Started memprofile error thread")
    }

I don't know too much about Go, so I may be making a simple mistake.

rlodge avatar Jul 29 '16 21:07 rlodge

the profiletrigger should be unrelated to the memprofile flag, so instead of using *memprofile you should use a path to a writeable directory. also make sure your threshold makes sense for your system, i believe it's in bytes.
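
e.g. something like this (untested sketch, same heap.New signature as in your snippet, just with a directory path and a byte threshold; the path here is only an example):

    // watch heap usage and write a profile into the directory once it
    // crosses ~1GB; errors from the trigger are reported on the channel
    errors := make(chan error)
    trigger, err := heap.New("/var/tmp/carbon-relay-ng-profiles", 1000000000, 60, time.Second, errors)
    if err != nil {
        log.Fatal(err)
    }
    go trigger.Run()
    go func() {
        for e := range errors {
            log.Fatal("profiletrigger heap saw error:", e)
        }
    }()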

Dieterbe avatar Aug 01 '16 07:08 Dieterbe

Ah, wasn't clear to me it was supposed to be a directory.

Here's what I get, from a couple of profiles about 3 minutes apart. My threshold is low at the moment to make it happen quickly, but it will use up pretty much all the system's memory in a few hours. I've got an older version of the program running on a different system entirely (different environment, os), handling about 10K metrics/second, and it hasn't been restarted since 6/10. I'm testing this system, and it's only receiving 6-10 metrics/second, but it will run out of memory and the process manager will restart the relay service in 4-6 hours.

go tool pprof /usr/local/bin/carbon-relay-ng /appdata/carbon-relay-ng-profiles/1470066792.profile-heap
(pprof) top 10
821.70MB of 822.21MB total (99.94%)
Dropped 54 nodes (cum <= 4.11MB)
      flat  flat%   sum%        cum   cum%
  478.75MB 58.23% 58.23%   478.75MB 58.23%  main.(*keepSafe).keepClean
  219.14MB 26.65% 84.88%   219.14MB 26.65%  main.NewWriter
  103.06MB 12.53% 97.42%   336.58MB 40.94%  main.NewConn
   13.88MB  1.69% 99.10%    13.88MB  1.69%  main.NewKeepSafe
    6.87MB  0.84% 99.94%     6.87MB  0.84%  github.com/graphite-ng/carbon-relay-ng/badmetrics.New
         0     0% 99.94%   336.58MB 40.94%  main.(*Destination).updateConn
         0     0% 99.94%     6.87MB  0.84%  main.main
         0     0% 99.94%   822.21MB   100%  runtime.goexit
         0     0% 99.94%     6.87MB  0.84%  runtime.main
(pprof) top10 -cum
821.70MB of 822.21MB total (99.94%)
Dropped 54 nodes (cum <= 4.11MB)
      flat  flat%   sum%        cum   cum%
         0     0%     0%   822.21MB   100%  runtime.goexit
  478.75MB 58.23% 58.23%   478.75MB 58.23%  main.(*keepSafe).keepClean
         0     0% 58.23%   336.58MB 40.94%  main.(*Destination).updateConn
  103.06MB 12.53% 70.76%   336.58MB 40.94%  main.NewConn
  219.14MB 26.65% 97.42%   219.14MB 26.65%  main.NewWriter
   13.88MB  1.69% 99.10%    13.88MB  1.69%  main.NewKeepSafe
    6.87MB  0.84% 99.94%     6.87MB  0.84%  github.com/graphite-ng/carbon-relay-ng/badmetrics.New
         0     0% 99.94%     6.87MB  0.84%  main.main
         0     0% 99.94%     6.87MB  0.84%  runtime.main

go tool pprof /usr/local/bin/carbon-relay-ng /appdata/carbon-relay-ng-profiles/1470066982.profile-heap
Entering interactive mode (type "help" for commands)
(pprof) top 10
849.39MB of 849.90MB total (99.94%)
Dropped 58 nodes (cum <= 4.25MB)
      flat  flat%   sum%        cum   cum%
  508.82MB 59.87% 59.87%   508.82MB 59.87%  main.(*keepSafe).keepClean
  226.97MB 26.71% 86.57%   226.97MB 26.71%  main.NewWriter
  106.74MB 12.56% 99.13%   334.21MB 39.32%  main.NewConn
    6.87MB  0.81% 99.94%     6.87MB  0.81%  github.com/graphite-ng/carbon-relay-ng/badmetrics.New
         0     0% 99.94%   334.21MB 39.32%  main.(*Destination).updateConn
         0     0% 99.94%     6.87MB  0.81%  main.main
         0     0% 99.94%   849.90MB   100%  runtime.goexit
         0     0% 99.94%     6.87MB  0.81%  runtime.main
(pprof) top10 -cum
849.39MB of 849.90MB total (99.94%)
Dropped 58 nodes (cum <= 4.25MB)
      flat  flat%   sum%        cum   cum%
         0     0%     0%   849.90MB   100%  runtime.goexit
  508.82MB 59.87% 59.87%   508.82MB 59.87%  main.(*keepSafe).keepClean
         0     0% 59.87%   334.21MB 39.32%  main.(*Destination).updateConn
  106.74MB 12.56% 72.43%   334.21MB 39.32%  main.NewConn
  226.97MB 26.71% 99.13%   226.97MB 26.71%  main.NewWriter
    6.87MB  0.81% 99.94%     6.87MB  0.81%  github.com/graphite-ng/carbon-relay-ng/badmetrics.New
         0     0% 99.94%     6.87MB  0.81%  main.main
         0     0% 99.94%     6.87MB  0.81%  runtime.main

rlodge avatar Aug 01 '16 16:08 rlodge

the profile above is not very conclusive, but i suspect it was taken when carbon-relay-ng hadn't run long enough.

From the reports from various people in this ticket, as well as the code, it's clear that the problem is carbon-relay-ng's internal metrics piling up when it's not set up to send them anywhere. we need to do two things:

A) make this issue extra clear in the readme and sample configs, and set up the default config to work around it, because too many folks are tripping over it. I'll do this now.
B) actually solve it, maybe by using a different library, or by flushing the metrics to /dev/null if graphite_addr is empty. Unfortunately i have no time to do this in the short term.
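
until then, the workaround is to point [instrumentation] at something that actually accepts the metrics, e.g. (the address here is just an example, adjust to your setup):

    [instrumentation]
    # flush internal metrics to a real carbon listener so they don't pile up
    graphite_addr = "localhost:2003"
    graphite_interval = 1000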

Dieterbe avatar Mar 23 '17 10:03 Dieterbe

@Dieterbe is there a way to discard these internal metrics?

scalp42 avatar Aug 02 '19 07:08 scalp42

i don't have time right now to dig into the code (anyone else, feel free to do so). from what i recall no / not with our current implementation

Dieterbe avatar Aug 02 '19 14:08 Dieterbe

Hey, can we add a flag to just disable the instrumentation if it's not needed? If you can direct me to the relevant parts of the code so I can better understand the situation, I'll be happy to try to contribute a fix. :)

shlok007 avatar Feb 07 '20 03:02 shlok007

@shlok007 , did you make any progress on this?

james-woods-mnf avatar Apr 04 '20 21:04 james-woods-mnf

Seems that we start memory profiling before we know we even want a memory profiler.

https://github.com/grafana/carbon-relay-ng/blob/1447b5c07d832958b159d23907ceda66074c2017/cmd/carbon-relay-ng/carbon-relay-ng.go#L167-L178

I'm going to take a poke at a PR for this.
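
Roughly, the idea is to only set the profiler up once we know the flag was actually passed. A simplified sketch (not the actual code at the link above; it uses the standard "os" and "runtime/pprof" packages):

    // only wire up heap profiling when -memprofile was given
    if *memprofile != "" {
        f, err := os.Create(*memprofile)
        if err != nil {
            log.Fatal(err)
        }
        defer func() {
            pprof.WriteHeapProfile(f) // dump the heap profile on shutdown
            f.Close()
        }()
    }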

robert-c-ewing avatar Mar 04 '24 14:03 robert-c-ewing