[BUG] Go-carbon OOMing on systems with 128 gigs of RAM
Describe the bug
Go-carbon 0.16.2 (a local fork, including only the 404_ok PR that was recently merged into official master) is OOMing on systems with 128 gigabytes of RAM. This happens while carbonserver is rebuilding the metric list index, typically impacting 1-2 systems from a 50-node cluster that is taking reads. OOMs do not happen in the backup cluster (still 50 nodes) that is not taking reads, despite it only having 64 gigs of RAM!
A recent OOM from dmesg: OOM.txt
Shortly after an OOM, trying to rebuild the index:
ps aux | grep go-carbon
_graphite 22186 783 63.0 87037500 82202968 ? Ssl 14:31 146:35 /usr/bin/go-carbon -config /etc/carbon/carbon.conf
Heap from pprof
Logs
carbon.info.log: Nothing interesting here.
carbon.warn.log: A typical log snippet. Plenty of 30-second timeouts (as per config) during the metric list rebuild. I think the i/o timeout on 2003 is the node trying to self-report carbon data to the LB (hmm, which makes me wonder if I'm dealing with socket starvation).
[2022-03-17T15:24:23.182Z] ERROR [access] fetch failed {"handler": "render", "url": "/render/?format=pickle&local=1&noCache=1&from=1647530300&until=1647530600&target=identity_protection.%2A.rabbitmq_overview.messages&now=1647530600", "peer": "10.x.x.67:57190", "format": "pickle", "targets": ["identity_protection.*.rabbitmq_overview.messages"], "runtime_seconds": 32.261091231, "reason": "failed to read data", "http_code": 400, "error": "could not expand globs - context canceled"}
[2022-03-17T15:24:23.183Z] ERROR [access] fetch failed {"handler": "render", "url": "/render/?format=pickle&local=1&noCache=1&from=1647526700&until=1647530600&target=xxxx_xxxx.%2A.rabbitmq_overview.messages&now=1647530600", "peer": "10.x.x.67:57144", "format": "pickle", "targets": ["identity_protection.*.rabbitmq_overview.messages"], "runtime_seconds": 32.170025934, "reason": "failed to read data", "http_code": 400, "error": "could not expand globs - context canceled"}
[2022-03-17T15:24:23.261Z] ERROR [access] fetch failed {"handler": "render", "url": "/render/?format=pickle&local=1&noCache=1&from=1647530300&until=1647530600&target=xx-xxxx.%2A.gauge.sqsredriver.%2A.approx-visible.value&now=1647530600", "peer": "10.x.x.145:33558", "format": "pickle", "targets": ["xx-xxxxx.*.gauge.sqsredriver.*.approx-visible.value"], "runtime_seconds": 31.216498568, "reason": "failed to read data", "http_code": 400, "error": "could not expand globs - context canceled"}
[2022-03-17T15:24:24.647Z] ERROR [access] fetch failed {"handler": "render", "url": "/render/?format=pickle&local=1&noCache=1&from=1647529732&until=1647530632&target=datamanager.%2A.timer.datamanager.kafka.events.producer.ingress.%2A.%2A.95-percentile&now=1647530632", "peer": "10.x.x.102:38438", "format": "pickle", "targets": ["datamanager.*.timer.datamanager.kafka.events.producer.ingress.*.*.95-percentile"], "runtime_seconds": 31.05827981, "reason": "failed to read data", "http_code": 400, "error": "could not expand globs - context canceled"}
[2022-03-17T15:24:24.647Z] ERROR [access] fetch failed {"handler": "render", "url": "/render/?format=pickle&local=1&noCache=1&from=1647529702&until=1647530602&target=datamanager.%2A.timer.datamanager.kafka.events.producer.ingress.%2A.%2A.95-percentile&now=1647530602", "peer": "10.x.x.28:54460", "format": "pickle", "targets": ["datamanager.*.timer.datamanager.kafka.events.producer.ingress.*.*.95-percentile"], "runtime_seconds": 31.068586559, "reason": "failed to read data", "http_code": 400, "error": "could not expand globs - context canceled"}
[2022-03-17T15:24:25.057Z] ERROR [stat] dial failed {"endpoint": "tcp://internal-graphite-relay01-xxxxxxxx-us-xxxxx-1.elb.amazonaws.com:2003", "error": "dial tcp 10.20.12.148:2003: i/o timeout"}
[2022-03-17T15:24:25.246Z] ERROR [access] fetch failed {"handler": "render", "url": "/render/?format=pickle&local=1&noCache=1&from=1647530004&until=1647530604&target=DeviceAPI.%2A.histogram.api.devices.entities.get.time.95-percentile&now=1647530604", "peer": "10.x.x.22:35546", "format": "pickle", "targets": ["DeviceAPI.*.histogram.api.devices.entities.get.time.95-percentile"], "runtime_seconds": 29.699880841, "reason": "failed to read data", "http_code": 400, "error": "could not expand globs - context canceled"}
Go-carbon Configuration:
Metric retention and aggregation schemas
We are not doing aggregation on either the relay layer or the cache layer.
Our typical retention pattern is 60s:14d,300s:90d.
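For context, that pattern corresponds to a storage-schemas.conf entry roughly like the following (a minimal sketch; the section name and pattern are illustrative, not our actual rules):

[default]
pattern = .*
retentions = 60s:14d,300s:90d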
Additional context
Our metrics are not well-formed (yet), e.g. we are still on the old "service.ip-address.metric.name.subname" format, which the trie index is known to not handle well.
We trim most metrics to 90 days. We trim our biggest sets of metrics to 30 days.
Our gzipped metriclist_cache.gz is typically around 684,692,860 bytes. I did some uniq -c + sed magic and came up with the following unique count per service, with the service name anonymized. E.g. our biggest reporting service has 7,981,192 metric files under /graphite/whisper/servicename/*
We have 365 unique services with a highly variable host count, each with its own unique top-level service-name metrics. 56 of these services have at least 500 subdirectories (e.g. /graphite/whisper/servicename/*), with the worst being at 57,984 subdirectories.
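A rough equivalent of that counting, sketched as a find/wc loop rather than the exact uniq -c + sed pipeline I used (paths are illustrative):

# count whisper files under each top-level service directory, largest first
for d in /graphite/whisper/*/; do
  printf '%8d %s\n' "$(find "$d" -name '*.wsp' | wc -l)" "$d"
done | sort -rn | head -20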
Goroutines from pprof: goroutines.txt
Hmm, is the OOM happening during go-carbon restart? The pprof image seems to indicate it.
If it's indeed during restart, you can consider enabling this flag:
[carbonserver]
no-service-when-index-is-not-ready = true
file-list-cache = "/graphite/metriclist_cache.gz"
no-service-when-index-is-not-ready works better with file-list-cache.
uniq -c + sed magic and came up with the following unique count per service, with the service name anonymized. E.g. our biggest reporting service has 7,981,192 metric files under /graphite/whisper/servicename/*
8m unique metrics per instance is OK with the trie index. We have instances serving 10m unique metrics.
Our gzipped metriclist_cache.gz is typically around 684,692,860 bytes.
Oh, wait, 700MB of gzip. How many unique metrics are on the instance? (I think it's also because go-carbon caches dir entries in the file list cache, which can be skipped.)
I did some uniq -c + sed magic and came up with the following unique count per service, with the service name anonymized. E.g. our biggest reporting service has 7,981,192 metric files under /graphite/whisper/servicename/*
Hmm, does that mean you have queries trying to match 8m unique metrics in a single query?
no-service-when-index-is-not-ready might actually work! I was concerned that carbonserver would pass a 5xx up to graphite-web, which would break all the queries with Grafana, but that seems to not be the case.
This problem can happen to systems that have been up for a while, but once it falls into the behavior, it seems to get stuck. The system I'm looking at has 128GB of RAM (upgraded on Tuesday from 64GB) and looks like it's been struggling since the start:
root@ip-10-22-8-130:/var/log# dmesg -T | grep Out
[Tue Mar 15 16:17:04 2022] Out of memory: Killed process 891 (go-carbon) total-vm:159363300kB, anon-rss:125300180kB, file-rss:0kB, shmem-rss:0kB, UID:1343 pgtables:260056kB oom_score_adj:0
[Tue Mar 15 20:02:29 2022] Out of memory: Killed process 7021 (go-carbon) total-vm:174179616kB, anon-rss:128387696kB, file-rss:0kB, shmem-rss:0kB, UID:1343 pgtables:271248kB oom_score_adj:0
[Tue Mar 15 21:48:46 2022] Out of memory: Killed process 11188 (go-carbon) total-vm:162317480kB, anon-rss:125159960kB, file-rss:3568kB, shmem-rss:0kB, UID:1343 pgtables:261096kB oom_score_adj:0
[Wed Mar 16 04:11:36 2022] Out of memory: Killed process 457 (go-carbon) total-vm:161195268kB, anon-rss:125945856kB, file-rss:0kB, shmem-rss:0kB, UID:1343 pgtables:261064kB oom_score_adj:0
[Wed Mar 16 04:50:11 2022] Out of memory: Killed process 10578 (go-carbon) total-vm:173894384kB, anon-rss:127815288kB, file-rss:0kB, shmem-rss:0kB, UID:1343 pgtables:269796kB oom_score_adj:0
[Wed Mar 16 05:55:40 2022] Out of memory: Killed process 3007 (go-carbon) total-vm:129477516kB, anon-rss:124186752kB, file-rss:0kB, shmem-rss:0kB, UID:1343 pgtables:244652kB oom_score_adj:0
[Wed Mar 16 06:36:36 2022] Out of memory: Killed process 8596 (go-carbon) total-vm:176841836kB, anon-rss:127188316kB, file-rss:0kB, shmem-rss:0kB, UID:1343 pgtables:270688kB oom_score_adj:0
[Wed Mar 16 07:46:05 2022] Out of memory: Killed process 31056 (go-carbon) total-vm:155602632kB, anon-rss:125756616kB, file-rss:0kB, shmem-rss:0kB, UID:1343 pgtables:257660kB oom_score_adj:0
[Wed Mar 16 08:19:10 2022] Out of memory: Killed process 7120 (go-carbon) total-vm:165953636kB, anon-rss:128133128kB, file-rss:0kB, shmem-rss:0kB, UID:1343 pgtables:266620kB oom_score_adj:0
[Wed Mar 16 08:51:51 2022] Out of memory: Killed process 16282 (go-carbon) total-vm:163478184kB, anon-rss:125988776kB, file-rss:0kB, shmem-rss:0kB, UID:1343 pgtables:262616kB oom_score_adj:0
[Wed Mar 16 09:28:39 2022] Out of memory: Killed process 25332 (go-carbon) total-vm:169508304kB, anon-rss:126565932kB, file-rss:0kB, shmem-rss:0kB, UID:1343 pgtables:266044kB oom_score_adj:0
[Wed Mar 16 14:45:10 2022] Out of memory: Killed process 28837 (go-carbon) total-vm:160530988kB, anon-rss:126443980kB, file-rss:0kB, shmem-rss:0kB, UID:1343 pgtables:261224kB oom_score_adj:0
[Wed Mar 16 22:19:10 2022] Out of memory: Killed process 9348 (go-carbon) total-vm:166419248kB, anon-rss:125571556kB, file-rss:3872kB, shmem-rss:0kB, UID:1343 pgtables:264540kB oom_score_adj:0
[Wed Mar 16 23:23:39 2022] Out of memory: Killed process 352 (go-carbon) total-vm:159714576kB, anon-rss:124611044kB, file-rss:0kB, shmem-rss:0kB, UID:1343 pgtables:258740kB oom_score_adj:0
[Thu Mar 17 12:24:46 2022] Out of memory: Killed process 23152 (go-carbon) total-vm:172796660kB, anon-rss:127488288kB, file-rss:0kB, shmem-rss:0kB, UID:1343 pgtables:268928kB oom_score_adj:0
[Thu Mar 17 16:51:42 2022] Out of memory: Killed process 19998 (go-carbon) total-vm:162008564kB, anon-rss:123831604kB, file-rss:1804kB, shmem-rss:0kB, UID:1343 pgtables:259364kB oom_score_adj:0
The thing that's really throwing me here is this is one of a cluster of 50 with rf=1, a cluster that as recently as last week was running with 64 gigs of RAM. Perhaps we have a user that is hitting us with horrific queries. As mentioned, our worst service has 57 unique hosts (e.g. ls /graphite/whisper/someservice | wc -l = 57,000 !!!!)
The gut impression that I'm getting is that the find cache build is chewing up a lot of memory. That causes the system to dump its disk cache to satisfy malloc requests. Reads go crazy, queries start backing up, and the system spirals into doom.
We trim most metrics to 90 days. We trim our biggest sets of metrics to 30 days. Our typical retention pattern is 60s:14d,300s:90d.
Maybe you can consider trying out compression. It could reduce I/O a bit.
[whisper]
# specify to enable/disable compressed format (EXPERIMENTAL)
# See details and limitations in https://github.com/go-graphite/go-whisper#compressed-format
# IMPORTANT: Only one process/thread could write to compressed whisper files at a time, especially when you are
# rebalancing graphite clusters (with buckytools, for example), flock needs to be enabled both in go-carbon and your tooling.
compressed = true
no-service-when-index-is-not-ready might actually work! I was concerned that carbonserver would pass a 5xx up to graphite-web, which would break all the queries with Grafana, but that seems to not be the case.
This problem can happen to systems that have been up for a while, but once it falls into the behavior, it seems to get stuck. The system I'm looking at has 128GB of RAM (upgraded on Tuesday from 64GB) and looks like it's been struggling since the start:
We introduced this flag because we had an issue with heavy queries during restart: go-carbon falls back to filepath.Glob when the index is not ready, which can take CPU away (high memory allocation and Go GC). It only has an effect after a restart, before the first index is built.
The thing that's really throwing me here is this is one of a cluster of 50 with rf=1, a cluster that as recently as last week was running with 64 gigs of RAM. Perhaps we have a user that is hitting us with horrific queries. As mentioned, our worst service has 57 unique hosts (e.g. ls /graphite/whisper/someservice | wc -l = 57,000 !!!!)
max-metrics-rendered = 1000000
The max-metrics-rendered is a bit too high. Returning 1m metrics in one query is most likely going to take away all the memory.
I'm not seeing any crazy queries at the moment; it looks like we just get it pretty rough at times. I suspect things go sideways when we get a series of big queries right in the middle of a metric fire.

Ahh, I set that to a million based on the commented-out example in the config. What's a sane ceiling that we can set? 100k? 10k?
# Maximum metrics could be returned in render request (works both all types of
# indexes) (Default: 1,000,000)
max-metrics-rendered = 1000
Ahh, I set that to a million based on the commented-out example in the config. What's a sane ceiling that we can set? 100k? 10k?
Yep. We have a 10k ceiling on carbonapi, so 10k on go-carbon seems OK. The 1m default was originally set to avoid making the flag backward-incompatible, but maybe we should update the doc to have a recommended value there.
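For example, something like the following (a sketch; assuming the flag sits in the carbonserver section, as in the deploy example config):

[carbonserver]
# cap how many metrics a single render request may return (10k, per the suggestion above)
max-metrics-rendered = 10000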
In regards to compression, we might be able to do that now, now that we've gotten graphite-web kicked off of the caches.
I'm supposing that turning compression on (or off) will cause go-carbon to honor the existing compression state for any existing whisper file, correct? If we turn compression on and realize we don't have enough CPU to handle it, would it be easy to return to a non-compressed state?
This seems more like user error and less like a bug. I'd remove the bug label and replace it with question if I could, but it seems I lack the ability to do so.
This seems more like user error and less like a bug. I'd remove the bug label and replace it with question if I could, but it seems I lack the ability to do so.
Applied.
I'm supposing that turning compression on (or off) will cause go-carbon to honor the existing compression state for any existing whisper file, correct? If we turn compression on and realize we don't have enough CPU to handle it, would it be easy to return to a non-compressed state?
For this, it's probably better to do it in a controlled way. Currently go-carbon doesn't convert standard whisper files to compressed automatically, or vice versa; the flag only affects new files. So if you enable it and disable it again, compressed files stay compressed.
You can consider only enabling it for a subset of metrics matched by schema rules: https://github.com/go-graphite/go-carbon/blob/master/deploy/storage-schemas.conf#L9
[cwhisper_test]
pattern = service.xxx..*
retentions = 60s:30d,1h:5y
compressed = true
You can consider only enabling it for a subset of metrics matched by schema rules: https://github.com/go-graphite/go-carbon/blob/master/deploy/storage-schemas.conf#L9
Oh, just in case: if you use buckytools, it currently doesn't support mixing standard/compressed whisper files. That's the downside of using partial compression.
The thing that's really throwing me here is this is one of a cluster of 50 with rf=1, a cluster that as recently as last week was running with 64 gigs of RAM. Perhaps we have a user that is hitting us with horrific queries. As mentioned, our worst service has 57 unique hosts (e.g. ls /graphite/whisper/someservice | wc -l = 57,000 !!!!)
Yep. Maybe try out the few flags that are mentioned above and see if it improves things. (Compression probably doesn't help in this case, but it's something you can explore in the future.)
Nope. The only thing that reads and writes whisper these days is go-carbon.
Our caches in this cluster hold close to a terabyte of data each, making the movement of data rather slow and expensive. We manage our scaling by building out new clusters and waiting at least 30 days before tearing down the old clusters.
The thing that's really throwing me here is this is one of a cluster of 50 with rf=1, a cluster that as recently as last week was running with 64 gigs of RAM. Perhaps we have a user that is hitting us with horrific queries. As mentioned, our worst service has 57 unique hosts (e.g. ls /graphite/whisper/someservice | wc -l = 57,000 !!!!)
Pardon, I meant to say 57K unique hosts.