Distributors using spectacular amount of memory
We're ingesting about 2M samples per second across ~105M active time series, with 75 ingesters and 95 distributors. Our distributors balloon past 70GB/pod and keep growing until the pod consumes all physical memory on the node and gets OOM-killed. Ingester memory consumption, CPU time, and storage performance are all stable and comfortable. The only errors in the logs for both distributors and ingesters are occasional "duplicate sample for timestamp" messages that we've previously been told to ignore (see #2832).
We're running Cortex v1.3.0 for all components, with the TSDB storage backend. Here are the args we're passing to the distributor pods:
-auth.enabled=false
-consul.hostname=consul.cortex-tsdb.svc.cluster.local:8500
-distributor.ha-tracker.cluster=cortex_ha_cluster
-distributor.ha-tracker.consul.hostname=consul.cortex-tsdb.svc.cluster.local:8500
-distributor.ha-tracker.enable=true
-distributor.ha-tracker.enable-for-all-users=true
-distributor.ha-tracker.prefix=prom_ha/
-distributor.ha-tracker.store=consul
-distributor.health-check-ingesters=true
-distributor.ingestion-burst-size=1000000
-distributor.ingestion-rate-limit=100000
-distributor.ingestion-rate-limit-strategy=global
-distributor.remote-timeout=2s
-distributor.replication-factor=3
-distributor.ring.consul.hostname=consul.cortex-tsdb.svc.cluster.local:8500
-distributor.ring.prefix=
-distributor.shard-by-all-labels=true
-limits.per-user-override-config=/etc/cortex/overrides.yaml
-log.level=info
-mem-ballast-size-bytes=1073741824
-ring.heartbeat-timeout=10m
-ring.prefix=
-server.grpc-max-recv-msg-size-bytes=104857600
-server.grpc-max-send-msg-size-bytes=104857600
-server.grpc.keepalive.max-connection-age=2m
-server.grpc.keepalive.max-connection-age-grace=5m
-server.grpc.keepalive.max-connection-idle=1m
-target=distributor
-validation.reject-old-samples=true
-validation.reject-old-samples.max-age=12h
And here's our JSONnet config (powered by the grafana cortex-mixin library):
local cortex = import 'cortex/cortex.libsonnet';
local tsdb = import 'cortex/tsdb.libsonnet';
cortex + tsdb {
_config+:: {
namespace: 'cortex-tsdb',
storage_backend: 'aws',
storage_engine: 'blocks',
blocks_storage_bucket_name: '<snip>',
storage_tsdb_s3_endpoint: '<snip>',
cortex_store_gateway_data_disk_class: 'gp2',
cortex_compactor_data_disk_class: 'gp2',
cortex_ingester_data_disk_class: 'gp2',
cortex_ingester_data_disk_size: '200Gi',
aws_region: 'us-east-1',
s3_bucket_name: '<snip>',
// XXX: 3 is the default
replication_factor: 3,
// only needed for alert manager?
external_url: '<snip>',
query_frontend: {
sharded_queries_enabled: true,
},
ruler_enabled: true,
ruler_client_type: 's3',
ruler_s3_bucket_name: '<snip>',
overrides: {
fake: {
max_series_per_metric: 0, // Disabled in favour of the max global limit
max_series_per_user: 0, // Disabled in favour of the max global limit
max_global_series_per_user: 320000000, // 320M
max_global_series_per_metric: 10000000, // 10M
max_series_per_query: 100000,
max_samples_per_query: 1000000,
ingestion_rate: 25000000, //25M
ingestion_burst_size: 50000000, // 50M
},
},
},
// route incoming traffic to the correct service
frontend_ingress: {
apiVersion: 'extensions/v1beta1',
kind: 'Ingress',
metadata: {
name: 'cortex',
annotations: {
'kubernetes.io/ingress.class': 'nginx',
},
},
spec: {
rules:
[
{
host: '<snip>',
http: {
paths: [
{
path: '/',
backend: {
serviceName: 'query-frontend',
servicePort: 80,
},
},
{
path: '/compactor',
backend: {
serviceName: 'compactor',
servicePort: 80,
},
},
{
path: '/distributor',
backend: {
serviceName: 'distributor',
servicePort: 80,
},
},
{
path: '/ingester',
backend: {
serviceName: 'distributor',
servicePort: 80,
},
},
{
path: '/ruler',
backend: {
serviceName: 'ruler',
servicePort: 80,
},
},
{
path: '/store-gateway',
backend: {
serviceName: 'store-gateway',
servicePort: 80,
},
},
{
path: '/api/prom/push',
backend: {
serviceName: 'distributor',
servicePort: 80,
},
},
],
},
},
],
},
},
} + {
namespace+: {
metadata+: {
annotations: {
'iam.amazonaws.com/permitted': '<snip>',
},
labels: {
name: 'cortex-tsdb',
}
},
},
// get rid of the CPU limits because of '<snip>'
compactor_container+::
$.util.resourcesLimits(null, '6Gi'),
// raised from limit 4Gi
distributor_container+::
# $.util.resourcesLimits(null, '32Gi') +
$.util.resourcesLimits(null, null) +
$.util.resourcesRequests(2, '24Gi'),
ingester_container+::
$.util.resourcesLimits(null, '40Gi') +
$.util.resourcesRequests('4', '20Gi'),
querier_container+::
$.util.resourcesLimits(null, '80Gi') +
$.util.resourcesRequests('4', '40Gi'),
store_gateway_container+::
$.util.resourcesLimits(null, '48Gi') +
$.util.resourcesRequests('2', '24Gi'),
// disable auth everywhere to turn off multi-tenant
alertmanager_args+:: {
'auth.enabled': false,
},
distributor_args+:: {
'auth.enabled': false,
'log.level': 'info',
'distributor.ha-tracker.store': 'consul',
'distributor.ha-tracker.consul.hostname': 'consul.%s.svc.cluster.local:8500' % $._config.namespace,
'distributor.ha-tracker.etcd.endpoints': null,
'distributor.remote-timeout': '2s',
// the 'cluster' label is already in use, so we use something else
'distributor.ha-tracker.cluster': 'cortex_ha_cluster',
// 100MB (bumped from default of 4MB)
'server.grpc-max-recv-msg-size-bytes': 1024 * 1024 * 100,
'server.grpc-max-send-msg-size-bytes': 1024 * 1024 * 100,
},
ingester_args+:: {
'auth.enabled': false,
'ingester.max-series-per-metric': 10000000, // added to make our env work
'ingester.max-series-per-user': 100000000, // same. default is 5000000
// 100MB (bumped from default of 4MB)
'server.grpc-max-recv-msg-size-bytes': 1024 * 1024 * 100,
'server.grpc-max-send-msg-size-bytes': 1024 * 1024 * 100,
},
ruler_args+:: {
'auth.enabled': false,
},
querier_args+:: {
'auth.enabled': false,
},
query_frontend_args+:: {
'auth.enabled': false,
},
// dont create an etcd cluster
etcd_cluster(name, size=3, version='3.3.13', env=[]):: {},
}
Ingestion is powered by a cluster of grafana-agent machines, each of which is remote_write-ing to an Amazon ELB backed by our production K8S cluster.
If you need any other piece of our configs, logs, manifests, or metrics data, please don't hesitate to ask. We've been working to adopt Cortex for our production metrics infrastructure, and this is the last blocker before we can cut over all queries to Cortex, so we're very interested in solving this.
Example pprof (original SVG available on request):
How many CPUs do your Kubernetes nodes have? Did you configure CPU limits on the distributors?
We had a similar issue on 32-CPU machines; it was solved by setting CPU limits on the distributors or by setting the environment variable GOMAXPROCS=requests.cpu.
How many CPUs do your Kubernetes nodes have? Did you configure CPU limits on the distributors?
c5.24xlarge EC2 instances, which have 96 vCPUs. We don't have CPU limits configured on these pods, but we request 2 CPUs. I'll try passing GOMAXPROCS and see if that changes anything.
Also, last night we tried passing GOGC=30 to these containers, which significantly slowed the growth of the heap. It now looks like the maximum heap size is at least bounded. On a hunch, I also decreased the number of distributors from 95 to 25, and the total amount of memory consumed by the distributor pods stayed the same.
I've seen distributors OOMing on:
- Single slow ingesters (a bug tracked in https://github.com/cortexproject/cortex/issues/1895, but it hasn't gotten much attention lately).
- After upgrading from v0.4 to v0.6 (tracked in https://github.com/cortexproject/cortex/issues/2053). This has never really been fixed either, but it's reproducible for multiple users.
My assumption is that problem #2 only happens for users running compression or multi-zone setups. I shared some pprofs here and there as well, and they usually pointed to issues in the (de)compression.
Quick update: setting GOMAXPROCS=requests.cpu (2 in my case) had unintended consequences, such as increasing distributor latency. I am now setting GOMAXPROCS=8 and seeing better latency.
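For reference, a minimal standalone sketch (not Cortex code) of how to confirm what the runtime actually picked up; on a 96-vCPU c5.24xlarge with no explicit setting, GOMAXPROCS defaults to the full CPU count regardless of the 2-CPU request:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) only queries the current value without changing it.
	// With no explicit setting on a c5.24xlarge this would report 96,
	// even though the pod only requests 2 CPUs.
	fmt.Printf("NumCPU=%d GOMAXPROCS=%d\n", runtime.NumCPU(), runtime.GOMAXPROCS(0))
}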
By the way, the metric I use for my measurements is container_memory_working_set_bytes{container="distributor"} and not container_memory_usage_bytes{container="distributor"}. The reasoning is explained at https://www.bwplotka.dev/2019/golang-memory-monitoring/
Can we get the heap profile data please? (not a screen dump or svg)
Can we get the heap profile data please? (not a screen dump or svg)
Sure, here you go: https://github.com/amckinley/cortex-heap
@bboreham anything else I can provide on our side? Happy to provide more heap profiles or try any tuning suggestions you have.
Sure, here you go: https://github.com/amckinley/cortex-heap
The heap profile looks good to me. The inuse space is about 6GB. I assume the 1GB allocated in main is the ballast (-mem-ballast-size-bytes). From this profile I personally can't find any symptom of a memory leak.
My suspicion is that what you see is some sort of side effect of Go GC behaviour. How often does the GC trigger? I've read above that you set GOGC=30, so I would expect it to trigger frequently, but when running with the default GOGC it may trigger too late. Just a guess.
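For context on that guess: with the default GOGC=100 the heap is allowed to roughly double over the live set before the next collection, whereas GOGC=30 triggers a collection after only ~30% growth. A minimal standalone sketch (not Cortex code) that reports the live heap against the next GC target:

package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func main() {
	// Equivalent to running with GOGC=30: the next-GC target becomes
	// roughly live_heap * 1.3 instead of live_heap * 2.
	debug.SetGCPercent(30)

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("live heap: %d MiB, next GC target: %d MiB, completed GCs: %d\n",
		m.HeapAlloc>>20, m.NextGC>>20, m.NumGC)
}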
Sorry @amckinley, missed your message.
The profile shows 6GB in use, and 42GB allocated since the start of the program. This is hard to square against the symptom of 70GB, but maybe taken from a different point in time.
Nearly all of the memory in use is from Snappy.Decode(), and there are 703 buffers live. This, I think, indicates a problem: each buffer should be unmarshalled and then discarded, so memory from Unmarshal() should at least equal that from Decode().
I wonder if you have 700 active push requests, which would suggest to me they're not getting passed through to ingesters fast enough. Can you get a goroutine dump (curl .../debug/pprof/goroutine)? Do you monitor the throughput and latency of calls into and out of distributors?
I see util.ParseProtoReader() returns the buffer, which we never use. It would be worth removing that, in case it is contributing to the problem.
Other possible points of interest: how many distributors? How many samples per incoming push request?
Just reading this makes me think this could be CPU throttling, given all the discussion about CPU requests/limits and GOMAXPROCS. CPU throttling could be leading to GCs not completing fast enough. It's easy for a program with a lot of threads/processes to exhaust its available CPU CFS periods, so adjusting GOMAXPROCS could be helping with that.
@bboreham sorry for the delay; I'm back to working on this now. Here's another heap dump (~50GB, this time of the particular distributor that's at max for our cluster), and here's a goroutine dump of the same distributor.
We're currently running with 30 distributors and 75 ingesters. We're using this very... unorthodox queue_config:
queue_config:
  capacity: 100000
  max_samples_per_send: 20000
  min_shards: 10
  max_shards: 10
We did a tremendous amount of testing using the suggested settings of MUCH higher values for max_shards, and every configuration we tried would always end up with the WAL growing without bound on our grafana-agent hosts. In desperation we tried the above settings, and that's when our cluster finally stabilized.
Thanks, that adds a bit more detail, and you managed to snap the heap at a point where in-use was high.
The memory is still all being allocated by Snappy:
github.com/cortexproject/cortex/pkg/util.ParseProtoReader
/go/src/github.com/cortexproject/cortex/pkg/util/http.go
Total: 0 33.29GB (flat, cum) 96.85%
106 . . if err == nil && len(body) <= maxSize {
107 . 33.14GB body, err = snappy.Decode(nil, body)
108 . . }
This time there are 4,227 blocks allocated, so an average size of 7.5MB, which is not out of line for 20,000 samples (~390 bytes per sample including all label names and values).
The other dump says there are a total of 355 goroutines, which doesn't support my theory of push requests getting stuck, but it was taken 8 minutes after the heap dump so maybe conditions had changed. It would be better to get two dumps close together in time.
Next theory: something is getting retained from inside the buffer. We've had issues like this before, but that was in the ingester; the distributor is more stateless.
Perhaps it could relate to this change in Go: https://github.com/golang/go/commit/2dcbf8b3691e72d1b04e9376488cef3b6f93b286, whereby sync.Pool is not cleared completely on every GC. We do pool outgoing requests to ingesters, and I can believe that those have pointers into the incoming buffer. If that's the case, taking more care to zero out those pointers would fix it.
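To illustrate the retention mechanism being described, here is a minimal, hypothetical sketch (the pooled type and field names are invented, not the real Cortex code): if a pooled request keeps sub-slices pointing into the decoded push buffer and they aren't zeroed before Put, the entire buffer stays reachable for as long as the pool retains the object.

package main

import "sync"

// pooledRequest stands in for a pooled outgoing ingester request; the real
// Cortex types differ, this is only illustrative.
type pooledRequest struct {
	labels [][]byte // sub-slices that may point into the decoded push buffer
}

var reqPool = sync.Pool{New: func() interface{} { return &pooledRequest{} }}

func handlePush(decoded []byte) {
	req := reqPool.Get().(*pooledRequest)
	// Sub-slicing shares the backing array, so as long as req is alive
	// (including while parked in the pool), the entire decoded buffer is too.
	req.labels = append(req.labels[:0], decoded[:16])

	// ... build and send requests to ingesters ...

	// Zeroing the references before returning the object to the pool is what
	// lets the GC reclaim the (potentially multi-MB) decoded buffer.
	for i := range req.labels {
		req.labels[i] = nil
	}
	req.labels = req.labels[:0]
	reqPool.Put(req)
}

func main() {
	handlePush(make([]byte, 8<<20)) // e.g. an ~8MB decoded push request
}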
I'm struck that your 20,000 samples per send is much higher than I use (1,000), but you say you did a lot of experimenting. Certainly if those incoming buffers were smaller then it wouldn't matter to have 4,000 of them retained.
Ideally what we should really do is walk the chain of pointers to those buffers up to the root, so we know what is keeping them alive. But I don't know of a way to do that in Go - we need a tool like goheapdump which has been updated to match current Go formats.
This still sounds like it could be the GC not running fast enough/keeping up, especially if container CPU limits are set. Is there a good way to verify that?
@chancez that doesn't match my understanding of how Go works.
You can set the environment variable GODEBUG=gctrace=1 to get a line about each garbage collection in the log file.
Note from earlier "We don't have CPU limits configured on these pods".
@bboreham I wasn't referring strictly to limits, but to CPU requests as well. It's all CFS quotas, just hard vs. soft limits. They said they use CPU requests:
We don't have CPU limits configured on these pods, but we request 2 CPUs.
CPU requests get translated to --cpu-shares under Docker, which will result in less CPU time if there's other CPU activity on the host besides the distributor. The Go program might be getting less CPU time than it needs to keep up with GC, especially with a high GOMAXPROCS, which means a lot of threads are competing for a small amount of CPU time. This might also be why a few people had success lowering GOMAXPROCS.
I can believe that those have pointers into the incoming buffer
Well, I checked, and we take care to blank out those pointers. Strike that theory. https://github.com/cortexproject/cortex/blob/7f85f92480a27171da4980c556c3a2d07dc30b87/pkg/ingester/client/timeseries.go#L280-L283
@amckinley I see you have 'ha' settings - do you have a lot of clusters? (i.e. could you have thousands of distinct values of cortex_ha_cluster and __replica__?)
We should copy the strings returned by findHALabels(), so they don't point into the gRPC buffer when retained by the HA tracker.
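A hedged sketch of the kind of fix being proposed (the types and function below are illustrative stand-ins, not the actual Cortex code): when the unmarshalling path produces label strings that alias the decoded gRPC buffer, copying them before they land in long-lived HA-tracker state lets the buffer be collected.

package main

import "strings"

// haKey and retainForHATracker are hypothetical stand-ins for the HA
// tracker's long-lived state; the real Cortex types differ.
type haKey struct {
	user    string
	cluster string
}

// retainForHATracker copies the label strings before storing them, so the
// long-lived map no longer keeps the whole decoded request buffer reachable.
// strings.Clone (Go 1.18+) guarantees a fresh allocation; on older Go,
// string([]byte(s)) achieves the same effect.
func retainForHATracker(cache map[haKey]string, user, cluster, replica string) {
	key := haKey{user: strings.Clone(user), cluster: strings.Clone(cluster)}
	cache[key] = strings.Clone(replica)
}

func main() {
	cache := map[haKey]string{}
	retainForHATracker(cache, "fake", "cortex_ha_cluster", "replica-1")
}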
@bboreham We have 8 clusters, each of which has 2 replicas. (Actually, we have one huge cluster, but in order to make grafana-agent work, we had to create 8 distinct scrape configs and use the Prometheus hashmod relabel action so that each agent only scrapes 1/8 of our total metrics.)
I have a similar issue with ha-tracker enabled on 1.7.0. Will the distributor's memory usage be lower without the ha-tracker?
@anarcher it is quite possible there is a bug causing this high memory usage in the HA-tracker code. Without pinpointing the true cause we cannot say for sure.
Returning to this:
could you have thousands of distinct values of cortex_ha_cluster and __replica__?
These metrics use strings which point into the incoming buffer, hence will cause it to be retained: https://github.com/cortexproject/cortex/blob/523dde19b8c841f62152a5cd6110cf67bd5540b9/pkg/distributor/distributor.go#L605 https://github.com/cortexproject/cortex/blob/523dde19b8c841f62152a5cd6110cf67bd5540b9/pkg/distributor/ha_tracker.go#L392
So it's really "thousands of distinct combinations of tenant (user) ID and cluster". But I still don't think it matches what was seen.
Still, we should copy the string before using it in a place that may be retained.
Hi, this is most likely caused by a memory leak that was fixed in https://github.com/cortexproject/cortex/pull/4739.
Another issue reported lower memory usage after deploying 1.13.0.
I will resolve this issue, but feel free to re-open if the problem is still there after upgrading to 1.13.0.