store-gateway pods lose connection to memcached and end up crashing
Describe the bug
store-gateway pods will fail to connect to memcached with an i/o timeout error, as seen in: https://github.com/thanos-io/thanos/issues/1979
{"caller":"memcached_client.go:382","err":"read tcp 192.168.68.132:59858->10.232.216.84:11211: i/o timeout","firstKey":"P:01F4AB8BMEH03WEXNMY516NMC8:mauaq0QeA5VHuyjZHSsF5P_DU_MPW7BLqAbRy42_Z2I","level":"warn","msg":"failed to fetch items from memcached","numKeys":2,"ts":"2021-04-27T23:27:10.227890794Z"}
To Reproduce
Steps to reproduce the behavior:
- Start Cortex (1.8.0)
- Perform Operations(Read/Write/Others)
- Ingest 1.1 million metrics per second
- run store-gateway as a stateful set hooked up to s3
- wait for them to fail by losing connection to memcached
Expected behavior
I expect store-gateway pods to run for a long time without crashing.
Environment:
- Infrastructure: Kubernetes / EKS / S3
- Deployment tool: helm
Storage Engine
- [x] Blocks
- [ ] Chunks
Additional Context
An error connecting to memcached or an i/o timeout doesn't compromise the correct functioning of Cortex, but it can have a negative impact on query performance. That being said, temporary and sporadic errors can happen. If the issue is persistent, then I would suggest you investigate it. Some suggestions:
- Try to increase the memcached client timeout
- Ensure memcached is scaled up correctly (e.g. doesn't hit the max connections limit in the memcached server)
- Scale up memcached to more replicas
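For reference, these knobs live under each cache's memcached client block in the blocks storage config. A rough sketch for one of the caches (field names follow the blocks storage memcached client docs; the address is a placeholder and the values are only a starting point, not tuned advice):

```yaml
blocks_storage:
  bucket_store:
    index_cache:
      backend: memcached
      memcached:
        addresses: dns+memcached.example.svc.cluster.local:11211  # placeholder address
        timeout: 500ms              # raise this if you keep seeing i/o timeouts
        max_idle_connections: 100   # keep (replicas x this) under the memcached server's connection limit
        max_get_multi_batch_size: 100
        max_get_multi_concurrency: 100
```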
It's not just query performance that is impacted; it's actual queries failing:
"expanding series: failed to get store-gateway replication set owning the block 01F4AB90AE4ZS92FN8T8RMT5JW: at least 1 healthy replica required, could only find 0"
It doesn't look like our ElastiCache memcached nodes are under connection stress. We're running 4 nodes currently. The NewConnections metric shows we're not hitting the limit of 65,000 connections yet.
I'll try your other suggestions as well and report back here.
I've increased the memcached timeouts in the cortex configs to 60s.
Also doubled my memcached instances from 4 to 8. We're still not approaching the limits on connections there. ElastiCache instance types are cache.r6g.xlarge.
Here are some more logs, hoping they help a dev identify what the issue could be. These logs are common and spew out in no particular order.
{"caller":"gateway.go:315","level":"info","msg":"successfully synchronized TSDB blocks for all users","reason":"periodic","ts":"2021-05-03T16:44:20.450207989Z"}
{"caller":"memcached_client.go:382","err":"failed to wait for turn: context canceled","firstKey":"P:01F4QVV8HT4YRY7QGFX5C49TGM:Jlm4f_irgzioCNF9dEqEkKppNxmXQFiV9P1lsOgYN9g","level":"warn","msg":"failed to fetch items from memcached","numKeys":2,"ts":"2021-05-03T16:48:50.501378775Z"}\
{"addresses":"dns+us-west-2-x-cortex-memcached.x.cfg.usw2.cache.amazonaws.com:11211","caller":"memcached_client.go:560","err":"lookup IP addresses \"us-west-2-x-cortex-memcached.x.cfg.usw2.cache.amazonaws.com\": i/o timeout","level":"error","msg":"failed to resolve addresses for memcached","ts":"2021-05-03T16:50:53.613230819Z"}
{"caller":"client.go:234","err":"Get \"http://consul-server.cortex.svc.cluster.local:8500/v1/kv/collectors/store-gateway?index=54925797&stale=&wait=10000ms\": context canceled","key":"collectors/store-gateway","level":"error","msg":"error getting path","ts":"2021-05-03T16:50:59.945392587Z"}
{"caller":"gateway.go:313","err":"failed to synchronize TSDB blocks for user platform: read bucket index: read bucket index: Get \"https://us-west-2-x-cortex-blocks.s3.dualstack.us-west-2.amazonaws.com/platform/bucket-index.json.gz\": context canceled","level":"warn","msg":"failed to synchronize TSDB blocks","reason":"ring-change","ts":"2021-05-03T16:51:02.894971377Z"}
{"caller":"grpc_logging.go:55","duration":"3m1.534076921s","err":"rpc error: code = Aborted desc = fetch series for block 01F4RY5G990HW47TFPGEFV5XXY: expanded matching posting: get postings: read postings range: get range reader: Get \"https://us-west-2-x-cortex-blocks.s3.dualstack.us-west-2.amazonaws.com/platform/01F4RY5G990HW47TFPGEFV5XXY/index\": context canceled","level":"warn","method":"/gatewaypb.StoreGateway/Series","msg":"gRPC\n","ts":"2021-05-03T16:51:34.361004474Z"}
{"caller":"basic_lifecycler.go:441","err":"failed to CAS collectors/store-gateway","level":"warn","msg":"failed to heartbeat instance in the ring","ring":"store-gateway","ts":"2021-05-03T16:50:57.781708966Z"}
{"caller":"basic_lifecycler.go:197","level":"info","msg":"ring lifecycler is shutting down","ring":"store-gateway","ts":"2021-05-03T16:50:57.781725181Z"}
Currently using c5n.4xlarge instance types for store-gateway pods, and we've got 24 of them. Each instance shares about 4% of the ring ownership.
After a store-gateway pod crashes with a slurry of the errors above, it takes an hour or more of "loaded new block" and around 1TB of disk space before it posts ready status again. Network activity on the EC2 instance tops out around 17Gbps while it's downloading blocks.
index_header_lazy_loading_enabled is set to true.
Under the blocks_storage section of the config, the chunks_cache, index_cache and metadata_cache all use the "dns service discovery" setting of memcached.addresses: dns+<my ElastiCache memcached address>, which is listed in the docs as experimental, but no other options are given.
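For reference, all three caches accept the same Thanos-style service discovery prefixes (dns+ for A/AAAA lookups, dnssrv+/dnssrvnoa+ for SRV records). A sketch with placeholder hostnames:

```yaml
blocks_storage:
  bucket_store:
    chunks_cache:
      backend: memcached
      memcached:
        addresses: dns+my-cluster.cfg.usw2.cache.amazonaws.com:11211   # ElastiCache configuration endpoint (placeholder)
    index_cache:
      backend: memcached
      memcached:
        addresses: dnssrvnoa+_memcached._tcp.memcached.cortex.svc.cluster.local   # in-cluster SRV record (placeholder)
    metadata_cache:
      backend: memcached
      memcached:
        addresses: dns+my-cluster.cfg.usw2.cache.amazonaws.com:11211   # placeholder
```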
It's hard to believe this is just a DNS issue, because this is running entirely in AWS and we didn't create any extra DNS records for memcached. We tried installing node-local-dns on the cluster but it didn't seem to help.
We're pushing between 1.2 and 1.5 million metrics per second with a replication factor of 3. Are we pushing Cortex too far? What other components need to be scaled up or looked at?
I'm happy to share more configs or logs.
I would like it if store-gateway pods took less time to start up, and if there was something I could look into to uncover the mystery behind DNS lookup failures and connection timeouts. I suspect I'm pushing the whole system just a bit too hard, but I'm not sure where to look for evidence. Could Jaeger be of any help here?
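For what it's worth, Cortex picks up the standard Jaeger client environment variables, so enabling tracing on the store-gateways should look roughly like this (the agent address and sampler settings are placeholders, not recommendations):

```yaml
# Hypothetical env block on the store-gateway StatefulSet; variable names are the
# standard jaeger-client-go ones, values are illustrative.
env:
  - name: JAEGER_AGENT_HOST
    value: jaeger-agent.tracing.svc.cluster.local
  - name: JAEGER_SAMPLER_TYPE
    value: ratelimiting
  - name: JAEGER_SAMPLER_PARAM
    value: "5"
```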
"expanding series: failed to get store-gateway replication set owning the block 01F4AB90AE4ZS92FN8T8RMT5JW: at least 1 healthy replica required, could only find 0"
This error means that there are not enough healthy store-gateways in the ring (e.g. they crashed or were OOMKilled). This error should not be related in any way to memcached.
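For context, that replication set comes from the store-gateway ring. A rough sketch of the relevant settings (key names per the Cortex docs; values are illustrative):

```yaml
store_gateway:
  sharding_enabled: true
  sharding_ring:
    replication_factor: 3   # "at least 1 healthy replica" is evaluated against this set
    heartbeat_timeout: 1m   # instances missing heartbeats longer than this are considered unhealthy
    kvstore:
      store: consul
      consul:
        host: consul-server.cortex.svc.cluster.local:8500
```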
> Here are some more logs, hoping they help a dev identify what the issue could be. These logs are common and spew out in no particular order. [...]
The errors you posted above are all about context canceled. It means the store-gateway was shutting down; we can also see it from the log "ring lifecycler is shutting down".
> After a store-gateway pod crashes with a slurry of the errors above, it takes an hour or more of "loaded new block" and around 1TB of disk space before it posts ready status again. Network activity on the EC2 instance tops out around 17Gbps while it's downloading blocks.
- How many blocks do you have in the storage? You can look at the cortex_bucket_blocks_count metric exported by the compactor (see the rule sketch after this list)
- Is the compactor running? This is essential to reduce the number of blocks
- Are you using persistent disks for store-gateways?
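To find the tenant with the most blocks, something like the following works. Ad hoc, topk(5, cortex_bucket_blocks_count) in Prometheus answers the same question. The sketch assumes the tenant label on cortex_bucket_blocks_count is `user`, as on most per-tenant Cortex metrics, and the threshold is illustrative:

```yaml
groups:
  - name: cortex-blocks
    rules:
      # Fires if any tenant accumulates more blocks than you consider healthy
      # for your compaction window.
      - alert: CortexTenantHasTooManyBlocks
        expr: max by (user) (cortex_bucket_blocks_count) > 5000
        for: 6h
        labels:
          severity: warning
        annotations:
          message: "Tenant {{ $labels.user }} has {{ $value }} blocks; the compactor may not be keeping up."
```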
> We're pushing between 1.2 and 1.5 million metrics per second with a replication factor of 3. Are we pushing Cortex too far?
Not really. This is a fairly small scale for Cortex.
> * How many blocks do you have in the storage? You can look at the `cortex_bucket_blocks_count` metric exported by the compactor
a little over 38,000 -- I run compactors with 5 replicas. 0 reports 38,000 and 4 reports 380.
> * Is the compactor running? This is essential to reduce the number of blocks
Compactor is running and has no restarts. Logs are filled with:
{"caller":"compact.go:953","component":"compactor","level":"info","msg":"start of compactions","org_id":"platform","ts":"2021-05-14T17:24:23.963826675Z"}
{"caller":"compact.go:705","component":"compactor","group":"0@{__org_id__=\"platform\"}","groupKey":"0@17918867045866434399","level":"info","msg":"compaction available and planned; downloading blocks","org_id":"platform","plan":"[01F1DVV22RN260G4DFYYXYNWBG (min time: 1616436000000, max time: 1616443200000) 01F1DVTGREEFTPDK0NMW7N25EB (min time: 1616436000000, max time: 1616443200000) 01F1DVTQA0NW61SH3CXRBBWY06 (min time: 1616436000000, max time: 1616443200000) 01F1DVVJ1V333BV9VEKQMACYHJ (min time: 1616436000000, max time: 1616443200000) 01F1DVTE1103Q7ND7V8NZJ4GHK (min time: 1616436000000, max time: 1616443200000) 01F1DVVMNTNS7XZMZ56HW8V4VR (min time: 1616436000000, max time: 1616443200000) 01F1DVVE9D74510769AFQT6RR2 (min time: 1616436000000, max time: 1616443200000) 01F1DVVBGGV0JGWXFWABFT6TF4 (min time: 1616436000000, max time: 1616443200000) 01F1DVV6QQ9J7KQRRMZ533918Q (min time: 1616436000000, max time: 1616443200000) 01F1DVVNWDKRWVGFS7VBAG0N6E (min time: 1616436000000, max time: 1616443200000) 01F1DVT0D00JEXWC5GMJ37RA00 (min time: 1616436000000, max time: 1616443200000) 01F1DVT584RGWMVD9Z23BWH7D2 (min time: 1616436000000, max time: 1616443200000) 01F1DVV12ABNC0E1ZVS34C7M3M (min time: 1616436000000, max time: 1616443200000) 01F1DVTCH5GTV37E83X0GP08HB (min time: 1616436000000, max time: 1616443200000) 01F1DVVREQWNNNZ3D5YD1GBZ08 (min time: 1616436000000, max time: 1616443200000) 01F1DVT16VTDBPTRBT2ECDB210 (min time: 1616436000000, max time: 1616443200000) 01F1DVVAAFRP05P3MGAR3RZAFM (min time: 1616436000000, max time: 1616443200000) 01F1DVVM5Y7RJXV71MWBC7AQ1R (min time: 1616436000000, max time: 1616443200000) 01F1JM25FKWGXZ627VB5GRP17S (min time: 1616436000000, max time: 1616443200000) 01F1DVTH9D1SZGGBE32WHV6DG3 (min time: 1616436000000, max time: 1616443200000) 01F1DVTTQ77GBR8FTPD2WN38V8 (min time: 1616436000000, max time: 1616443200000) 01F1DVTP929KHGK7ZYQ3ZXTM9H (min time: 1616436000000, max time: 1616443200000) 01F1DVTM8XFRTNJDNEPQW0MSWH (min time: 1616436000000, max time: 1616443200000) 01F1JKGZYMH7TW7HBQDFC1APKT (min time: 1616436000000, max time: 1616443200000) 01F1DVT3FGFBAP1V9VQ108S6V9 (min time: 1616436000000, max time: 1616443200000) 01F1DVTWB94HNNT57AV6FTYHX8 (min time: 1616436000000, max time: 1616443200000) 01F1DVTJEBXNKEER31QMQTJHB2 (min time: 1616436000000, max time: 1616443200000) 01F1DVTYN4ZJF98BFFEGYAF0RR (min time: 1616436000000, max time: 1616443200000) 01F1DVVJBKA0PF779HS6Q9PZ0Z (min time: 1616436000000, max time: 1616443200000) 01F1DVVGHGA145SMPMRNTBJ8NP (min time: 1616436000000, max time: 1616443200000) 01F1DVT1GQBKDCF30F770VPQ7C (min time: 1616436000000, max time: 1616443200000) 01F1DVVRV8QE25ANDC8HXFGZRW (min time: 1616436000000, max time: 1616443200000) 01F1DVTTWJE9J1JHJVE1X0B2HR (min time: 1616436000000, max time: 1616443200000) 01F1DVV09D3T6JPCQD8JXP8S2Y (min time: 1616436000000, max time: 1616443200000) 01F1DVTPZE517RZAREB6SWBM7A (min time: 1616436000000, max time: 1616443200000) 01F1JK81YZAB2PSZ4WNYVT25RF (min time: 1616436000000, max time: 1616443200000) 01F1DVTR00323BPWBZKN7E8T06 (min time: 1616436000000, max time: 1616443200000) 01F1DVT2S7FWSSANWT23HSPYQX (min time: 1616436000000, max time: 1616443200000) 01F1DVVA1FH2YPR465C2CNHPQX (min time: 1616436000000, max time: 1616443200000) 01F1JMAZGX15BB55AMAZBY1YE7 (min time: 1616436000000, max time: 1616443200000) 01F1DVV9Y7XYBCYC0ST08077F2 (min time: 1616436000000, max time: 1616443200000) 01F1DVTE9Q5KDKRA55CVFZ9DV1 (min time: 1616436000000, max time: 1616443200000) 01F1DVT16BGB9SPFTZC38SDMWW (min time: 
1616436000000, max time: 1616443200000) 01F1DVTH634EWYFXTT66DTMJTV (min time: 1616436000000, max time: 1616443200000) 01F1DVV7K3KHDTBZTGSFDRBCXB (min time: 1616436000000, max time: 1616443200000) 01F1DVVDJN2BSPTAVGZQJKY16T (min time: 1616436000000, max time: 1616443200000) 01F1DVX5RKCFMYWSB64KMMT3P9 (min time: 1616436000000, max time: 1616443200000) 01F1DVV7NG0M30QX714GQ82DM2 (min time: 1616436000000, max time: 1616443200000) 01F1DVVK0CJWV9GSZ60W6VCBPE (min time: 1616436000000, max time: 1616443200000) 01F1DVSYPMZF20A8ASCA0AK7VR (min time: 1616436000000, max time: 1616443200000) 01F1DVTRM4M5NY0EDBD9Q7G02J (min time: 1616436000000, max time: 1616443200000) 01F1DVTA384SATTDSQDJTZXMVM (min time: 1616436000000, max time: 1616443200000) 01F1DVTZT1Q64B2F7T6Q6CH422 (min time: 1616436000000, max time: 1616443200000) 01F1DVV7YP5DDSHQB05EZ7C74D (min time: 1616436000000, max time: 1616443200000) 01F1DVVDN1F26KAAHQ503FWNVQ (min time: 1616436000000, max time: 1616443200000) 01F1JKSWJGV82DADX2SFPYH1N4 (min time: 1616436000000, max time: 1616443200000) 01F1DVT70VPJ9PS8EBVWXHKMFF (min time: 1616436000000, max time: 1616443200000) 01F1DVSXRABHVKTHSVAC7PXXSJ (min time: 1616436000000, max time: 1616443200000) 01F1DVTA58196Q7BHY9V55P2CT (min time: 1616436000000, max time: 1616443200000) 01F1DVVQGW4CQ25FF96VR9CCMK (min time: 1616436000000, max time: 1616443200000)]","ts":"2021-05-14T17:24:24.213035413Z"}
{"caller":"blocks_cleaner.go:139","component":"cleaner","level":"info","msg":"started blocks cleanup and maintenance","ts":"2021-05-14T17:36:54.340453393Z"}
{"caller":"blocks_cleaner.go:294","component":"cleaner","level":"info","msg":"started blocks cleanup and maintenance","org_id":"platform","ts":"2021-05-14T17:36:54.945467206Z"}
{"caller":"blocks_cleaner.go:299","component":"cleaner","duration":"3.383828972s","level":"info","msg":"completed blocks cleanup and maintenance","org_id":"platform","ts":"2021-05-14T17:36:58.329296413Z"}
{"caller":"blocks_cleaner.go:143","component":"cleaner","level":"info","msg":"successfully completed blocks cleanup and maintenance","ts":"2021-05-14T17:36:58.329355039Z"}
This is from cortex-compactor-0 -- compactors 1, 2 and 3 usually don't have much in the way of logs other than "success"-type messages.
> * Are you using persistent disks for store-gateways?
Yes, we're using PVCs with 2TiB of space and usually sit around 62% usage.
> We're pushing between 1.2 and 1.5 million metrics per second with a replication factor of 3. Are we pushing Cortex too far?
> Not really. This is a fairly small scale for Cortex.
That's good to know.
> a little over 38,000 -- I run compactors with 5 replicas. 0 reports 38,000 and 4 reports 380.
With the default config, we compact blocks up to 24h, so after compaction you should have one block per day (per tenant). The number of compacted blocks depends on how many tenants you have in your cluster, but 38k looks suspicious to me unless you have a very large number of tenants. I would suggest you investigate which tenant has the highest number of blocks and check whether the compactor can catch up with the compaction.
Looks like compactions were failing for our largest tenant, causing a slew of problems.
We're now alerting on "block with not healthy index found", as this halts compactions for the tenant until the block is deleted.
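For anyone else hitting this, a metrics-based alternative to log alerting is to watch for failed compaction runs. A sketch assuming the cortex_compactor_runs_failed_total counter is available in your Cortex version (thresholds and durations are illustrative):

```yaml
groups:
  - name: cortex-compactor
    rules:
      # Fires when the compactor keeps reporting failed runs; a bad block
      # ("not healthy index found") typically shows up here as repeated failures.
      - alert: CortexCompactorRunsFailing
        expr: increase(cortex_compactor_runs_failed_total[2h]) > 0
        for: 1h
        labels:
          severity: warning
        annotations:
          message: "Compactor {{ $labels.instance }} has failed compaction runs in the last 2h."
```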
> Looks like compactions were failing for our largest tenant, causing a slew of problems.
With regard to this issue, is there anything left open, or can we close it?
Closed.
I'm going to re-open this issue because it seems that the default memcached settings for index_cache aren't as good as they could be.
I found this issue from last year, Thanos: https://github.com/thanos-io/thanos/issues/1979
It seems like there are some suggestions around what to set the memcached_config settings to, but I'm not sure anyone has reported back a working example.
I'd like to review the defaults for each cache, make sure the defaults are set up to handle what Cortex requires, and then provide an example of handling more load.
The current index-cache and metadata-cache pages lack working examples of a performant configuration.
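Something along these lines is the kind of example I think the docs could carry. Field names are from the blocks storage memcached client config; the addresses are placeholders and the concurrency/batch values are illustrative, not validated at this scale:

```yaml
blocks_storage:
  bucket_store:
    index_cache:
      backend: memcached
      memcached:
        addresses: dnssrvnoa+_memcached._tcp.cortex-memcached-index      # placeholder
        timeout: 750ms
        max_idle_connections: 150
        max_async_concurrency: 50
        max_async_buffer_size: 25000
        max_get_multi_concurrency: 100
        max_get_multi_batch_size: 100
        max_item_size: 5242880        # 5MiB; must not exceed memcached's -I item size limit
    metadata_cache:
      backend: memcached
      memcached:
        addresses: dnssrvnoa+_memcached._tcp.cortex-memcached-metadata   # placeholder
        timeout: 500ms
        max_idle_connections: 100
```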
Even switching to inmemory, I still get errors when trying to run a long query, and these errors point to memcached?
Cortex config:
config:
  auth_enabled: true
  blocks_storage:
    s3:
      endpoint: ""
      bucket_name: ""
    bucket_store:
      index_header_lazy_loading_enabled: true
      bucket_index:
        enabled: true
      chunks_cache:
        backend: memcached
        memcached:
          addresses: dnssrvnoa+_memcache._tcp.cortex-memcached
          timeout: 10s
      index_cache:
        backend: inmemory
        inmemory:
          # 8Gi of memory for index cache
          max_size_bytes: 8589934592
{"caller":"memcached_client.go:387","err":"failed to wait for turn. Instance: : context canceled","firstKey":"P:01FGC2H8Y1WPSQJRR8F1SC91JS:XnzORFoKYmo5VHpR8YrJhA_FQF1GVONwpDV1aHLAskc","level":"warn","msg":"failed to fetch items from memcached","name":"index-cache","numKeys":4,"ts":"2021-09-29T23:47:48.429128553Z"}
@cabrinha I have a similar issue. Did you solve it?
I'm still fairly certain that if you have any tenant with a large number of uncompacted blocks, large queries can crash the store-gateways. I'm not sure how to prove this.
@cabrinha I checked the compactor, but there is no issue with compaction. When I run a large query, there is always a crash in the store-gateway showing memcached timeouts and context deadline exceeded. I checked the memcached metrics, but there were no issues. Did you configure the query-frontend scheduler? We don't have this component, so I'll try to configure it.
I'll post some error messages tomorrow to help debug this issue.
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.
I am facing this same issue. When I run queries with a long duration, i.e. queries >= 2 days, the store-gateways become unhealthy and the queries fail with the following logs.
[cortex-infra03-store-gateway-0] level=error ts=2022-02-05T13:20:51.953964508Z caller=client.go:241 msg="error getting path" key=collectors/store-gateway err="Get \"http://consul-infra03-consul-server.cortex.svc.cluster.local:8500/v1/kv/collectors/store-gateway?index=25381&stale=&wait=10000ms\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
[cortex-infra03-store-gateway-4] level=error ts=2022-02-05T13:20:52.039745418Z caller=client.go:241 msg="error getting path" key=collectors/store-gateway err="Get \"http://consul-infra03-consul-server.cortex.svc.cluster.local:8500/v1/kv/collectors/store-gateway?index=25381&stale=&wait=10000ms\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
[cortex-infra03-store-gateway-1] level=error ts=2022-02-05T13:20:58.368233783Z caller=client.go:241 msg="error getting path" key=collectors/store-gateway err="Get \"http://consul-infra03-consul-server.cortex.svc.cluster.local:8500/v1/kv/collectors/store-gateway?index=25381&stale=&wait=10000ms\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
[cortex-infra03-store-gateway-2] level=warn ts=2022-02-05T13:21:00.509296919Z caller=grpc_logging.go:55 method=/gatewaypb.StoreGateway/Series duration=138.245µs err="failed to wait for turn: context canceled" msg="gRPC\n"
[cortex-infra03-store-gateway-2] level=warn ts=2022-02-05T13:21:00.628533562Z caller=grpc_logging.go:55 method=/gatewaypb.StoreGateway/Series duration=17.46441443s err="rpc error: code = Aborted desc = fetch series for block 01FTK5M869239D58YR1SH23TGG: expanded matching posting: get postings: read postings range: get range reader: Get \"https://eu01-infra03-cortex.s3.dualstack.eu-west-1.amazonaws.com/logistics/01FTK5M869239D58YR1SH23TGG/index\": context canceled" msg="gRPC\n"
[cortex-infra03-store-gateway-3] level=warn ts=2022-02-05T13:21:02.617874325Z caller=grpc_logging.go:55 method=/gatewaypb.StoreGateway/Series duration=16.472239539s err="rpc error: code = Aborted desc = fetch series for block 01FTK5KFYPE25RARWDCK8D207D: expanded matching posting: get postings: read postings range: get range reader: Get \"https://eu01-infra03-cortex.s3.dualstack.eu-west-1.amazonaws.com/logistics/01FTK5KFYPE25RARWDCK8D207D/index\": context canceled" msg="gRPC\n"
[cortex-infra03-store-gateway-4] level=warn ts=2022-02-05T13:21:03.055869414Z caller=grpc_logging.go:55 method=/gatewaypb.StoreGateway/Series duration=18.095855025s err="rpc error: code = Aborted desc = fetch series for block 01FTK5MYDJ63R1CJ5P4CSNX1NQ: expanded matching posting: get postings: read postings range: get range reader: Get \"https://eu01-infra03-cortex.s3.dualstack.eu-west-1.amazonaws.com/logistics/01FTK5MYDJ63R1CJ5P4CSNX1NQ/index\": context canceled" msg="gRPC\n"
Same issue. Adding more RAM to the store-gateway seems to help only to some extent.
@vbrinza @goelankitt try looking at your compacted blocks and make sure your compactor is running properly. If you see many old blocks with a compaction level of just 1, your compactor may not be working properly.
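A related signal is how long it has been since the compactor last succeeded. A sketch assuming the cortex_compactor_last_successful_run_timestamp_seconds gauge exists in your version (threshold is illustrative):

```yaml
groups:
  - name: cortex-compactor-liveness
    rules:
      # Fires when the compactor has not completed a successful run in over 24h,
      # which usually means blocks (and compaction level 1 blocks) are piling up.
      - alert: CortexCompactorHasNotRunRecently
        expr: (time() - cortex_compactor_last_successful_run_timestamp_seconds) > 86400
        for: 1h
        labels:
          severity: critical
        annotations:
          message: "Compactor {{ $labels.instance }} has not completed a successful run in over 24h."
```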
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.