store-gateway pods lose connection to memcached and end up crashing
Describe the bug
store-gateway pods will fail to connect to memcached with an i/o timeout error, as seen in: https://github.com/thanos-io/thanos/issues/1979
{"caller":"memcached_client.go:382","err":"read tcp 192.168.68.132:59858->10.232.216.84:11211: i/o timeout","firstKey":"P:01F4AB8BMEH03WEXNMY516NMC8:mauaq0QeA5VHuyjZHSsF5P_DU_MPW7BLqAbRy42_Z2I","level":"warn","msg":"failed to fetch items from memcached","numKeys":2,"ts":"2021-04-27T23:27:10.227890794Z"}
To Reproduce
Steps to reproduce the behavior:
- Start Cortex (1.8.0)
- Perform Operations(Read/Write/Others)
- Ingest 1.1 million metrics per second
- run store-gateway as a stateful set hooked up to s3
- wait for them to fail by losing connection to memcached
Expected behavior
I expect store-gateway pods to run for a long time without crashing.
Environment:
- Infrastructure: Kubernetes / EKS / S3
- Deployment tool: helm
Storage Engine
- [x] Blocks
- [ ] Chunks
Additional Context
An error connecting to memcached or an i/o timeout doesn't compromise the correct functioning of Cortex, but it can have a negative impact on query performance. That being said, temporary and sporadic errors can happen. If the issue is persistent, then I would suggest you investigate it. Some suggestions:
- Try to increase the memcached client timeout
- Ensure memcached is scaled up correctly (e.g. doesn't hit the max connections limit in the memcached server)
- Scale up memcached to more replicas
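For reference, these knobs live under each cache's memcached client block in the blocks storage config. A rough sketch for one of the caches (field names follow the blocks storage memcached client docs; the address is a placeholder and the values are only a starting point, not tuned advice):

```yaml
blocks_storage:
  bucket_store:
    index_cache:
      backend: memcached
      memcached:
        addresses: dns+memcached.example.svc.cluster.local:11211  # placeholder address
        timeout: 500ms              # raise this if you keep seeing i/o timeouts
        max_idle_connections: 100   # keep (replicas x this) under the memcached server's connection limit
        max_get_multi_batch_size: 100
        max_get_multi_concurrency: 100
```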
It's not just query performance that is impacted; it's actual queries failing:
"expanding series: failed to get store-gateway replication set owning the block 01F4AB90AE4ZS92FN8T8RMT5JW: at least 1 healthy replica required, could only find 0"
It doesn't look like our ElastiCache memcached nodes are under connection stress. We're running 4 nodes currently. The NewConnections metric shows we're not hitting the limit of 65,000 connections yet.
I'll try your other suggestions as well and report back here.
I've increased the memcached timeouts in the cortex configs to 60s.
Also doubled my memcached instances from 4 to 8. We're still not approaching the limits on connections there. ElastiCache instance types are cache.r6g.xlarge.
Here are some more logs, hoping they help a dev identify what the issue could be. These logs are common and spew out in no particular order.
{"caller":"gateway.go:315","level":"info","msg":"successfully synchronized TSDB blocks for all users","reason":"periodic","ts":"2021-05-03T16:44:20.450207989Z"}
{"caller":"memcached_client.go:382","err":"failed to wait for turn: context canceled","firstKey":"P:01F4QVV8HT4YRY7QGFX5C49TGM:Jlm4f_irgzioCNF9dEqEkKppNxmXQFiV9P1lsOgYN9g","level":"warn","msg":"failed to fetch items from memcached","numKeys":2,"ts":"2021-05-03T16:48:50.501378775Z"}\
{"addresses":"dns+us-west-2-x-cortex-memcached.x.cfg.usw2.cache.amazonaws.com:11211","caller":"memcached_client.go:560","err":"lookup IP addresses \"us-west-2-x-cortex-memcached.x.cfg.usw2.cache.amazonaws.com\": i/o timeout","level":"error","msg":"failed to resolve addresses for memcached","ts":"2021-05-03T16:50:53.613230819Z"}
{"caller":"client.go:234","err":"Get \"http://consul-server.cortex.svc.cluster.local:8500/v1/kv/collectors/store-gateway?index=54925797&stale=&wait=10000ms\": context canceled","key":"collectors/store-gateway","level":"error","msg":"error getting path","ts":"2021-05-03T16:50:59.945392587Z"}
{"caller":"gateway.go:313","err":"failed to synchronize TSDB blocks for user platform: read bucket index: read bucket index: Get \"https://us-west-2-x-cortex-blocks.s3.dualstack.us-west-2.amazonaws.com/platform/bucket-index.json.gz\": context canceled","level":"warn","msg":"failed to synchronize TSDB blocks","reason":"ring-change","ts":"2021-05-03T16:51:02.894971377Z"}
{"caller":"grpc_logging.go:55","duration":"3m1.534076921s","err":"rpc error: code = Aborted desc = fetch series for block 01F4RY5G990HW47TFPGEFV5XXY: expanded matching posting: get postings: read postings range: get range reader: Get \"https://us-west-2-x-cortex-blocks.s3.dualstack.us-west-2.amazonaws.com/platform/01F4RY5G990HW47TFPGEFV5XXY/index\": context canceled","level":"warn","method":"/gatewaypb.StoreGateway/Series","msg":"gRPC\n","ts":"2021-05-03T16:51:34.361004474Z"}
{"caller":"basic_lifecycler.go:441","err":"failed to CAS collectors/store-gateway","level":"warn","msg":"failed to heartbeat instance in the ring","ring":"store-gateway","ts":"2021-05-03T16:50:57.781708966Z"}
{"caller":"basic_lifecycler.go:197","level":"info","msg":"ring lifecycler is shutting down","ring":"store-gateway","ts":"2021-05-03T16:50:57.781725181Z"}
Currently using c5n.4xlarge instance types for store-gateway pods, and we've got 24 of them. Each instance shares about 4% of the ring ownership.
After a store-gateway pod crashes with a slurry of the errors above, it takes an hour or more of "loaded new block" and around 1TB of disk space before it posts ready status again. Network activity on the EC2 instance tops out around 17Gbps while it's downloading blocks.
index_header_lazy_loading_enabled is set to true.
Under the blocks_storage section of the config, the chunks_cache, index_cache and metadata_cache all use the "dns service discovery" setting of memcached.addresses: dns+<my ElastiCache memcached address>, which is listed in the docs as experimental, but no other options are given.
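For reference, all three caches accept the same Thanos-style service discovery prefixes (dns+ for A/AAAA lookups, dnssrv+/dnssrvnoa+ for SRV records). A sketch with placeholder hostnames:

```yaml
blocks_storage:
  bucket_store:
    chunks_cache:
      backend: memcached
      memcached:
        addresses: dns+my-cluster.cfg.usw2.cache.amazonaws.com:11211   # ElastiCache configuration endpoint (placeholder)
    index_cache:
      backend: memcached
      memcached:
        addresses: dnssrvnoa+_memcached._tcp.memcached.cortex.svc.cluster.local   # in-cluster SRV record (placeholder)
    metadata_cache:
      backend: memcached
      memcached:
        addresses: dns+my-cluster.cfg.usw2.cache.amazonaws.com:11211   # placeholder
```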
It's hard to believe this is just a DNS issue, because this is running entirely in AWS and we didn't create any extra DNS records for memcached. We tried installing node-local-dns on the cluster but it didn't seem to help.
We're pushing between 1.2 and 1.5 million metrics per second with a replication factor of 3. Are we pushing Cortex too far? What other components need to be scaled up or looked at?
I'm happy to share more configs or logs.
I would like it if store-gateway pods took less time to start up, and if there was something I could look into to uncover the mystery behind DNS lookup failures and connection timeouts. I suspect I'm pushing the whole system just a bit too hard, but I'm not sure where to look for evidence. Could Jaeger be of any help here?
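For what it's worth, Cortex picks up the standard Jaeger client environment variables, so enabling tracing on the store-gateways should look roughly like this (the agent address and sampler settings are placeholders, not recommendations):

```yaml
# Hypothetical env block on the store-gateway StatefulSet; variable names are the
# standard jaeger-client-go ones, values are illustrative.
env:
  - name: JAEGER_AGENT_HOST
    value: jaeger-agent.tracing.svc.cluster.local
  - name: JAEGER_SAMPLER_TYPE
    value: ratelimiting
  - name: JAEGER_SAMPLER_PARAM
    value: "5"
```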
"expanding series: failed to get store-gateway replication set owning the block 01F4AB90AE4ZS92FN8T8RMT5JW: at least 1 healthy replica required, could only find 0"
This error means that there are not enough healthy store-gateways in the ring (e.g. they crashed or were OOMKilled). This error should not be related in any way to memcached.
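For context, that replication set comes from the store-gateway ring. A rough sketch of the relevant settings (key names per the Cortex docs; values are illustrative):

```yaml
store_gateway:
  sharding_enabled: true
  sharding_ring:
    replication_factor: 3   # "at least 1 healthy replica" is evaluated against this set
    heartbeat_timeout: 1m   # instances missing heartbeats longer than this are considered unhealthy
    kvstore:
      store: consul
      consul:
        host: consul-server.cortex.svc.cluster.local:8500
```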
> Here are some more logs, hoping they help a dev identify what the issue could be. These logs are common and spew out in no particular order. [...]
The errors you posted above are all about context canceled. It means the store-gateway was shutting down; we can also see it from the log "ring lifecycler is shutting down".
> After a store-gateway pod crashes with a slurry of the errors above, it takes an hour or more of "loaded new block" and around 1TB of disk space before it posts ready status again. Network activity on the EC2 instance tops out around 17Gbps while it's downloading blocks.
- How many blocks do you have in the storage? You can look at the cortex_bucket_blocks_count metric exported by the compactor (see the rule sketch after this list)
- Is the compactor running? This is essential to reduce the number of blocks
- Are you using persistent disks for store-gateways?
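To find the tenant with the most blocks, something like the following works. Ad hoc, topk(5, cortex_bucket_blocks_count) in Prometheus answers the same question. The sketch assumes the tenant label on cortex_bucket_blocks_count is `user`, as on most per-tenant Cortex metrics, and the threshold is illustrative:

```yaml
groups:
  - name: cortex-blocks
    rules:
      # Fires if any tenant accumulates more blocks than you consider healthy
      # for your compaction window.
      - alert: CortexTenantHasTooManyBlocks
        expr: max by (user) (cortex_bucket_blocks_count) > 5000
        for: 6h
        labels:
          severity: warning
        annotations:
          message: "Tenant {{ $labels.user }} has {{ $value }} blocks; the compactor may not be keeping up."
```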
> We're pushing between 1.2 and 1.5 million metrics per second with a replication factor of 3. Are we pushing Cortex too far?
Not really. This is a fairly small scale for Cortex.
> * How many blocks do you have in the storage? You can look at the `cortex_bucket_blocks_count` metric exported by the compactor
a little over 38,000 -- I run compactors with 5 replicas. 0 reports 38,000 and 4 reports 380.
> * Is the compactor running? This is essential to reduce the number of blocks
Compactor is running and has no restarts. Logs are filled with:
{"caller":"compact.go:953","component":"compactor","level":"info","msg":"start of compactions","org_id":"platform","ts":"2021-05-14T17:24:23.963826675Z"}
{"caller":"compact.go:705","component":"compactor","group":"0@{__org_id__=\"platform\"}","groupKey":"0@17918867045866434399","level":"info","msg":"compaction available and planned; downloading blocks","org_id":"platform","plan":"[01F1DVV22RN260G4DFYYXYNWBG (min time: 1616436000000, max time: 1616443200000) 01F1DVTGREEFTPDK0NMW7N25EB (min time: 1616436000000, max time: 1616443200000) 01F1DVTQA0NW61SH3CXRBBWY06 (min time: 1616436000000, max time: 1616443200000) 01F1DVVJ1V333BV9VEKQMACYHJ (min time: 1616436000000, max time: 1616443200000) 01F1DVTE1103Q7ND7V8NZJ4GHK (min time: 1616436000000, max time: 1616443200000) 01F1DVVMNTNS7XZMZ56HW8V4VR (min time: 1616436000000, max time: 1616443200000) 01F1DVVE9D74510769AFQT6RR2 (min time: 1616436000000, max time: 1616443200000) 01F1DVVBGGV0JGWXFWABFT6TF4 (min time: 1616436000000, max time: 1616443200000) 01F1DVV6QQ9J7KQRRMZ533918Q (min time: 1616436000000, max time: 1616443200000) 01F1DVVNWDKRWVGFS7VBAG0N6E (min time: 1616436000000, max time: 1616443200000) 01F1DVT0D00JEXWC5GMJ37RA00 (min time: 1616436000000, max time: 1616443200000) 01F1DVT584RGWMVD9Z23BWH7D2 (min time: 1616436000000, max time: 1616443200000) 01F1DVV12ABNC0E1ZVS34C7M3M (min time: 1616436000000, max time: 1616443200000) 01F1DVTCH5GTV37E83X0GP08HB (min time: 1616436000000, max time: 1616443200000) 01F1DVVREQWNNNZ3D5YD1GBZ08 (min time: 1616436000000, max time: 1616443200000) 01F1DVT16VTDBPTRBT2ECDB210 (min time: 1616436000000, max time: 1616443200000) 01F1DVVAAFRP05P3MGAR3RZAFM (min time: 1616436000000, max time: 1616443200000) 01F1DVVM5Y7RJXV71MWBC7AQ1R (min time: 1616436000000, max time: 1616443200000) 01F1JM25FKWGXZ627VB5GRP17S (min time: 1616436000000, max time: 1616443200000) 01F1DVTH9D1SZGGBE32WHV6DG3 (min time: 1616436000000, max time: 1616443200000) 01F1DVTTQ77GBR8FTPD2WN38V8 (min time: 1616436000000, max time: 1616443200000) 01F1DVTP929KHGK7ZYQ3ZXTM9H (min time: 1616436000000, max time: 1616443200000) 01F1DVTM8XFRTNJDNEPQW0MSWH (min time: 1616436000000, max time: 1616443200000) 01F1JKGZYMH7TW7HBQDFC1APKT (min time: 1616436000000, max time: 1616443200000) 01F1DVT3FGFBAP1V9VQ108S6V9 (min time: 1616436000000, max time: 1616443200000) 01F1DVTWB94HNNT57AV6FTYHX8 (min time: 1616436000000, max time: 1616443200000) 01F1DVTJEBXNKEER31QMQTJHB2 (min time: 1616436000000, max time: 1616443200000) 01F1DVTYN4ZJF98BFFEGYAF0RR (min time: 1616436000000, max time: 1616443200000) 01F1DVVJBKA0PF779HS6Q9PZ0Z (min time: 1616436000000, max time: 1616443200000) 01F1DVVGHGA145SMPMRNTBJ8NP (min time: 1616436000000, max time: 1616443200000) 01F1DVT1GQBKDCF30F770VPQ7C (min time: 1616436000000, max time: 1616443200000) 01F1DVVRV8QE25ANDC8HXFGZRW (min time: 1616436000000, max time: 1616443200000) 01F1DVTTWJE9J1JHJVE1X0B2HR (min time: 1616436000000, max time: 1616443200000) 01F1DVV09D3T6JPCQD8JXP8S2Y (min time: 1616436000000, max time: 1616443200000) 01F1DVTPZE517RZAREB6SWBM7A (min time: 1616436000000, max time: 1616443200000) 01F1JK81YZAB2PSZ4WNYVT25RF (min time: 1616436000000, max time: 1616443200000) 01F1DVTR00323BPWBZKN7E8T06 (min time: 1616436000000, max time: 1616443200000) 01F1DVT2S7FWSSANWT23HSPYQX (min time: 1616436000000, max time: 1616443200000) 01F1DVVA1FH2YPR465C2CNHPQX (min time: 1616436000000, max time: 1616443200000) 01F1JMAZGX15BB55AMAZBY1YE7 (min time: 1616436000000, max time: 1616443200000) 01F1DVV9Y7XYBCYC0ST08077F2 (min time: 1616436000000, max time: 1616443200000) 01F1DVTE9Q5KDKRA55CVFZ9DV1 (min time: 1616436000000, max time: 1616443200000) 01F1DVT16BGB9SPFTZC38SDMWW (min time: 
1616436000000, max time: 1616443200000) 01F1DVTH634EWYFXTT66DTMJTV (min time: 1616436000000, max time: 1616443200000) 01F1DVV7K3KHDTBZTGSFDRBCXB (min time: 1616436000000, max time: 1616443200000) 01F1DVVDJN2BSPTAVGZQJKY16T (min time: 1616436000000, max time: 1616443200000) 01F1DVX5RKCFMYWSB64KMMT3P9 (min time: 1616436000000, max time: 1616443200000) 01F1DVV7NG0M30QX714GQ82DM2 (min time: 1616436000000, max time: 1616443200000) 01F1DVVK0CJWV9GSZ60W6VCBPE (min time: 1616436000000, max time: 1616443200000) 01F1DVSYPMZF20A8ASCA0AK7VR (min time: 1616436000000, max time: 1616443200000) 01F1DVTRM4M5NY0EDBD9Q7G02J (min time: 1616436000000, max time: 1616443200000) 01F1DVTA384SATTDSQDJTZXMVM (min time: 1616436000000, max time: 1616443200000) 01F1DVTZT1Q64B2F7T6Q6CH422 (min time: 1616436000000, max time: 1616443200000) 01F1DVV7YP5DDSHQB05EZ7C74D (min time: 1616436000000, max time: 1616443200000) 01F1DVVDN1F26KAAHQ503FWNVQ (min time: 1616436000000, max time: 1616443200000) 01F1JKSWJGV82DADX2SFPYH1N4 (min time: 1616436000000, max time: 1616443200000) 01F1DVT70VPJ9PS8EBVWXHKMFF (min time: 1616436000000, max time: 1616443200000) 01F1DVSXRABHVKTHSVAC7PXXSJ (min time: 1616436000000, max time: 1616443200000) 01F1DVTA58196Q7BHY9V55P2CT (min time: 1616436000000, max time: 1616443200000) 01F1DVVQGW4CQ25FF96VR9CCMK (min time: 1616436000000, max time: 1616443200000)]","ts":"2021-05-14T17:24:24.213035413Z"}
{"caller":"blocks_cleaner.go:139","component":"cleaner","level":"info","msg":"started blocks cleanup and maintenance","ts":"2021-05-14T17:36:54.340453393Z"}
{"caller":"blocks_cleaner.go:294","component":"cleaner","level":"info","msg":"started blocks cleanup and maintenance","org_id":"platform","ts":"2021-05-14T17:36:54.945467206Z"}
{"caller":"blocks_cleaner.go:299","component":"cleaner","duration":"3.383828972s","level":"info","msg":"completed blocks cleanup and maintenance","org_id":"platform","ts":"2021-05-14T17:36:58.329296413Z"}
{"caller":"blocks_cleaner.go:143","component":"cleaner","level":"info","msg":"successfully completed blocks cleanup and maintenance","ts":"2021-05-14T17:36:58.329355039Z"}
This is from cortex-compactor-0 -- compactors 1, 2 and 3 usually don't have much in the way of logs other than "success"-type messages.
> * Are you using persistent disks for store-gateways?
Yes, we're using PVCs with 2TiB of space and usually sit around 62% usage.
> We're pushing between 1.2 and 1.5 million metrics per second with a replication factor of 3. Are we pushing Cortex too far?
> Not really. This is a fairly small scale for Cortex.
That's good to know.
> a little over 38,000 -- I run compactors with 5 replicas. 0 reports 38,000 and 4 reports 380.
With the default config, we compact blocks up to 24h, so after compaction you should have one block per day (per tenant). The number of compacted blocks depends on how many tenants you have in your cluster, but 38k looks suspicious to me unless you have a very large number of tenants. I would suggest you investigate which tenant has the highest number of blocks and check whether the compactor can catch up with the compaction.
Looks like compactions were failing for our largest tenant, causing a slew of problems.
We're now alerting on "block with not healthy index found", as this halts compactions for the tenant until the block is deleted.
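For anyone else hitting this, a metrics-based alternative to log alerting is to watch for failed compaction runs. A sketch assuming the cortex_compactor_runs_failed_total counter is available in your Cortex version (thresholds and durations are illustrative):

```yaml
groups:
  - name: cortex-compactor
    rules:
      # Fires when the compactor keeps reporting failed runs; a bad block
      # ("not healthy index found") typically shows up here as repeated failures.
      - alert: CortexCompactorRunsFailing
        expr: increase(cortex_compactor_runs_failed_total[2h]) > 0
        for: 1h
        labels:
          severity: warning
        annotations:
          message: "Compactor {{ $labels.instance }} has failed compaction runs in the last 2h."
```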
> Looks like compactions were failing for our largest tenant, causing a slew of problems.
With regard to this issue, is there anything left open, or can we close it?
Closed.
I'm going to re-open this issue because it seems that the default memcached settings for index_cache aren't as good as they could be.
I found this issue from last year, Thanos: https://github.com/thanos-io/thanos/issues/1979
It seems like there are some suggestions around what to set the memcached_config settings to, but I'm not sure anyone has reported back a working example.
I'd like to review the defaults for each cache, make sure the defaults are set up to handle what Cortex requires, and then provide an example of handling more load.
The current index-cache and metadata-cache pages lack working examples of a performant configuration.
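Something along these lines is the kind of example I think the docs could carry. Field names are from the blocks storage memcached client config; the addresses are placeholders and the concurrency/batch values are illustrative, not validated at this scale:

```yaml
blocks_storage:
  bucket_store:
    index_cache:
      backend: memcached
      memcached:
        addresses: dnssrvnoa+_memcached._tcp.cortex-memcached-index      # placeholder
        timeout: 750ms
        max_idle_connections: 150
        max_async_concurrency: 50
        max_async_buffer_size: 25000
        max_get_multi_concurrency: 100
        max_get_multi_batch_size: 100
        max_item_size: 5242880        # 5MiB; must not exceed memcached's -I item size limit
    metadata_cache:
      backend: memcached
      memcached:
        addresses: dnssrvnoa+_memcached._tcp.cortex-memcached-metadata   # placeholder
        timeout: 500ms
        max_idle_connections: 100
```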
Even switching to inmemory, I still get errors when trying to run a long query, and these errors point to memcached?
Cortex config:
config:
  auth_enabled: true
  blocks_storage:
    s3:
      endpoint: ""
      bucket_name: ""
    bucket_store:
      index_header_lazy_loading_enabled: true
      bucket_index:
        enabled: true
      chunks_cache:
        backend: memcached
        memcached:
          addresses: dnssrvnoa+_memcache._tcp.cortex-memcached
          timeout: 10s
      index_cache:
        backend: inmemory
        inmemory:
          # 8Gi of memory for index cache
          max_size_bytes: 8589934592
{"caller":"memcached_client.go:387","err":"failed to wait for turn. Instance: : context canceled","firstKey":"P:01FGC2H8Y1WPSQJRR8F1SC91JS:XnzORFoKYmo5VHpR8YrJhA_FQF1GVONwpDV1aHLAskc","level":"warn","msg":"failed to fetch items from memcached","name":"index-cache","numKeys":4,"ts":"2021-09-29T23:47:48.429128553Z"}
@cabrinha I have a similar issue. Did you solve it?
I'm still fairly certain that if you have any tenant with a large number of uncompacted blocks, large queries can crash the store-gateways. I'm not sure how to prove this.
@cabrinha I checked the compactor, but there is no issue with compaction. When I run a large query, there is always a crash in the store-gateway showing memcached timeouts and context deadline exceeded. I checked the memcached metrics, but there were no issues. Did you configure the query-frontend scheduler? We don't have this component, so I'll try to configure it.
I'll post some error messages tomorrow to help debug this issue.
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.
I am facing this same issue. When I run queries with a long duration, i.e. queries >= 2 days, the store-gateways become unhealthy and the queries fail with the following logs.
[cortex-infra03-store-gateway-0] level=error ts=2022-02-05T13:20:51.953964508Z caller=client.go:241 msg="error getting path" key=collectors/store-gateway err="Get \"http://consul-infra03-consul-server.cortex.svc.cluster.local:8500/v1/kv/collectors/store-gateway?index=25381&stale=&wait=10000ms\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
[cortex-infra03-store-gateway-4] level=error ts=2022-02-05T13:20:52.039745418Z caller=client.go:241 msg="error getting path" key=collectors/store-gateway err="Get \"http://consul-infra03-consul-server.cortex.svc.cluster.local:8500/v1/kv/collectors/store-gateway?index=25381&stale=&wait=10000ms\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
[cortex-infra03-store-gateway-1] level=error ts=2022-02-05T13:20:58.368233783Z caller=client.go:241 msg="error getting path" key=collectors/store-gateway err="Get \"http://consul-infra03-consul-server.cortex.svc.cluster.local:8500/v1/kv/collectors/store-gateway?index=25381&stale=&wait=10000ms\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
[cortex-infra03-store-gateway-2] level=warn ts=2022-02-05T13:21:00.509296919Z caller=grpc_logging.go:55 method=/gatewaypb.StoreGateway/Series duration=138.245µs err="failed to wait for turn: context canceled" msg="gRPC\n"
[cortex-infra03-store-gateway-2] level=warn ts=2022-02-05T13:21:00.628533562Z caller=grpc_logging.go:55 method=/gatewaypb.StoreGateway/Series duration=17.46441443s err="rpc error: code = Aborted desc = fetch series for block 01FTK5M869239D58YR1SH23TGG: expanded matching posting: get postings: read postings range: get range reader: Get \"https://eu01-infra03-cortex.s3.dualstack.eu-west-1.amazonaws.com/logistics/01FTK5M869239D58YR1SH23TGG/index\": context canceled" msg="gRPC\n"
[cortex-infra03-store-gateway-3] level=warn ts=2022-02-05T13:21:02.617874325Z caller=grpc_logging.go:55 method=/gatewaypb.StoreGateway/Series duration=16.472239539s err="rpc error: code = Aborted desc = fetch series for block 01FTK5KFYPE25RARWDCK8D207D: expanded matching posting: get postings: read postings range: get range reader: Get \"https://eu01-infra03-cortex.s3.dualstack.eu-west-1.amazonaws.com/logistics/01FTK5KFYPE25RARWDCK8D207D/index\": context canceled" msg="gRPC\n"
[cortex-infra03-store-gateway-4] level=warn ts=2022-02-05T13:21:03.055869414Z caller=grpc_logging.go:55 method=/gatewaypb.StoreGateway/Series duration=18.095855025s err="rpc error: code = Aborted desc = fetch series for block 01FTK5MYDJ63R1CJ5P4CSNX1NQ: expanded matching posting: get postings: read postings range: get range reader: Get \"https://eu01-infra03-cortex.s3.dualstack.eu-west-1.amazonaws.com/logistics/01FTK5MYDJ63R1CJ5P4CSNX1NQ/index\": context canceled" msg="gRPC\n"
Same issue. Adding more RAM to the store-gateway seems to help only to some extent.
@vbrinza @goelankitt try looking at your compacted blocks and make sure your compactor is running properly. If you see many old blocks with a compaction level of just 1, your compactor may not be working properly.
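A related signal is how long it has been since the compactor last succeeded. A sketch assuming the cortex_compactor_last_successful_run_timestamp_seconds gauge exists in your version (threshold is illustrative):

```yaml
groups:
  - name: cortex-compactor-liveness
    rules:
      # Fires when the compactor has not completed a successful run in over 24h,
      # which usually means blocks (and compaction level 1 blocks) are piling up.
      - alert: CortexCompactorHasNotRunRecently
        expr: (time() - cortex_compactor_last_successful_run_timestamp_seconds) > 86400
        for: 1h
        labels:
          severity: critical
        annotations:
          message: "Compactor {{ $labels.instance }} has not completed a successful run in over 24h."
```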
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.