Cortex gets lots of errors every 2 hours
Describe the bug
We are monitoring the error codes returned by Cortex. We noticed that every 2 hours the error rate spikes sharply and our data gets delayed. I don't think this is caused by the remote write protocol. Is it some periodic Cortex job or something similar?
Environment:
- Infrastructure: Kubernetes
- Deployment tool: helm chart
Storage Engine
- [x] Blocks
Can you share the logs from the distributor or the response body? That may give some clues.
@alanprot It's mostly code 500 and code 400. I can't find any more helpful logs.
They're timeout and out-of-order sample errors, but why do they only happen every 2 hours?
Every 2 hours (assuming 2-hour blocks) the ingester should be compacting the head (creating a new block). This may cause the ingester to use more resources (CPU, disk) and maybe time out. Can you check the ingesters' CPU during that time?
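For illustration, a minimal sketch (Python, assuming a Prometheus instance already scrapes cAdvisor and the ingesters; the metric names and the `ingester-*` pod naming are assumptions that may differ per setup) of how one might check whether the CPU spikes line up with head compactions:

```python
# Sketch: correlate ingester CPU with TSDB head compactions via the Prometheus HTTP API.
# Assumptions: Prometheus is reachable at PROM_URL, it scrapes cAdvisor and the ingesters,
# ingester pods are named "ingester-*", and the metric names below exist in your setup.
import requests

PROM_URL = "http://prometheus:9090"  # assumption: adjust to your Prometheus service


def instant_query(promql):
    """Run an instant query and return the list of samples."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=30)
    resp.raise_for_status()
    return resp.json()["data"]["result"]


# Per-pod ingester CPU over the last 5 minutes (cAdvisor metric).
cpu = instant_query(
    'sum by (pod) (rate(container_cpu_usage_seconds_total{pod=~"ingester.*"}[5m]))'
)

# Head compactions in the last 10 minutes (TSDB metric exposed by the ingester; name assumed).
compactions = instant_query(
    'sum by (pod) (increase(cortex_ingester_tsdb_compactions_total[10m]))'
)

for sample in cpu:
    print(sample["metric"].get("pod"), "cpu cores:", sample["value"][1])
for sample in compactions:
    print(sample["metric"].get("pod"), "head compactions (10m):", sample["value"][1])
```

Running this shortly after a 2-hour boundary should show whether the pods with the compaction activity are the same ones with the CPU spike.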
One of them has high CPU usage (4 cores). I also increased it, but after the Cortex restart the error rate went up and never came back down. I'm unable to recover the cluster; the data is always delayed. (image 2)
Are there any tricks for tuning a Cortex cluster, or any monitoring tools for Cortex?
From your logs it's visible that the distributors are timing out (500) on push. That means a distributor is trying to push to an ingester that is no longer there. Are the ingesters restarting? Are they running out of memory? That would explain the high CPU usage after a restart and the timeouts. A restarted ingester needs to be manually removed from the ring (forget).
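As a rough way to spot this condition, a small sketch (assuming Prometheus scrapes the Cortex components and exposes `cortex_ring_members` with `name` and `state` labels, as the cortex-mixin alerts expect) that flags ingesters the ring still considers unhealthy:

```python
# Sketch: flag ring members that the distributors still see as Unhealthy.
# Assumption: Prometheus (at PROM_URL) scrapes the Cortex components and exposes
# cortex_ring_members with "name" and "state" labels.
import requests

PROM_URL = "http://prometheus:9090"  # assumption: adjust to your environment

query = 'max by (name, state) (cortex_ring_members{name="ingester", state="Unhealthy"})'
resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=30)
resp.raise_for_status()

unhealthy = [s for s in resp.json()["data"]["result"] if float(s["value"][1]) > 0]
if unhealthy:
    # A lingering unhealthy entry usually means the restarted ingester was never
    # forgotten; use the "Forget" action on the distributor's ring page to remove it.
    print("Unhealthy ingesters still in the ring -> forget them via the ring page")
else:
    print("No unhealthy ring members reported")
```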
@sockyone For monitoring, you typically set up a Prometheus instance to scrape all components and use the cortex-mixin, which includes alerts and dashboards. It's easier to debug problems when the typical alerts are configured.
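Until the mixin dashboards are in place, a quick sketch like this (again against the Prometheus HTTP API; the `job` and `route` regexes are assumptions and vary by version and deployment) can confirm which status codes actually spike every 2 hours:

```python
# Sketch: break the write-path request rate down by status code to see what spikes.
# Assumptions: Prometheus at PROM_URL scrapes the distributors, and the push route on
# cortex_request_duration_seconds_count matches the regex below (it varies by version).
import requests

PROM_URL = "http://prometheus:9090"  # assumption: adjust to your environment

query = (
    'sum by (status_code) ('
    'rate(cortex_request_duration_seconds_count{job=~".*distributor.*", route=~".*push.*"}[5m])'
    ')'
)
resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=30)
resp.raise_for_status()

for sample in resp.json()["data"]["result"]:
    print(sample["metric"].get("status_code", "unknown"), "req/s:", sample["value"][1])
```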
@friedrich-at-adobe I gave each ingester ~24 GiB of RAM and I have 12 ingesters for only 10 million time series. They can't be running out of memory. Memory was fine and CPU didn't reach the limit, but I still see a high rate of errors every 2 hours. How can I get through this peak better?
Something I ran into was networking bottlenecks within the Kubernetes cluster I was running. Switching to a faster CNI (instead of kube-proxy) helped solve my "every 2 hours my distributor breaks" problem.
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.