Cortex gets lots of errors every 2 hours

Open · sockyone opened this issue 2 years ago • 7 comments

Describe the bug: We are monitoring the error codes returned from Cortex. We've noticed that every 2 hours the error rate spikes and our data gets delayed. I don't think this is caused by the remote-write protocol itself. Is it a Cortex background job or something similar?

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: helm chart

Storage Engine

  • [x] Blocks

sockyone avatar May 26 '22 06:05 sockyone
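
A rough way to see exactly which status codes spike, assuming the standard cortex_request_duration_seconds histogram exposed by the distributors is being scraped (the job and route matchers below are assumptions and may need adjusting to your deployment):

```promql
# Write-path error rate broken down by status code, over 5m windows
sum by (status_code) (
  rate(cortex_request_duration_seconds_count{job=~".*distributor.*", route=~".*push.*", status_code=~"4..|5.."}[5m])
)
```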

Can you share the logs from the distributor or the response body? That may give some clues.

alanprot avatar May 26 '22 07:05 alanprot

@alanprot it's mostly code 500, plus some code 400. I can't find more helpful logs. They are timed-out and out-of-order sample errors, but why do they only happen every 2 hours? [image]

sockyone avatar May 26 '22 07:05 sockyone
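
If the distributor's own logs don't show much, the matching ingester-side errors often do. A sketch for pulling both, assuming hypothetical namespace and workload names from a typical helm install (adjust to yours):

```bash
# Hypothetical namespace/workload names; grep for the push-path error keywords
kubectl logs -n cortex deploy/cortex-distributor --since=15m | grep -Ei 'timeout|out of order|err'
kubectl logs -n cortex statefulset/cortex-ingester --since=15m | grep -Ei 'timeout|out of order|err'
```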

Every 2 hours (assuming 2-hour blocks) the ingester compacts the head (creating a new block). This can make the ingester use more resources (CPU, disk) and may cause timeouts. Can you check the ingesters' CPU during that time?

alanprot avatar May 26 '22 07:05 alanprot
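
For reference, that 2-hour cadence comes from the TSDB block range on the ingesters. A minimal sketch of the relevant blocks-storage config (the value shown is the usual default; verify against your own config):

```yaml
blocks_storage:
  tsdb:
    # The in-memory head is compacted into an immutable block once per block range,
    # which is where the every-2-hours resource spike lines up.
    block_ranges_period: [2h]
```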

Every 2 hours (assuming 2-hour blocks) the ingester compacts the head (creating a new block). This can make the ingester use more resources (CPU, disk) and may cause timeouts. Can you check the ingesters' CPU during that time?

One of them has high CPU usage (4 cores). I increased it, but after the Cortex restart the error rate went up and never came back down. I'm unable to recover the cluster; data is always delayed (image 2). Are there any tricks for tuning a Cortex cluster, or any monitoring tools for Cortex? [image 1] [image 2]

sockyone avatar May 26 '22 10:05 sockyone

From your logs it's visible that the distributors are timing out (500) on push. That means a distributor is trying to push to an ingester that is no longer there. Are the ingesters restarting? Are they running out of memory? That would explain the high CPU usage after restart and the timeouts. An ingester that went away needs to be manually removed from the ring ("forget"), as sketched below.
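
A sketch of that forget operation, assuming the admin ring page is reachable; the hostname, port, and instance ID below are placeholders for your own values:

```bash
# The /ring page (e.g. on a distributor) accepts a POST naming the instance to forget
curl -X POST -d 'forget=ingester-3' http://cortex-distributor.cortex.svc:8080/ring
```

The same can be done from the "Forget" button when opening the /ring page in a browser.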

@sockyone for monitoring you typically set up a Prometheus instance to scrape all Cortex components and use the cortex-mixin, which includes alerts and dashboards (a minimal scrape config is sketched after this comment). It's much easier to debug problems when the standard alerts are configured.

friedrich-at-adobe avatar Jun 01 '22 18:06 friedrich-at-adobe
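
A minimal sketch of that scrape setup, assuming the Cortex pods live in a `cortex` namespace and carry the standard helm labels (both are assumptions; the cortex-mixin dashboards additionally expect consistent `cluster`/`namespace`/`job` labels):

```yaml
scrape_configs:
  - job_name: cortex
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [cortex]
    relabel_configs:
      # Keep only Cortex pods; the label name and value here are assumptions
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        regex: cortex
        action: keep
      # Expose the component (distributor, ingester, ...) as a label for dashboards
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
        target_label: component
```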

@friedrich-at-adobe I gave each ingester ~24 GiB of RAM and I have 12 ingesters, for only 10 million time series. They shouldn't be running out of memory. Memory was fine and CPU didn't reach the limit, but I still see a high rate of errors every 2 hours. How can I get through this peak better?

sockyone avatar Jun 13 '22 16:06 sockyone
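
As a rough sanity check on those numbers (assumptions, not measurements): with Cortex's default replication factor of 3, 10 million active series become ~30 million series across the ring, i.e. roughly 2.5 million per ingester with 12 ingesters. At a ballpark of a few KB of head memory per active series that is on the order of 5-15 GiB per ingester, so 24 GiB leaves headroom, but less than it first appears, especially during head compaction.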

Something I ran into was networking bottlenecks within the Kubernetes cluster I was running on. Moving to a faster CNI (replacing kube-proxy) helped solve my "every 2 hours my distributor breaks" problem.

TaylorMutch avatar Jul 10 '22 22:07 TaylorMutch

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 07 '23 17:01 stale[bot]