Cortex gets lots of errors every 2 hours
Describe the bug
We are monitoring the error codes returned by Cortex. We noticed that every 2 hours the error rate spikes sharply and our data gets delayed. I don't think this is caused by the remote write protocol. Is it some periodic Cortex job or something similar?
Environment:
- Infrastructure: Kubernetes
- Deployment tool: helm chart
Storage Engine
- [x] Blocks
Can you share the logs from the distributor or the response body? That may give some clues.
@alanprot It's mostly code 500 and code 400. I can't find any more helpful logs.
They're timeout and out-of-order sample errors, but why do they only happen every 2 hours?
Every 2 hours (assuming 2-hour blocks) the ingester should be compacting the head (creating a new block). This may cause the ingester to use more resources (CPU, disk) and maybe time out. Can you check the ingesters' CPU during that time?
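For illustration, a minimal sketch (Python, assuming a Prometheus instance already scrapes cAdvisor and the ingesters; the metric names and the `ingester-*` pod naming are assumptions that may differ per setup) of how one might check whether the CPU spikes line up with head compactions:

```python
# Sketch: correlate ingester CPU with TSDB head compactions via the Prometheus HTTP API.
# Assumptions: Prometheus is reachable at PROM_URL, it scrapes cAdvisor and the ingesters,
# ingester pods are named "ingester-*", and the metric names below exist in your setup.
import requests

PROM_URL = "http://prometheus:9090"  # assumption: adjust to your Prometheus service


def instant_query(promql):
    """Run an instant query and return the list of samples."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=30)
    resp.raise_for_status()
    return resp.json()["data"]["result"]


# Per-pod ingester CPU over the last 5 minutes (cAdvisor metric).
cpu = instant_query(
    'sum by (pod) (rate(container_cpu_usage_seconds_total{pod=~"ingester.*"}[5m]))'
)

# Head compactions in the last 10 minutes (TSDB metric exposed by the ingester; name assumed).
compactions = instant_query(
    'sum by (pod) (increase(cortex_ingester_tsdb_compactions_total[10m]))'
)

for sample in cpu:
    print(sample["metric"].get("pod"), "cpu cores:", sample["value"][1])
for sample in compactions:
    print(sample["metric"].get("pod"), "head compactions (10m):", sample["value"][1])
```

Running this shortly after a 2-hour boundary should show whether the pods with the compaction activity are the same ones with the CPU spike.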
One of them has high CPU usage (4 cores). I also increased it, but after the Cortex restart the error rate went up and never came back down. I'm unable to recover the cluster; the data is always delayed. (image 2)
Are there any tricks for tuning a Cortex cluster, or any monitoring tools for Cortex?
From your logs it's visible that the distributors are timing out (500) on push. That means a distributor is trying to push to an ingester that is no longer there. Are the ingesters restarting? Are they running out of memory? That would explain the high CPU usage after a restart and the timeouts. A restarted ingester needs to be manually removed from the ring (forget).
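As a rough way to spot this condition, a small sketch (assuming Prometheus scrapes the Cortex components and exposes `cortex_ring_members` with `name` and `state` labels, as the cortex-mixin alerts expect) that flags ingesters the ring still considers unhealthy:

```python
# Sketch: flag ring members that the distributors still see as Unhealthy.
# Assumption: Prometheus (at PROM_URL) scrapes the Cortex components and exposes
# cortex_ring_members with "name" and "state" labels.
import requests

PROM_URL = "http://prometheus:9090"  # assumption: adjust to your environment

query = 'max by (name, state) (cortex_ring_members{name="ingester", state="Unhealthy"})'
resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=30)
resp.raise_for_status()

unhealthy = [s for s in resp.json()["data"]["result"] if float(s["value"][1]) > 0]
if unhealthy:
    # A lingering unhealthy entry usually means the restarted ingester was never
    # forgotten; use the "Forget" action on the distributor's ring page to remove it.
    print("Unhealthy ingesters still in the ring -> forget them via the ring page")
else:
    print("No unhealthy ring members reported")
```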
@sockyone For monitoring, you typically set up a Prometheus instance to scrape all components and use the cortex-mixin, which includes alerts and dashboards. It's easier to debug problems when the typical alerts are configured.
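Until the mixin dashboards are in place, a quick sketch like this (again against the Prometheus HTTP API; the `job` and `route` regexes are assumptions and vary by version and deployment) can confirm which status codes actually spike every 2 hours:

```python
# Sketch: break the write-path request rate down by status code to see what spikes.
# Assumptions: Prometheus at PROM_URL scrapes the distributors, and the push route on
# cortex_request_duration_seconds_count matches the regex below (it varies by version).
import requests

PROM_URL = "http://prometheus:9090"  # assumption: adjust to your environment

query = (
    'sum by (status_code) ('
    'rate(cortex_request_duration_seconds_count{job=~".*distributor.*", route=~".*push.*"}[5m])'
    ')'
)
resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=30)
resp.raise_for_status()

for sample in resp.json()["data"]["result"]:
    print(sample["metric"].get("status_code", "unknown"), "req/s:", sample["value"][1])
```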
@friedrich-at-adobe I gave each ingester ~24 GiB of RAM and I have 12 ingesters for only 10 million time series. They can't be running out of memory. Memory was fine and CPU didn't reach the limit, but I still see a high rate of errors every 2 hours. How can I get through this peak better?
Something I ran into was networking bottlenecks within the Kubernetes cluster I was running. Switching to a faster CNI (instead of kube-proxy) helped solve my "every 2 hours my distributor breaks" problem.
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.