
Ingester throws 500s with nothing in the logs

jakubgs opened this issue 2 years ago • 14 comments

Describe the bug
I'm seeing 500s from ingesters, mostly from one of my ingesters:

[screenshot]

The problem is, I see no errors or warnings at the same time - and my log level is set to warn:

level=error ts=2022-08-10T07:38:28.43936122Z caller=push.go:51 org_id=fake msg="push error" err="rpc error: code = Code(400) desc = user=fake: err: out of order sample. timestamp=2022-08-10T07:37:19.29Z, series={__name__=\"process_cpu_seconds_total\", container=\"beacon-node-prater-stable\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.prater\", group=\",nimbus.prater,beacon,nimbus,metrics,\", instance=\"stable-large-01.aws-eu-central-1a.nimbus.prater\", job=\"beacon-node-metrics\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
level=error ts=2022-08-10T07:43:48.26174533Z caller=push.go:51 org_id=fake msg="push error" err="rpc error: code = Code(400) desc = user=fake: err: out of order sample. timestamp=2022-08-10T07:43:19.384Z, series={__name__=\"validator_monitor_prev_epoch_exits_total_created\", container=\"beacon-node-prater-testing\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.prater\", group=\",nimbus.prater,beacon,nimbus,metrics,\", instance=\"testing-large-01.aws-eu-central-1a.nimbus.prater\", job=\"beacon-node-metrics\", source=\"slave-01.aws-eu-central-1a.metrics.hq\", validator=\"total\"}"

Just two out of order sample errors much later that day. Nothing else. So I have no idea why those 500s happened.

Expected behavior
I'd expect to see errors, or at least warnings, about why the 500s happened.

Environment:

  • Infrastructure: metal
  • Deployment tool: Ansible

Storage Engine

  • [x] Blocks

Additional Context
I'm running the most recent Cortex (1.13.0) and the most recent Prometheus (2.37.0).

The weird thing is that everything chugs along fine for 10-20 hours, and then suddenly explodes for no apparent reason:

[screenshot]

And as far as I can tell, neither the distributors nor the ingesters are overloaded.

jakubgs avatar Aug 10 '22 08:08 jakubgs

The out of order samples should be a 4xx.

You don't have any other errors in the distributor logs?

alanprot avatar Aug 11 '22 01:08 alanprot

The only two errors I found on the distributor were the two out of order sample errors I added in the issue description:

level=error ts=2022-08-10T07:38:28.43936122Z caller=push.go:51 org_id=fake msg="push error" err="rpc error: code = Code(400) desc = user=fake: err: out of order sample. timestamp=2022-08-10T07:37:19.29Z, series={__name__=\"process_cpu_seconds_total\", container=\"beacon-node-prater-stable\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.prater\", group=\",nimbus.prater,beacon,nimbus,metrics,\", instance=\"stable-large-01.aws-eu-central-1a.nimbus.prater\", job=\"beacon-node-metrics\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
level=error ts=2022-08-10T07:43:48.26174533Z caller=push.go:51 org_id=fake msg="push error" err="rpc error: code = Code(400) desc = user=fake: err: out of order sample. timestamp=2022-08-10T07:43:19.384Z, series={__name__=\"validator_monitor_prev_epoch_exits_total_created\", container=\"beacon-node-prater-testing\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.prater\", group=\",nimbus.prater,beacon,nimbus,metrics,\", instance=\"testing-large-01.aws-eu-central-1a.nimbus.prater\", job=\"beacon-node-metrics\", source=\"slave-01.aws-eu-central-1a.metrics.hq\", validator=\"total\"}"

And the only warnings I could find on the ingester were these:

level=warn ts=2022-08-10T07:36:29.359491529Z caller=grpc_logging.go:38 method=/cortex.Ingester/Push duration=17.543642ms err="rpc error: code = Code(400) desc = user=fake: err: out of order sample. timestamp=2022-08-10T07:35:51.445Z, series={__name__=\"nim_gc_heap_instance_occupied_bytes\", container=\"beacon-node-prater-unstable\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.prater\", group=\",nimbus.prater,beacon,nimbus,metrics,\", instance=\"unstable-large-01.aws-eu-central-1a.nimbus.prater\", job=\"beacon-node-metrics\", source=\"slave-01.aws-eu-central-1a.metrics.hq\", type_name=\"seq[Validator]\"}" msg="gRPC\n"
level=warn ts=2022-08-10T07:36:58.763925567Z caller=grpc_logging.go:38 method=/cortex.Ingester/Push duration=7.715986ms err="rpc error: code = Code(400) desc = user=fake: err: out of order sample. timestamp=2022-08-10T07:36:19.29Z, series={__name__=\"validator_monitor_prev_epoch_attestations_min_delay_seconds_bucket\", container=\"beacon-node-prater-stable\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.prater\", group=\",nimbus.prater,beacon,nimbus,metrics,\", instance=\"stable-large-01.aws-eu-central-1a.nimbus.prater\", job=\"beacon-node-metrics\", le=\"7.5\", source=\"slave-01.aws-eu-central-1a.metrics.hq\", validator=\"total\"}" msg="gRPC\n"
level=warn ts=2022-08-10T07:36:58.970925766Z caller=grpc_logging.go:38 method=/cortex.Ingester/Push duration=8.095045ms err="rpc error: code = Code(400) desc = user=fake: err: out of order sample. timestamp=2022-08-10T07:36:19.29Z, series={__name__=\"nim_gc_heap_instance_occupied_bytes\", container=\"beacon-node-prater-stable\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.prater\", group=\",nimbus.prater,beacon,nimbus,metrics,\", instance=\"stable-large-01.aws-eu-central-1a.nimbus.prater\", job=\"beacon-node-metrics\", source=\"slave-01.aws-eu-central-1a.metrics.hq\", type_name=\"seq[VoteTracker]\"}" msg="gRPC\n"
level=warn ts=2022-08-10T07:37:07.306325393Z caller=grpc_logging.go:38 method=/cortex.Ingester/Push duration=10.163365ms err="rpc error: code = Code(400) desc = user=fake: err: out of order sample. timestamp=2022-08-10T07:36:19.29Z, series={__name__=\"process_cpu_seconds_total\", container=\"beacon-node-prater-stable\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.prater\", group=\",nimbus.prater,beacon,nimbus,metrics,\", instance=\"stable-large-01.aws-eu-central-1a.nimbus.prater\", job=\"beacon-node-metrics\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}" msg="gRPC\n"
level=warn ts=2022-08-10T07:38:28.43342672Z caller=grpc_logging.go:38 method=/cortex.Ingester/Push duration=8.06236ms err="rpc error: code = Code(400) desc = user=fake: err: out of order sample. timestamp=2022-08-10T07:37:19.29Z, series={__name__=\"process_cpu_seconds_total\", container=\"beacon-node-prater-stable\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.prater\", group=\",nimbus.prater,beacon,nimbus,metrics,\", instance=\"stable-large-01.aws-eu-central-1a.nimbus.prater\", job=\"beacon-node-metrics\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}" msg="gRPC\n"
level=warn ts=2022-08-10T07:43:25.467252954Z caller=grpc_logging.go:38 method=/cortex.Ingester/Push duration=20.116283ms err="rpc error: code = Code(400) desc = user=fake: err: out of order sample. timestamp=2022-08-10T07:42:45.904Z, series={__name__=\"process_cpu_seconds_total\", container=\"rocketpool-eth2-node\", datacenter=\"aws-eu-central-1a\", fleet=\"rocket.prod\", group=\",rocket.prod,beacon,nimbus,metrics,\", instance=\"node-01.aws-eu-central-1a.rocket.prod\", job=\"beacon-node-metrics\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}" msg="gRPC\n"
level=warn ts=2022-08-10T07:43:46.646514911Z caller=grpc_logging.go:38 method=/cortex.Ingester/Push duration=7.680135ms err="rpc error: code = Code(400) desc = user=fake: err: out of order sample. timestamp=2022-08-10T07:43:19.384Z, series={__name__=\"validator_monitor_aggregated_attestation_delay_seconds_bucket\", container=\"beacon-node-prater-testing\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.prater\", group=\",nimbus.prater,beacon,nimbus,metrics,\", instance=\"testing-large-01.aws-eu-central-1a.nimbus.prater\", job=\"beacon-node-metrics\", le=\"0.75\", source=\"slave-01.aws-eu-central-1a.metrics.hq\", src=\"api\", validator=\"total\"}" msg="gRPC\n"

Which is the same thing. It's all out of order samples, and they start at 07:36:29, while the 500 errors started at 07:14:00:

[screenshot]

With two small bumps before that too.

jakubgs avatar Aug 11 '22 07:08 jakubgs

Okay, I think I see what it is. One of the ingesters is starting to SWAP when this happens:

[screenshot]

Which matches with the ingester failures jump:

[screenshot]

jakubgs avatar Aug 11 '22 07:08 jakubgs

Actually, it JUST happened again on the same host:

[screenshot]

[screenshot]

The swap.free metric recovers because I restarted the ingester at 07:09:46.

So the cause appears to be found, but neither the distributor nor the ingester have any errors or warnings in the logs.

No idea if this is doable, but it would definitely be a good idea for ingesters to log an error when they fail to handle a request and return a 500, with some information on why it happened.

jakubgs avatar Aug 11 '22 07:08 jakubgs

If the error causes a 5xx for the end user (the distributor could not write the quorum), Cortex prints the error… if it is just one ingester having a problem, it indeed does not print…

If swap is the problem, the ingester probably became slow… can you see the number of in-flight requests on the ingester (Cortex has a metric for it) and check whether it spiked as well during the event?
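For anyone following along, a rough PromQL sketch of the query being suggested here - assuming the cortex_inflight_requests gauge exported by Cortex's server instrumentation (the exact metric and label names are an assumption and may vary between versions):

# In-flight push requests per ingester (assumed metric/label names):
max by (instance) (cortex_inflight_requests{route="/cortex.Ingester/Push"})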

alanprot avatar Aug 11 '22 15:08 alanprot

I can indeed:

[screenshot]

If I hadn't restarted the ingester, it would have eventually caused a massive increase in desired shards, hit max_shards, and led to dropped metrics. I've already scaled the hosts up to have more memory, but I'd consider the lack of errors in the logs to be a bug.
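For reference, the shard ramp-up described above can be watched on the Prometheus side with its own remote-write metrics - a sketch using metric names as they exist in recent 2.x releases (older releases used slightly different names):

# Desired vs. configured maximum shards for each remote-write queue:
prometheus_remote_storage_shards_desired
prometheus_remote_storage_shards_max

# Samples that failed to send with non-recoverable errors:
rate(prometheus_remote_storage_samples_failed_total[5m])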

jakubgs avatar Aug 11 '22 19:08 jakubgs

Actually no, it's not resolved. It happened again, and this time it's definitely not memory/swap:

[screenshot]

A big drop in request responses on the distributors, and then, a bit later, 500s on one of the distributors:

[screenshot]

A big jump in in-flight requests for the ingester:

[screenshot]

And, with a delay, a jump in discarded samples:

[screenshot]

Both the distributor hosts and the ingester hosts are under-utilized in terms of CPU and memory, and YET, after 31 hours of working fine without issues, it all suddenly explodes:

[screenshot]

jakubgs avatar Aug 12 '22 07:08 jakubgs

Actually, I found something in the logs this time:

ingest-01

ts=2022-08-12T01:04:06.77013867Z caller=client.go:155 msg="error getting key" key=collectors/ring err="Unexpected response code: 500 (rpc error making call: EOF)" level=error

ingest-02

ts=2022-08-12T01:04:05.818837201Z caller=client.go:155 msg="error getting key" key=collectors/ring err="Get \"http://localhost:8500/v1/kv/collectors/ring\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" level=error
ts=2022-08-12T01:04:58.150319929Z caller=client.go:155 msg="error getting key" key=collectors/ring err="Unexpected response code: 500 (rpc error making call: rpc error making call: No cluster leader)" level=error

master-01

ts=2022-08-12T01:04:02.642247257Z caller=client.go:242 msg="error getting path" key=collectors/ring err="Unexpected response code: 500 (rpc error making call: EOF)" level=error

master-02

ts=2022-08-12T01:09:40.981537136Z caller=client.go:242 msg="error getting path" key=collectors/ring err="Get \"http://localhost:8500/v1/kv/collectors/ring?index=98791502&stale=&wait=10000ms\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" level=error 
ts=2022-08-12T01:09:51.40172645Z caller=push.go:34 org_id=fake err="read tcp 10.10.0.73:9092->10.11.0.5:47574: i/o timeout" level=error 
ts=2022-08-12T01:09:51.517729845Z caller=push.go:34 org_id=fake err="read tcp 10.10.0.73:9092->10.12.0.5:46364: i/o timeout" level=error 

It seems like it couldn't reach the Consul KV store temporarily, and that started the whole thing. But why didn't it recover when Consul recovered...
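For what it's worth, a quick sanity check against the local Consul agent (the same localhost:8500 the ring client is talking to in the logs above) can confirm whether the cluster really had no leader during that window - a sketch, not part of the original report:

# Returns the address of the current Raft leader, or an empty string if there is none:
curl -s http://localhost:8500/v1/status/leader

# Confirms the ring key is readable again once the leader is back:
curl -s 'http://localhost:8500/v1/kv/collectors/ring?keys'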

jakubgs avatar Aug 12 '22 07:08 jakubgs

Yes, but this appears unrelated to the memory/swap problem that originally made me create this issue.

jakubgs avatar Aug 12 '22 07:08 jakubgs

It appears that this time the issue was temporary connectivity problems in the Consul cluster causing a lack of leader for about 2 minutes, but why would that alone cause Cortex to completely break for hours...

jakubgs avatar Aug 12 '22 07:08 jakubgs

Right now my ingesters are generating thousands of 500s per minute:

[screenshot]

And yet there are no errors in the logs... How am I supposed to debug this?

jakubgs avatar Aug 14 '22 08:08 jakubgs

Ok, this is the only thing I can find on distributor nodes:

 > journalctl -au cortex | grep -v -e 'out of bounds' -e 'out of order' | tail
Aug 13 20:56:59 master-01.do-ams3.metrics.hq systemd[1]: Started "Cortex - Horizontally scalable, highly available, multi-tenant, long term Prometheus".
Aug 13 23:24:19 master-01.do-ams3.metrics.hq cortex[759187]: level=error ts=2022-08-13T23:24:19.812092776Z caller=push.go:34 org_id=fake err="unexpected EOF"
Aug 14 00:14:39 master-01.do-ams3.metrics.hq cortex[759187]: level=error ts=2022-08-14T00:14:39.831137927Z caller=push.go:34 org_id=fake err="unexpected EOF"
Aug 14 02:33:09 master-01.do-ams3.metrics.hq cortex[759187]: level=error ts=2022-08-14T02:33:09.568804309Z caller=push.go:34 org_id=fake err="unexpected EOF"
Aug 14 02:34:19 master-01.do-ams3.metrics.hq cortex[759187]: level=error ts=2022-08-14T02:34:19.710642918Z caller=push.go:34 org_id=fake err="unexpected EOF"
Aug 14 02:35:29 master-01.do-ams3.metrics.hq cortex[759187]: level=error ts=2022-08-14T02:35:29.749805626Z caller=push.go:34 org_id=fake err="unexpected EOF"
Aug 14 05:51:29 master-01.do-ams3.metrics.hq cortex[759187]: level=error ts=2022-08-14T05:51:29.452965937Z caller=push.go:34 org_id=fake err="unexpected EOF"
Aug 14 07:18:29 master-01.do-ams3.metrics.hq cortex[759187]: level=error ts=2022-08-14T07:18:29.745044269Z caller=push.go:34 org_id=fake err="unexpected EOF"
Aug 14 07:34:49 master-01.do-ams3.metrics.hq cortex[759187]: level=error ts=2022-08-14T07:34:49.506017965Z caller=push.go:34 org_id=fake err="unexpected EOF"
Aug 14 08:05:49 master-01.do-ams3.metrics.hq cortex[759187]: level=error ts=2022-08-14T08:05:49.596794488Z caller=push.go:34 org_id=fake err="unexpected EOF"

Not sure what that means, and those are quite sparse compared to the number of errors in metrics.

jakubgs avatar Aug 14 '22 08:08 jakubgs

And ingesters have literally no errors aside from the out of bounds and out of order ones:

 > journalctl -au cortex | grep -v -e 'out of bounds' -e 'out of order'
-- Logs begin at Sun 2022-08-14 01:58:15 UTC, end at Sun 2022-08-14 08:34:56 UTC. --

So how exactly can one figure out what's causing those 500s?
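One way to narrow things down when the logs are silent is to break the server-side request metrics down by route and status code - a sketch assuming the standard cortex_request_duration_seconds histogram (label names may differ across versions):

# 5xx rate per route and ingester instance:
sum by (instance, route) (
  rate(cortex_request_duration_seconds_count{status_code=~"5.."}[5m])
)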

jakubgs avatar Aug 14 '22 08:08 jakubgs

I noticed that setting min_backoff and max_backoff to the same value (in my case, 5 seconds) helps mitigate the 500s. My understanding is that setting both to the same value limits retries to just one, which avoids flooding the Cortex nodes with retried batches of samples.
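For anyone wanting to try the same mitigation, these knobs live under queue_config in the Prometheus remote_write section - a minimal sketch, with the Cortex endpoint URL as a placeholder:

remote_write:
  - url: http://cortex.example.org/api/v1/push   # placeholder endpoint
    queue_config:
      min_backoff: 5s   # floor of the retry backoff
      max_backoff: 5s   # equal to min_backoff, so the backoff never grows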

jakubgs avatar Aug 16 '22 10:08 jakubgs

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 07 '23 17:01 stale[bot]