Metrics show append failures but logs show none
Describe the bug
I'm seeing lots of errors reported by the cortex_distributor_ingester_append_failures_total metric on one of our distributors:
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.211:9095",status="4xx",type="samples"} 318
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.211:9095",status="5xx",type="metadata"} 14
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.211:9095",status="5xx",type="samples"} 1670
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.212:9095",status="4xx",type="samples"} 248
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.212:9095",status="5xx",type="metadata"} 13
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.212:9095",status="5xx",type="samples"} 1991
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.213:9095",status="4xx",type="samples"} 68
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.213:9095",status="5xx",type="samples"} 23041
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.214:9095",status="4xx",type="samples"} 128
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.214:9095",status="5xx",type="metadata"} 44
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.214:9095",status="5xx",type="samples"} 5642
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.218:9095",status="4xx",type="samples"} 97
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.218:9095",status="5xx",type="samples"} 36903
These errors can be seen in the graph:
But when I log onto the affected distributor and ingester I cannot find any errors logged, even at debug level:
ts=2024-03-06T09:53:16.116085832Z caller=grpc_logging.go:46 level=debug method=/cortex.Ingester/Push duration=4.272868ms msg="gRPC (success)"
ts=2024-03-06T09:53:16.133388069Z caller=grpc_logging.go:46 level=debug method=/cortex.Ingester/Push duration=5.270535ms msg="gRPC (success)"
ts=2024-03-06T09:53:16.146464557Z caller=grpc_logging.go:46 level=debug method=/cortex.Ingester/Push duration=4.954335ms msg="gRPC (success)"
ts=2024-03-06T09:53:16.157419974Z caller=grpc_logging.go:46 level=debug method=/cortex.Ingester/Push duration=4.418061ms msg="gRPC (success)"
To Reproduce
I have no idea.
Expected behavior
Either the Distributor or the Ingester shows errors in the logs so the issue can actually be debugged.
Environment:
Prometheus 2.50.1 sending to Cortex 1.16.0.
As recommended by @friedrichg, I ran this query for the affected ingester:
histogram_quantile(0.99,sum(rate(cortex_request_duration_seconds_bucket{container="ingester",route="/cortex.Ingester/Push"}[5m])) by (le,route,status_code))
But the ingester appeared to be healthy:
Then it was recommended to query this for the affected distributor:
histogram_quantile(0.99,sum(rate(cortex_request_duration_seconds_bucket{container="distributor"}[5m])) by (le,route,status_code))
And the result was:
Which shows 400 errors for push requests. But no 400 errors were logged by the distributor, even with the log level set to debug.
Share this, if you can:
sum(rate(cortex_request_duration_seconds_bucket{container="distributor"}[5m])) by (route,status_code)
The only two warnings I managed to find were:
ts=2024-03-06T09:40:15.643029269Z caller=push.go:53 level=warn org_id=fake msg="push refused" err="rpc error: code = Code(400) desc = maxFailure (quorum) on a given error family, rpc error: code = Code(400) desc = addr=10.10.0.211:9095 state=ACTIVE zone=, rpc error: code = Code(400) desc = user=fake: err: duplicate sample for timestamp. timestamp=2024-03-06T09:40:09.183Z, series={__name__=\"consul_consul_members_clients\", datacenter=\"aws-eu-central-1a\", fleet=\"consul.hq\", group=\",consul.hq,metrics,consul,\", instance=\"node-03.aws-eu-central-1a.consul.hq\", job=\"consul-metrics\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
ts=2024-03-06T09:57:16.640627601Z caller=push.go:53 level=warn org_id=fake msg="push refused" err="rpc error: code = Code(400) desc = maxFailure (quorum) on a given error family, rpc error: code = Code(400) desc = addr=10.10.0.211:9095 state=ACTIVE zone=, rpc error: code = Code(400) desc = user=fake: err: duplicate sample for timestamp. timestamp=2024-03-06T09:57:07.986Z, series={__name__=\"consul_consul_state_nodes\", datacenter=\"aws-eu-central-1a\", fleet=\"consul.hq\", group=\",consul.hq,metrics,consul,\", instance=\"node-01.aws-eu-central-1a.consul.hq\", job=\"consul-metrics\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
But that doesn't match the number of append failures in the distributor metric.
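To put rough numbers on that, here is a rate query over the same metric, plus a look at cortex_discarded_samples_total broken down by reason (I'm not sure whether the TSDB duplicate-timestamp rejections are counted there, so treat the second query as a guess):
# Append failure rate per ingester, status and type.
sum by (ingester, status, type) (rate(cortex_distributor_ingester_append_failures_total[5m]))
# Rejection reasons, if these rejections show up as discarded samples.
sum by (reason) (rate(cortex_discarded_samples_total[5m]))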
I managed to measure some packet loss between distributor and ingester:
But packet loss shouldn't cause 400s; it would cause TCP retransmissions and eventually timeouts.
Not sure if it matters, but the distributors sometimes print stuff like this:
2024/03/06 10:13:50 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/03/06 10:13:50 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/03/06 10:16:30 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/03/06 10:16:30 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
I cannot find any existing issues about it in this repo.
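From what I can tell, the "too_many_pings" GOAWAY means the gRPC client is sending keepalive pings more often than the server allows. If that's what is happening here, relaxing the server-side keepalive enforcement might help; a rough sketch, assuming grpc_server_min_time_between_pings and grpc_server_ping_without_stream_allowed in the server block are the right knobs for this:
server:
  # Assumption: allow clients to ping more often, and to ping without active streams.
  grpc_server_min_time_between_pings: '10s'
  grpc_server_ping_without_stream_allowed: true
I haven't verified that this makes the GOAWAY messages go away.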
Eventually the cluster explodes and distributor logs are full of this:
ts=2024-03-06T10:22:34.727004737Z caller=logging.go:86 level=warn traceID=300fd752f4382b17 msg="POST /api/v1/push (500) 7.643734ms Response: \"at least 2 live replicas required, could only find 0 - unhealthy instances: 10.10.0.214:9095,10.10.0.212:9095,10.10.0.211:9095,10.10.0.218:9095,10.10.0.213:9095\\n\""
Which honestly doesn't look like a warning-level message, but more like a severe error-level message.
What baffles me though is that the ingester instances were at no point overloaded; they are mostly idle:
And the load never went above 2.0 on a machine with 8 cores:
So why are they suddenly considered not "alive"? No idea.
I was asked to run this query:
histogram_quantile(0.99,sum(rate(cortex_kv_request_duration_seconds_bucket[5m])) without (container,job,pod,instance))
Result:
The distributor in question was also at no point overloaded, with load below 3 on a host with 8 cores:
And plenty of free RAM left.
@jakubgs can you share your distributor config too? thanks
Sure:
---
target: 'distributor'
auth_enabled: false
# ---------------------- Configs --------------------------
configs:
  database:
    uri: 'memory://'
# ---------------------- Limits ---------------------------
limits:
  ingestion_rate: 1000000
  ingestion_burst_size: 2000000
  # Max active metrics with meta per user, per ingester.
  max_metadata_per_user: 128000
  # Limit impact from high cardinality.
  max_series_per_metric: 60000
  max_series_per_user: 5000000
  max_label_names_per_series: 30
  # Delete blocks containing samples older than this.
  compactor_blocks_retention_period: 0
  # Maximum accepted sample age before rejecting.
  reject_old_samples: true
  reject_old_samples_max_age: 10m
  # Allowed time window for ingestion of out-of-order samples.
  out_of_order_time_window: 5m
# ---------------------- Server ---------------------------
server:
  http_listen_address: '0.0.0.0'
  http_listen_port: 9092
  grpc_listen_address: '0.0.0.0'
  grpc_listen_port: 9095
  log_level: 'info'
  # Big queries need bigger message size.
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 16777216
  # Bump gRPC concurrency to avoid delays.
  grpc_server_max_concurrent_streams: 1000
# ---------------------- Storage --------------------------
storage:
  engine: 'blocks'
# ---------------------- Distributor ----------------------
distributor:
  # Low values cause `context deadline exceeded` on push to ingesters.
  remote_timeout: '30s'
  ring:
    kvstore:
      store: etcd
      etcd:
        username: cortex
        password: SECRET
        endpoints: ['10.10.0.13:2379', '10.10.0.14:2379', '10.10.0.15:2379']
# ---------------------- Ingester -------------------------
ingester:
  lifecycler:
    tokens_file_path: '/var/tmp/cortex/wal/tokens'
    interface_names: [wg0]
    ring:
      kvstore:
        store: etcd
        etcd:
          username: cortex
          password: SECRET
          endpoints: ['10.10.0.13:2379', '10.10.0.14:2379', '10.10.0.15:2379']
# Big queries need bigger message size.
grpc_server_max_recv_msg_size: 104857600
grpc_server_max_send_msg_size: 16777216
# Bump gRPC concurrency to avoid delays.
grpc_server_max_concurrent_streams: 1000
I have never changed that. I've run clusters of all sizes without ever touching it. I use https://github.com/cortexproject/cortex-jsonnet/blob/main/cortex/distributor.libsonnet#L24-L26 for distributors.
It's needed for ingesters though: https://github.com/cortexproject/cortex-jsonnet/blob/main/cortex/ingester.libsonnet#L34-L36
I can remove it from the distributor's config, but I doubt that has any bearing on the latency.
I have applied your suggested changes to the gRPC server configuration, but the latency issues persist.
From your message, 10.10.0.214:9095,10.10.0.212:9095,10.10.0.211:9095,10.10.0.218:9095,10.10.0.213:9095 are the problematic ingesters. It might be worth checking whether it is always the same ingesters in the message.
Those are all my ingesters.
What I also wonder is why the distributor hosts are generating 10x the amount of outgoing traffic compared to incoming:
How does that happen? Does the distributor receive data compressed and send it to the ingesters uncompressed, or something?
I don't understand. Every time the sample rate increases beyond 13k samples:
Ingester append failures (400s) start to rise:
And push latency explodes:
And eventually everything crashes, all while ingesters and distributors are mostly idle:
I'm baffled.
And when everything crashes distributor logs are full of this:
ts=2024-03-06T14:53:06.762792796Z caller=logging.go:86 level=warn traceID=33cb0c7912c40196 msg="POST /api/v1/push (500) 102.578691ms Response: \"No metric name label\\n\" ws: false; Content-Encoding: snappy; Content-Length: 51463; Content-Type: application/x-protobuf; Retry-Attempt: 50; User-Agent: Prometheus/2.50.1; X-Forwarded-For: 172.17.2.2; X-Forwarded-Proto: http; X-Prometheus-Remote-Write-Version: 0.1.0; X-Real-Ip: 172.17.2.2; "
No idea why. Related issue:
- https://github.com/cortexproject/cortex/issues/5802
Restarting Prometheus makes it go away. No idea why.
I also get this when everything eventually comes crashing down:
ts=2024-03-06T15:00:20.39654373Z caller=push.go:34 level=error org_id=fake err="unexpected EOF"
ts=2024-03-06T15:00:20.42643706Z caller=push.go:34 level=error org_id=fake err="unexpected EOF"
ts=2024-03-06T15:00:20.438756524Z caller=push.go:34 level=error org_id=fake err="unexpected EOF"
What I also wonder is why the distributor hosts are generating 10x the amount of outgoing traffic compared to incoming:
That seems to be the case. From your configuration, I don't see that you enable gRPC compression on the gRPC client to the ingester: -ingester.client.grpc-compression.
Response: "at least 2 live replicas required, could only find 0 - unhealthy instances: 10.10.0.214:9095,10.10.0.212:9095,10.10.0.211:9095,10.10.0.218:9095,10.10.0.213:9095\n"
It seems that your Ingesters failed the health check. The reason for the failure could vary. They will be considered unhealthy if they don't heartbeat to the ring kvstore (etcd in this case) in time. I recommend you check the Ingester ring status.
There is a web UI exposed where you can check whether those ingesters are healthy or not.
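For reference, the ring page should be served by the distributor's HTTP endpoint (port 9092 in the config above), typically at /ring, and the ring state is also exported as the cortex_ring_members metric, so unhealthy members can be spotted with a query along these lines:
cortex_ring_members{name="ingester", state="Unhealthy"} > 0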
@yeya24 but I see no compression flag for distributors; these are the only ones I can find: https://cortexmetrics.io/docs/configuration/configuration-file/
# Use compression when sending messages. Supported values are: 'gzip',
# 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
# CLI flag: -query-scheduler.grpc-client-config.grpc-compression
[grpc_compression: <string> | default = ""]
# Use compression when sending messages. Supported values are: 'gzip',
# 'snappy' and '' (disable compression)
# CLI flag: -alertmanager.alertmanager-client.grpc-compression
[grpc_compression: <string> | default = ""]
# Use compression when sending messages. Supported values are: 'gzip',
# 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
# CLI flag: -querier.frontend-client.grpc-compression
[grpc_compression: <string> | default = ""]
# Use compression when sending messages. Supported values are: 'gzip',
# 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
# CLI flag: -ingester.client.grpc-compression
[grpc_compression: <string> | default = ""]
# Use compression when sending messages. Supported values are: 'gzip',
# 'snappy' and '' (disable compression)
# CLI flag: -querier.store-gateway-client.grpc-compression
[grpc_compression: <string> | default = ""]
# Use compression when sending messages. Supported values are: 'gzip',
# 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
# CLI flag: -frontend.grpc-client-config.grpc-compression
[grpc_compression: <string> | default = ""]
# Use compression when sending messages. Supported values are: 'gzip',
# 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
# CLI flag: -ruler.client.grpc-compression
[grpc_compression: <string> | default = ""]
Is it just not documented?
You're suggesting to use ingester.client.grpc-compression, but the documentation says:
# Use compression when sending messages. Supported values are: 'gzip',
# 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
But isn't the distributor sending messages to the ingester, and not the ingester sending them to the distributor? Confusing.
Also, which type of compression do you recommend?
Oh, I see, the ingester_client configuration is different from the ingester configuration.
Any good reason why compression isn't enabled by default?
Indeed, enabling compression has reduced outgoing traffic from the distributors by 10x, which is now about the same as the incoming traffic:
This alone might make a big difference.
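For reference, the change boils down to something like this in the distributor config (snappy is only an example; the exact compression algorithm used above isn't stated):
# ---------------------- Ingester client ------------------
ingester_client:
  grpc_client_config:
    # Compress distributor -> ingester push traffic.
    grpc_compression: 'snappy'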