
Metrics show append failures but logs show none

Open jakubgs opened this issue 1 year ago • 40 comments

Describe the bug I'm seeing lots of errors from the cortex_distributor_ingester_append_failures_total metric on one of our distributors:

cortex_distributor_ingester_append_failures_total{ingester="10.10.0.211:9095",status="4xx",type="samples"} 318
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.211:9095",status="5xx",type="metadata"} 14
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.211:9095",status="5xx",type="samples"} 1670
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.212:9095",status="4xx",type="samples"} 248
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.212:9095",status="5xx",type="metadata"} 13
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.212:9095",status="5xx",type="samples"} 1991
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.213:9095",status="4xx",type="samples"} 68
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.213:9095",status="5xx",type="samples"} 23041
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.214:9095",status="4xx",type="samples"} 128
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.214:9095",status="5xx",type="metadata"} 44
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.214:9095",status="5xx",type="samples"} 5642
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.218:9095",status="4xx",type="samples"} 97
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.218:9095",status="5xx",type="samples"} 36903

These errors can be seen in the graph:

image
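For reference, a rate query along these lines should reproduce the graph above (a sketch; adjust the label selectors to your setup):

sum by (ingester, status) (rate(cortex_distributor_ingester_append_failures_total[5m]))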

But when I log onto the affected distributor and ingester I cannot find any errors logged, even at debug level:

ts=2024-03-06T09:53:16.116085832Z caller=grpc_logging.go:46 level=debug method=/cortex.Ingester/Push duration=4.272868ms msg="gRPC (success)"
ts=2024-03-06T09:53:16.133388069Z caller=grpc_logging.go:46 level=debug method=/cortex.Ingester/Push duration=5.270535ms msg="gRPC (success)"
ts=2024-03-06T09:53:16.146464557Z caller=grpc_logging.go:46 level=debug method=/cortex.Ingester/Push duration=4.954335ms msg="gRPC (success)"
ts=2024-03-06T09:53:16.157419974Z caller=grpc_logging.go:46 level=debug method=/cortex.Ingester/Push duration=4.418061ms msg="gRPC (success)"

To Reproduce I have no idea.

Expected behavior Either the distributor or the ingester shows errors in the logs so the issue can actually be debugged.

Environment: Prometheus 2.50.1 sending to Cortex 1.16.0.

jakubgs avatar Mar 06 '24 09:03 jakubgs

As recommended by @friedrichg, I ran this query for the affected ingester:

histogram_quantile(0.99,sum(rate(cortex_request_duration_seconds_bucket{container="ingester",route="/cortex.Ingester/Push"}[5m])) by (le,route,status_code))

But the ingester appeared to be healthy:

image

jakubgs avatar Mar 06 '24 10:03 jakubgs

Then it was recommended to query this for the affected distributor:

histogram_quantile(0.99,sum(rate(cortex_request_duration_seconds_bucket{container="distributor"}[5m])) by (le,route,status_code))

And the result was:

image

Which shows 400 errors for push requests. But no 400 errors were logged by the distributor, even with the log level set to debug.

jakubgs avatar Mar 06 '24 10:03 jakubgs

Share this, if you can:

sum(rate(cortex_request_duration_seconds_bucket{container="distributor"}[5m])) by (route,status_code)

friedrichg avatar Mar 06 '24 10:03 friedrichg

The only two warnings I managed to find were:

ts=2024-03-06T09:40:15.643029269Z caller=push.go:53 level=warn org_id=fake msg="push refused" err="rpc error: code = Code(400) desc = maxFailure (quorum) on a given error family, rpc error: code = Code(400) desc = addr=10.10.0.211:9095 state=ACTIVE zone=, rpc error: code = Code(400) desc = user=fake: err: duplicate sample for timestamp. timestamp=2024-03-06T09:40:09.183Z, series={__name__=\"consul_consul_members_clients\", datacenter=\"aws-eu-central-1a\", fleet=\"consul.hq\", group=\",consul.hq,metrics,consul,\", instance=\"node-03.aws-eu-central-1a.consul.hq\", job=\"consul-metrics\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
ts=2024-03-06T09:57:16.640627601Z caller=push.go:53 level=warn org_id=fake msg="push refused" err="rpc error: code = Code(400) desc = maxFailure (quorum) on a given error family, rpc error: code = Code(400) desc = addr=10.10.0.211:9095 state=ACTIVE zone=, rpc error: code = Code(400) desc = user=fake: err: duplicate sample for timestamp. timestamp=2024-03-06T09:57:07.986Z, series={__name__=\"consul_consul_state_nodes\", datacenter=\"aws-eu-central-1a\", fleet=\"consul.hq\", group=\",consul.hq,metrics,consul,\", instance=\"node-01.aws-eu-central-1a.consul.hq\", job=\"consul-metrics\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"

But that doesn't match the number of append failures in the distributor metric.
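One way to see which rejection reasons dominate on the write path is to break down discarded samples by reason (a sketch, assuming the cortex_discarded_samples_total metric is exposed by this Cortex version):

sum by (reason) (rate(cortex_discarded_samples_total[5m]))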

jakubgs avatar Mar 06 '24 10:03 jakubgs

sum(rate(cortex_request_duration_seconds_bucket{container="distributor"}[5m])) by (route,status_code)

image

jakubgs avatar Mar 06 '24 10:03 jakubgs

I managed to measure some packet loss between the distributor and ingester:

image

But packet loss shouldn't cause 400s; it would cause retransmissions and eventually timeouts.

jakubgs avatar Mar 06 '24 10:03 jakubgs

Not sure if it matters, but the distributors sometimes print messages like this:


2024/03/06 10:13:50 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/03/06 10:13:50 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/03/06 10:16:30 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/03/06 10:16:30 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".

Cannot find any issues about it in this repo.
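The too_many_pings GoAway usually means the gRPC client pings more often than the server's keepalive enforcement allows. A minimal sketch of the server-side knobs that relax this, assuming this Cortex version exposes them (values are illustrative; check the configuration reference for your version):

server:
  # Allow client keepalive pings as often as every 10s.
  grpc_server_min_time_between_pings: '10s'
  # Don't treat pings sent while no streams are open as abusive.
  grpc_server_ping_without_stream_allowed: true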

jakubgs avatar Mar 06 '24 10:03 jakubgs

Eventually the cluster explodes and the distributor logs are full of this:

ts=2024-03-06T10:22:34.727004737Z caller=logging.go:86 level=warn traceID=300fd752f4382b17
msg="POST /api/v1/push (500) 7.643734ms Response: \"at least 2 live replicas required, could only find 0 - unhealthy instances: 10.10.0.214:9095,10.10.0.212:9095,10.10.0.211:9095,10.10.0.218:9095,10.10.0.213:9095\\n\""

Which honestly doesn't look like a warning-level message, but more like a severe error-level message.

What baffles me though is that the ingester instances were at no point overloaded, they are mostly idle:

image

And the load never went above 2.0 on a machine with 8 cores:

image

So why is it suddenly considered not "alive"? No idea.

jakubgs avatar Mar 06 '24 10:03 jakubgs

I was asked to run this query:

histogram_quantile(0.99,sum(rate(cortex_kv_request_duration_seconds_bucket[5m])) without (container,job,pod,instance))

Result:

image

jakubgs avatar Mar 06 '24 10:03 jakubgs

The distributor in question was also at no point overloaded, with load below 3 on a host with 8 cores:

image

And plenty of free RAM left.

jakubgs avatar Mar 06 '24 10:03 jakubgs

@jakubgs can you share your distributor config too? thanks

friedrichg avatar Mar 06 '24 10:03 friedrichg

Sure:

---
target: 'distributor'
auth_enabled: false

# ---------------------- Configs --------------------------
configs:
  database:
    uri: 'memory://'

# ---------------------- Limits ---------------------------
limits:
  ingestion_rate: 1000000
  ingestion_burst_size: 2000000
  # Max active metrics with meta per user, per ingester.
  max_metadata_per_user: 128000
  # Limit impact from high cardinality.
  max_series_per_metric: 60000
  max_series_per_user: 5000000
  max_label_names_per_series: 30
  # Delete blocks containing samples older than this.
  compactor_blocks_retention_period: 0
  # Maximum accepted sample age before rejecting.
  reject_old_samples: true
  reject_old_samples_max_age: 10m
  # Allowed time window for ingestion of out-of-order samples.
  out_of_order_time_window: 5m

# ---------------------- Server ---------------------------
server:
  http_listen_address: '0.0.0.0'
  http_listen_port: 9092
  grpc_listen_address: '0.0.0.0'
  grpc_listen_port: 9095
  log_level: 'info'

  # Big queries need bigger message size.
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 16777216
  # Bump gRPC concurrency to avoid delays.
  grpc_server_max_concurrent_streams: 1000

# ---------------------- Storage --------------------------
storage:
  engine: 'blocks'

# ---------------------- Distributor ----------------------
distributor:
  # Low values cause `context deadline exceeded` on push to ingesters.
  remote_timeout: '30s'

  ring:
    kvstore:
      store: etcd
      etcd:
        username: cortex
        password: SECRET
        endpoints: ['10.10.0.13:2379', '10.10.0.14:2379', '10.10.0.15:2379']

# ---------------------- Ingester -------------------------
ingester:
  lifecycler:
    tokens_file_path: '/var/tmp/cortex/wal/tokens'
    interface_names: [wg0]

    ring:
      kvstore:
        store: etcd
        etcd:
          username: cortex
          password: SECRET
          endpoints: ['10.10.0.13:2379', '10.10.0.14:2379', '10.10.0.15:2379']

jakubgs avatar Mar 06 '24 10:03 jakubgs

  # Big queries need bigger message size.
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 16777216
  # Bump gRPC concurrency to avoid delays.
  grpc_server_max_concurrent_streams: 1000

I have never changed that. I've run clusters of all sizes without ever touching it. I use https://github.com/cortexproject/cortex-jsonnet/blob/main/cortex/distributor.libsonnet#L24-L26 for distributors

friedrichg avatar Mar 06 '24 10:03 friedrichg

It's needed for ingesters though: https://github.com/cortexproject/cortex-jsonnet/blob/main/cortex/ingester.libsonnet#L34-L36
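In config-file terms, that means carrying the larger gRPC limits in the ingester's server block, since the ingester is the server receiving the large Push requests. A sketch, reusing the values from the distributor config above (the linked jsonnet may use different values):

server:
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 16777216
  grpc_server_max_concurrent_streams: 1000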

friedrichg avatar Mar 06 '24 10:03 friedrichg

I can remove it from the distributor config, but I doubt it has any bearing on the latency.

jakubgs avatar Mar 06 '24 12:03 jakubgs

I have applied your suggested changes to the gRPC server configuration, but the latency issues persist.

jakubgs avatar Mar 06 '24 13:03 jakubgs

From your message, 10.10.0.214:9095, 10.10.0.212:9095, 10.10.0.211:9095, 10.10.0.218:9095 and 10.10.0.213:9095 are the problematic ingesters. It might be worth checking whether it is always the same ingesters in the message.

friedrichg avatar Mar 06 '24 14:03 friedrichg

Those are all my ingesters.

jakubgs avatar Mar 06 '24 14:03 jakubgs

What I also wonder is why the distributor hosts are generating 10x the amount of outgoing traffic compared to incoming:

image

How does that happen? Does the distributor receive data compressed and send it to the ingesters uncompressed, or something?
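One way to quantify that ratio over time is to compare transmit and receive rates on the distributor hosts (a sketch, assuming node_exporter metrics and a hypothetical instance label; adjust the selectors to your environment):

sum(rate(node_network_transmit_bytes_total{instance=~"distributor.*"}[5m]))
/
sum(rate(node_network_receive_bytes_total{instance=~"distributor.*"}[5m]))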

jakubgs avatar Mar 06 '24 14:03 jakubgs

I don't understand. Every time samples increase beyond 13k:

image

Ingester append failures (400s) start to rise:

image

And push latency explodes:

image image

And eventually everything crashes, all while ingesters and distributors are mostly idle:

image

I'm baffled.

jakubgs avatar Mar 06 '24 14:03 jakubgs

And when everything crashes, the distributor logs are full of this:

ts=2024-03-06T14:53:06.762792796Z caller=logging.go:86 level=warn traceID=33cb0c7912c40196
msg="POST /api/v1/push (500) 102.578691ms Response: \"No metric name label\\n\" ws: false; Content-Encoding: snappy; Content-Length: 51463; Content-Type: application/x-protobuf; Retry-Attempt: 50; User-Agent: Prometheus/2.50.1; X-Forwarded-For: 172.17.2.2; X-Forwarded-Proto: http; X-Prometheus-Remote-Write-Version: 0.1.0; X-Real-Ip: 172.17.2.2; "

No idea why. Related issue:

  • https://github.com/cortexproject/cortex/issues/5802

Restarting Prometheus makes it go away. No idea why.

jakubgs avatar Mar 06 '24 14:03 jakubgs

I also get this when everything eventually comes crashing down:

ts=2024-03-06T15:00:20.39654373Z caller=push.go:34 level=error org_id=fake err="unexpected EOF"
ts=2024-03-06T15:00:20.42643706Z caller=push.go:34 level=error org_id=fake err="unexpected EOF"
ts=2024-03-06T15:00:20.438756524Z caller=push.go:34 level=error org_id=fake err="unexpected EOF"

jakubgs avatar Mar 06 '24 15:03 jakubgs

What I also wonder is why are distributor hosts generating 10x the amount of outgoing traffic compared to incoming:

That seems to be the case. From your configuration, I don't see that you enable gRPC compression on the gRPC client to the ingester: -ingester.client.grpc-compression.

yeya24 avatar Mar 06 '24 16:03 yeya24

Response: "at least 2 live replicas required, could only find 0 - unhealthy instances: 10.10.0.214:9095,10.10.0.212:9095,10.10.0.211:9095,10.10.0.218:9095,10.10.0.213:9095\n"

It seems that your ingesters failed the health check. The reason for the failure can vary: they will be considered unhealthy if they don't heartbeat the ring kvstore (etcd in this case) in time. I recommend checking the ingester ring status.

There is a web UI exposed where you can check whether those ingesters are healthy or not.
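The ring state can also be watched from metrics (a sketch, assuming the cortex_ring_members metric is exposed by this version):

sum by (name, state) (cortex_ring_members)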

yeya24 avatar Mar 06 '24 16:03 yeya24

@yeya24 but I see no compression flag for distributors; these are the only ones I can find: https://cortexmetrics.io/docs/configuration/configuration-file/

    # Use compression when sending messages. Supported values are: 'gzip',
    # 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
    # CLI flag: -query-scheduler.grpc-client-config.grpc-compression
    [grpc_compression: <string> | default = ""]
  # Use compression when sending messages. Supported values are: 'gzip',
  # 'snappy' and '' (disable compression)
  # CLI flag: -alertmanager.alertmanager-client.grpc-compression
  [grpc_compression: <string> | default = ""]
  # Use compression when sending messages. Supported values are: 'gzip',
  # 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
  # CLI flag: -querier.frontend-client.grpc-compression
  [grpc_compression: <string> | default = ""]
  # Use compression when sending messages. Supported values are: 'gzip',
  # 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
  # CLI flag: -ingester.client.grpc-compression
  [grpc_compression: <string> | default = ""]
  # Use compression when sending messages. Supported values are: 'gzip',
  # 'snappy' and '' (disable compression)
  # CLI flag: -querier.store-gateway-client.grpc-compression
  [grpc_compression: <string> | default = ""]
  # Use compression when sending messages. Supported values are: 'gzip',
  # 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
  # CLI flag: -frontend.grpc-client-config.grpc-compression
  [grpc_compression: <string> | default = ""]
  # Use compression when sending messages. Supported values are: 'gzip',
  # 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
  # CLI flag: -ruler.client.grpc-compression
  [grpc_compression: <string> | default = ""]

Is it just not documented?

jakubgs avatar Mar 06 '24 17:03 jakubgs

You're suggesting I use ingester.client.grpc-compression, but the documentation says:

  # Use compression when sending messages. Supported values are: 'gzip',
  # 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)

But isn't the distributor sending messages to the ingester, rather than the ingester sending them to the distributor? Confusing.

jakubgs avatar Mar 06 '24 17:03 jakubgs

Also, which type of compression do you recommend?

jakubgs avatar Mar 06 '24 17:03 jakubgs

Oh, I see, the ingester_client configuration is different from the ingester configuration.
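For reference, a minimal sketch of what that looks like in the config file (assuming the ingester_client.grpc_client_config.grpc_compression field in this Cortex version):

ingester_client:
  grpc_client_config:
    # Supported values include 'gzip', 'snappy', 'snappy-block', 'zstd'.
    grpc_compression: 'snappy'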

jakubgs avatar Mar 06 '24 17:03 jakubgs

Any good reason why compression isn't enabled by default?

jakubgs avatar Mar 06 '24 17:03 jakubgs

Indeed, enabling compression has reduced outgoing traffic from the distributors by 10x, which is now roughly the same as incoming:

image

This alone might make a big difference.

jakubgs avatar Mar 06 '24 17:03 jakubgs