[k8s/mimir-distributed] ingesters crash
Describe the bug
The ingester pods crash from time to time, which also causes their readiness probe to fail.
To Reproduce
Steps to reproduce the behavior:
Deploy Mimir in distributed mode with the Helm chart, version mimir-distributed-3.0.0.
Expected behavior
I would like to understand the error and fix it. I would also like to know whether my configuration is correct.
Environment
- Infrastructure: Kubernetes on AKS (Azure Kubernetes Service)
- Deployment tool: Helm charts
Additional Context
For your information, the ingest load is around 40,000 requests/second.
Mimir configuration:
config: |
  tenant_federation:
    enabled: true
  activity_tracker:
    filepath: /data/metrics-activity.log
  alertmanager:
    data_dir: /data
    enable_api: true
    external_url: /alertmanager
    fallback_config_file: ""
  frontend:
    align_queries_with_step: true
    log_queries_longer_than: 10s
    {{- if index .Values "results-cache" "enabled" }}
    results_cache:
      backend: memcached
      memcached:
        addresses: {{ include "mimir.resultsCacheAddress" . }}
        max_item_size: {{ mul (index .Values "results-cache").maxItemMemory 1024 1024 }}
    cache_results: true
    {{- end }}
  frontend_worker:
    frontend_address: {{ template "mimir.fullname" . }}-query-frontend-headless.{{ .Release.Namespace }}.svc:{{ include "mimir.serverGrpcListenPort" . }}
  ingester:
    ring:
      final_sleep: 0s
      num_tokens: 512
      unregister_on_shutdown: true
      replication_factor: 3
  limits:
    max_label_names_per_series: 50
    ingestion_tenant_shard_size: 5
    compactor_tenant_shard_size: 2
    compactor_blocks_retention_period: 15m
    out_of_order_time_window: 15m
    max_global_series_per_user: 1000000
    ingestion_burst_size: 400000
    max_global_series_per_metric: 50000
    max_fetched_chunks_per_query: 3000000
    query_sharding_total_shards: 24
    query_sharding_max_sharded_queries: 256
    store_gateway_tenant_shard_size: 2
    compactor_split_and_merge_shards: 2
  memberlist:
    abort_if_cluster_join_fails: false
    compression_enabled: false
    join_members:
      - {{ include "mimir.fullname" . }}-gossip-ring
  ruler:
    alertmanager_url: dnssrvnoa+http://_http-metrics._tcp.{{ template "mimir.fullname" . }}-alertmanager-headless.{{ .Release.Namespace }}.svc.{{ .Values.global.clusterDomain }}/alertmanager
    enable_api: false
    rule_path: /data
  compactor:
    data_dir: "/data"
    block_ranges: [1h0m0s,4h0m0s,12h0m0s]
  runtime_config:
    file: /var/{{ include "mimir.name" . }}/runtime.yaml
  ingester_client:
    grpc_client_config:
      max_send_msg_size: 67108864
      # max_recv_msg_size: 104857600
      # max_send_msg_size: 104857600
  server:
    # grpc_server_max_recv_msg_size: 104857600
    # grpc_server_max_send_msg_size: 104857600
    # grpc_server_max_concurrent_streams: 1000
    log_level: warn
  alertmanager_storage:
    backend: azure
    azure:
      container_name: mimir-ruler
      account_name: ${STORAGE_ACCOUNT_NAME}
      account_key: ${STORAGE_ACCOUNT_KEY}
  blocks_storage:
    backend: azure
    tsdb:
      dir: /data/tsdb
      retention_period: 2h
      block_ranges_period: [30m,1h,2h]
    bucket_store:
      sync_dir: /data/tsdb-sync
      tenant_sync_concurrency: 20
      block_sync_concurrency: 40
      sync_interval: 10m
      consistency_delay: 10m
      {{- if index .Values "chunks-cache" "enabled" }}
      chunks_cache:
        backend: memcached
        memcached:
          addresses: {{ include "mimir.chunksCacheAddress" . }}
          max_item_size: {{ mul (index .Values "chunks-cache").maxItemMemory 1024 1024 }}
          timeout: 450ms
      {{- end }}
      {{- if index .Values "index-cache" "enabled" }}
      index_cache:
        backend: memcached
        memcached:
          addresses: {{ include "mimir.indexCacheAddress" . }}
          max_item_size: {{ mul (index .Values "index-cache").maxItemMemory 1024 1024 }}
      {{- end }}
      {{- if index .Values "metadata-cache" "enabled" }}
      metadata_cache:
        backend: memcached
        memcached:
          addresses: {{ include "mimir.metadataCacheAddress" . }}
          max_item_size: {{ mul (index .Values "metadata-cache").maxItemMemory 1024 1024 }}
      {{- end }}
    azure:
      container_name: mimir-tsdb
      account_name: ${STORAGE_ACCOUNT_NAME}
      account_key: ${STORAGE_ACCOUNT_KEY}
  ruler_storage:
    backend: azure
    azure:
      container_name: mimir-ruler
      account_name: ${STORAGE_ACCOUNT_NAME}
      account_key: ${STORAGE_ACCOUNT_KEY}
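To double-check what the chart actually renders from these values, the manifests can be templated locally. This is a minimal sketch: it assumes the chart repo is added under the name "grafana" and that the overrides above live in a hypothetical custom-values.yaml file; the release name "mimir" matches the pod names further below.

helm repo add grafana https://grafana.github.io/helm-charts && helm repo update
# render the release locally without installing anything
helm template mimir grafana/mimir-distributed --version 3.0.0 \
  -n monitoring -f custom-values.yaml > rendered.yaml
# inspect rendered.yaml for the generated Mimir config; a running pod also serves
# its effective configuration on its HTTP /config endpoint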
Ingester pod events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 52m default-scheduler Successfully assigned monitoring/mimir-ingester-0 to aks-mimir-11018964-vmss000005
Warning Unhealthy 50m kubelet Readiness probe failed: Get "http://10.244.4.252:8080/ready": dial tcp 10.244.4.252:8080: connect: connection refused
Normal Pulled 21m (x3 over 52m) kubelet Container image "grafana/mimir:2.2.0" already present on machine
Normal Created 21m (x3 over 52m) kubelet Created container ingester
Normal Started 21m (x3 over 52m) kubelet Started container ingester
Warning Unhealthy 19m (x21 over 51m) kubelet Readiness probe failed: HTTP probe failed with statuscode: 503
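When the probe returns 503 while the pod is still running, the /ready endpoint shown in the events above can be queried directly; when the ingester is not ready, the response body contains a short reason. A sketch, assuming local kubectl access to the cluster:

kubectl -n monitoring port-forward pod/mimir-ingester-0 8080:8080 &
sleep 2
curl -s http://localhost:8080/ready                        # body explains the 503 when not ready
curl -s http://localhost:8080/ingester/ring | head -n 20   # ring status page: state and heartbeat of each ingester
kill %1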
Pod resource consumption (pod, container, CPU, memory):
mimir-chunks-cache-0 exporter 1m 16Mi
mimir-chunks-cache-0 memcached 1m 2Mi
mimir-compactor-0 compactor 23m 142Mi
mimir-compactor-1 compactor 24m 131Mi
mimir-compactor-2 compactor 22m 103Mi
mimir-distributor-77456b7868-6tktr distributor 232m 260Mi
mimir-distributor-77456b7868-dgjl4 distributor 208m 237Mi
mimir-distributor-77456b7868-g6v56 distributor 245m 394Mi
mimir-distributor-77456b7868-pd89r distributor 229m 283Mi
mimir-index-cache-0 exporter 1m 15Mi
mimir-index-cache-0 memcached 1m 5Mi
mimir-index-cache-1 exporter 1m 19Mi
mimir-index-cache-1 memcached 1m 5Mi
mimir-ingester-0 ingester 518m 11097Mi
mimir-ingester-1 ingester 537m 14980Mi
mimir-ingester-2 ingester 474m 11012Mi
mimir-ingester-3 ingester 622m 14081Mi
mimir-ingester-4 ingester 202m 14158Mi
mimir-metadata-cache-0 exporter 1m 16Mi
mimir-metadata-cache-0 memcached 1m 20Mi
mimir-nginx-8bf79cb5d-dczdr nginx 24m 20Mi
mimir-nginx-8bf79cb5d-qxd5h nginx 14m 19Mi
mimir-overrides-exporter-75599ffcd8-4lhm7 overrides-exporter 3m 31Mi
mimir-overrides-exporter-75599ffcd8-7jdpp overrides-exporter 6m 26Mi
mimir-overrides-exporter-75599ffcd8-gh9ch overrides-exporter 5m 26Mi
mimir-querier-bdff59f6d-h5jhg querier 24m 97Mi
mimir-querier-bdff59f6d-kl2fd querier 31m 101Mi
mimir-querier-bdff59f6d-q4mhr querier 24m 94Mi
mimir-query-frontend-684465c696-gtcj7 query-frontend 18m 42Mi
mimir-query-frontend-684465c696-jsqzk query-frontend 4m 92Mi
mimir-query-frontend-684465c696-z2pdx query-frontend 8m 66Mi
mimir-results-cache-0 exporter 1m 15Mi
mimir-results-cache-0 memcached 1m 2Mi
mimir-store-gateway-0 store-gateway 26m 48Mi
mimir-store-gateway-1 store-gateway 26m 48Mi
mimir-store-gateway-2 store-gateway 25m 49Mi
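Given that each ingester sits at roughly 11-15 GiB, it is worth comparing those numbers against the memory request/limit configured on the ingester container, since memory pressure against the limit would produce exactly this restart/503 pattern via an OOM kill. A quick check (the container name "ingester" comes from the events above):

kubectl -n monitoring get pod mimir-ingester-1 \
  -o jsonpath='{.spec.containers[?(@.name=="ingester")].resources}{"\n"}'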
PV/PVC status: OK
CPU usage per node (node-exporter):
{instance="1:9100"} | 86.02500000011382
{instance="2:9100"} | 86.40000000007501
{instance="3:9100"} | 89.17777777791747
{instance="4:9100"} | 81.83888888891993
{instance="5:9100"} | 87.47222222245504
{instance="6:9100"} | 78.9333333333747
{instance="7:9100"} | 90.17916666654249
{instance="8:9100"} | 88.7888888888766
{instance="9:9100"} | 91.59444444450652
{instance="10:9100"} | 83.88888888888889
Remaining memory per node (node-exporter):
{container="node-exporter", endpoint="http-metrics", instance="1:9100", job="node-exporter", namespace="monitoring", pod="prom-operator-prometheus-node-exporter-n5zpp", service="prom-operator-prometheus-node-exporter"}
48.441615756528584
{container="node-exporter", endpoint="http-metrics", instance="2:9100", job="node-exporter", namespace="monitoring", pod="prom-operator-prometheus-node-exporter-54kwd", service="prom-operator-prometheus-node-exporter"}
52.32184776613751
{container="node-exporter", endpoint="http-metrics", instance="3:9100", job="node-exporter", namespace="monitoring", pod="prom-operator-prometheus-node-exporter-cvff8", service="prom-operator-prometheus-node-exporter"}
52.46110608179796
{container="node-exporter", endpoint="http-metrics", instance="4:9100", job="node-exporter", namespace="monitoring", pod="prom-operator-prometheus-node-exporter-jb7vs", service="prom-operator-prometheus-node-exporter"}
60.33984781513333
{container="node-exporter", endpoint="http-metrics", instance="5:9100", job="node-exporter", namespace="monitoring", pod="prom-operator-prometheus-node-exporter-2qfrv", service="prom-operator-prometheus-node-exporter"}
78.05270715576317
{container="node-exporter", endpoint="http-metrics", instance="6:9100", job="node-exporter", namespace="monitoring", pod="prom-operator-prometheus-node-exporter-9lhj2", service="prom-operator-prometheus-node-exporter"}
65.21640043871795
{container="node-exporter", endpoint="http-metrics", instance="7:9100", job="node-exporter", namespace="monitoring", pod="prom-operator-prometheus-node-exporter-96zpn", service="prom-operator-prometheus-node-exporter"}
81.3799639519442
{container="node-exporter", endpoint="http-metrics", instance="8:9100", job="node-exporter", namespace="monitoring", pod="prom-operator-prometheus-node-exporter-tz5zz", service="prom-operator-prometheus-node-exporter"}
80.22284979832604
{container="node-exporter", endpoint="http-metrics", instance="9:9100", job="node-exporter", namespace="monitoring", pod="prom-operator-prometheus-node-exporter-95q5j", service="prom-operator-prometheus-node-exporter"}
85.03505717607645
{container="node-exporter", endpoint="http-metrics", instance="10:9100", job="node-exporter", namespace="monitoring", pod="prom-operator-prometheus-node-exporter-mg9gt", service="prom-operator-prometheus-node-exporter"}
24.80914628375658
Hi,
Assuming ingester 0 crashed: kubectl describe pod mimir-ingester-0 should have a section called Last State that tells you the reason, e.g.:
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Thu, 11 Aug 2022 14:16:00 +0200
Finished: Fri, 12 Aug 2022 19:20:56 +0200
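To see at a glance which ingesters have restarted and why, something like this works (the label selector is an assumption based on the labels the mimir-distributed chart applies):

kubectl -n monitoring get pods -l app.kubernetes.io/component=ingester \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'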
Also, kubectl logs -p mimir-ingester-0 gives you the pod's previous logs (from before the restart).
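For all five ingesters at once, a small loop over the previous logs can surface the last errors or a panic before each restart (sketch; namespace and container name taken from the events above):

for i in 0 1 2 3 4; do
  echo "== mimir-ingester-$i =="
  kubectl -n monitoring logs -p mimir-ingester-$i -c ingester 2>/dev/null | grep -Ei 'panic|level=error' | tail -n 20
done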
In Grafana Cloud we run meta-monitoring, i.e. we collect metrics and logs about Mimir itself; see https://grafana.com/docs/mimir/latest/operators-guide/monitoring-grafana-mimir/collecting-metrics-and-logs/ for how to set this up. That gives you metrics and logs to check the state of the system before the crash.