autoforget_unhealthy isn't working as expected for Ingesters.
Describe the bug: I have enabled autoforget_unhealthy for the ingesters. When an ingester pod starts, it logs that autoforget is enabled:
level=info ts=2022-06-16T02:27:15.182820969Z caller=ingester.go:308 msg="autoforget is enabled and will remove unhealthy instances from the ring after 1m0s with no heartbeat"
It then warns that an instance in the ring has a problem and asks me to clean it up manually via the /ring endpoint:
level=warn ts=2022-06-16T02:27:45.421965683Z caller=lifecycler.go:245 msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance past heartbeat timeout"
To Reproduce: Restart the ingesters after setting the autoforget_unhealthy flag to true.
Expected behavior: The unhealthy ingesters should be removed from the ring automatically.
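To make the expectation concrete, here is a minimal Python sketch of the semantics the startup log message suggests: any instance with no heartbeat for longer than the timeout is dropped from the ring. This is not Loki's implementation. The function names, the instance names, and the dictionary used to model the ring are all hypothetical and only illustrate the behavior I expected.

```python
from datetime import datetime, timedelta, timezone


def unhealthy_instances(last_heartbeats, now, timeout):
    """Return the IDs of instances whose last heartbeat is older than `timeout`."""
    return sorted(
        instance_id
        for instance_id, last_seen in last_heartbeats.items()
        if now - last_seen > timeout
    )


def autoforget(last_heartbeats, now, timeout):
    """Return a copy of the ring without the instances past the heartbeat timeout."""
    stale = set(unhealthy_instances(last_heartbeats, now, timeout))
    return {k: v for k, v in last_heartbeats.items() if k not in stale}


# Hypothetical ring state: one fresh instance and one that stopped heartbeating.
now = datetime(2022, 6, 16, 2, 27, 45, tzinfo=timezone.utc)
ring = {
    "ingester-0": now - timedelta(seconds=10),  # assumed healthy
    "ingester-1": now - timedelta(minutes=5),   # assumed stale
}
timeout = timedelta(minutes=1)  # matches the "1m0s" in the log line

print(unhealthy_instances(ring, now, timeout))
print(sorted(autoforget(ring, now, timeout)))
```

In this model the stale entry would disappear without any manual action on the /ring page, which is what I expected from enabling the flag.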
- Infrastructure: Loki v2.5.0 on GKE
- Deployment tool: Helm
apiVersion: v1
config.yaml: |-
"auth_enabled": false
"compaction_interval": "10m"
"shared_store": "gcs"
"working_directory": "/data/loki/compactor"
"retention_enabled": true
"store": "memberlist"
"compress_responses": false
"max_outstanding_per_tenant": 2048
"tail_proxy_url": "http://querier.logs.svc.cluster.local:3100"
"frontend_address": "queryfrontend.logs.svc.cluster.local:9095"
"max_send_msg_size": 1104857600
"parallelism": 256
"chunk_block_size": 262144
"chunk_target_size": 1536000
"chunk_encoding": "snappy"
"chunk_idle_period": "30m"
"autoforget_unhealthy": true
"heartbeat_period": "1m"
- "eth0"
"num_tokens": 512
"store": "memberlist"
"heartbeat_timeout": "1m"
"replication_factor": 1
"max_transfer_retries": 0
"enabled": true
"dir": "data"
"max_recv_msg_size": 1104857600
"max_send_msg_size": 1104857600
"backoff_on_ratelimits": true
"min_period": "1s"
"max_period": "32s"
"max_retries": 10
"remote_timeout": "10s"
"enforce_metric_name": false
"ingestion_burst_size_mb": 512
"ingestion_rate_mb": 256
"bind_port": 7946
- "gossip-ring.logs.svc.cluster.local:7946"
"max_join_backoff": "1m"
"max_join_retries": 10
"min_join_backoff": "1s"
"timeout": "15m"
"extra_query_delay": "0s"
"align_queries_with_step": true
"cache_results": true
"max_retries": 0
"parallelise_shardable_queries": true
"expiration": "10800s"
"batch_size": 1024
"parallelism": 300
"host": "memcached-frontend.logs.svc"
"service": "memcached"
- "from": "2020-10-01"
"period": "24h"
"prefix": "loki_index_"
"object_store": "gcs"
"schema": "v11"
"store": "boltdb-shipper"
"graceful_shutdown_timeout": "5s"
"grpc_server_max_concurrent_streams": 1000
"grpc_server_max_recv_msg_size": 1104857600
"grpc_server_max_send_msg_size": 1104857600
"http_listen_port": 3100
"http_server_idle_timeout": "3m"
"http_server_write_timeout": "1m"
"http_server_read_timeout": "15m"
"active_index_directory": "/data/loki/index"
"cache_location": "/data/loki/index_cache"
"cache_ttl": "24h"
"query_ready_num_days": 5
"resync_interval": "5m"
"shared_store": "gcs"
"server_address": "dns:///indexgateway:9095"
"bucket_name": cdl-logs
"expiration": "43200s"
"batch_size": 3096
"parallelism": 256
"host": "memcached-index-queries.logs.svc"
"service": "memcached"
"expiration": "3600s"
"batch_size": 3096
"parallelism": 256
"host": "memcached-chunks.cdl-logs.svc"
"service": "memcached"
overrides.yaml: '{}'
kind: ConfigMap
meta.helm.sh/release-name: loki
app.kubernetes.io/instance: loki
app.kubernetes.io/managed-by: Helm
name: loki