When the querier restarts, it always gets the error `msg="error querying storage" err="context canceled"` on the first query
Describe the bug
I run Loki on Kubernetes. When a querier pod restarts, it always takes extremely long to process even the simplest query (resulting in a 504 gateway timeout). I observe this issue regardless of the number of querier pods or their resource limits.
If the queriers have been running for a while, there is no problem. However, I often have to restart Loki to apply new configuration (there is no hot reload), so this is a concern.
To Reproduce
Steps to reproduce the behavior:
- Start Loki 2.4.1.
- Restart one querier pod.
- Run a query: the content of the query doesn't matter, even with the time range reduced to 5m.
Expected behavior
Here are some logs:
loki-querier-0 loki level=info ts=2021-11-17T18:47:26.295571342Z caller=memberlist_client.go:506 msg="joined memberlist cluster" reached_nodes=2
loki-querier-0 loki level=info ts=2021-11-17T18:47:30.961963639Z caller=table_manager.go:181 msg="downloading all files for table index_18948"
loki-querier-0 loki level=error ts=2021-11-17T18:48:30.956499851Z caller=frontend_processor.go:69 msg="error processing requests" address=100.64.57.240:9095 err="rpc error: code = Unknown desc = context canceled"
loki-querier-0 loki level=error ts=2021-11-17T18:48:30.956751555Z caller=chunk_store.go:524 org_id=fake msg="error querying storage" err="context canceled"
loki-querier-0 loki level=info ts=2021-11-17T18:48:30.957981931Z caller=metrics.go:92 org_id=fake latency=slow query="{log_type=\"orch\"} |= \"error\"" query_type=filter range_type=range length=5m0s step=1s duration=59.997894785s status=400 limit=30 returned_lines=0 throughput=253B total_bytes=15kB
loki-querier-0 loki level=error ts=2021-11-17T18:48:30.95859867Z caller=frontend_processor.go:145 msg="error processing requests" err=EOF
loki-querier-0 loki level=info ts=2021-11-17T18:48:41.276386955Z caller=pool.go:171 msg="removing stale client" addr=100.64.57.62:9095
loki-querier-0 loki level=error ts=2021-11-17T18:49:30.9608462Z caller=frontend_processor.go:69 msg="error processing requests" address=100.64.64.139:9095 err="rpc error: code = Unknown desc = context canceled"
loki-querier-0 loki level=error ts=2021-11-17T18:49:30.960897854Z caller=chunk_store.go:524 org_id=fake msg="error querying storage" err="context canceled"
loki-querier-0 loki level=info ts=2021-11-17T18:49:30.961277887Z caller=metrics.go:92 org_id=fake latency=slow query="{log_type=\"orch\"} |= \"error\"" query_type=filter range_type=range length=5m0s step=1s duration=59.99984329s status=400 limit=30 returned_lines=0 throughput=0B total_bytes=0B
loki-querier-0 loki level=error ts=2021-11-17T18:49:30.961486822Z caller=frontend_processor.go:145 msg="error processing requests" err=EOF
loki-querier-0 loki level=info ts=2021-11-17T18:49:35.223591683Z caller=util.go:109 msg="downloaded file loki-ingester-0-1637163268267231335-1637172900.gz from table index_18948"
loki-querier-0 loki level=info ts=2021-11-17T18:49:35.2292538Z caller=util.go:109 msg="downloaded file loki-ingester-1-1637162924597410047-1637172900.gz from table index_18948"
loki-querier-0 loki level=info ts=2021-11-17T18:49:35.244990554Z caller=util.go:109 msg="downloaded file compactor-1637173543.gz from table index_18948"
From the logs, it seems that downloading all files for table index_18948 takes too long, so the queries all time out. But those files are tiny and should download in under a second:
# aws s3 ls $BUCKET/index/index_18948/
2021-11-17 20:25:44 72085 compactor-1637173543.gz
2021-11-17 20:30:00 1662 loki-ingester-0-1637163268267231335-1637172900.gz
2021-11-17 21:03:49 1073 loki-ingester-0-1637174924055016748-1637175076.gz
2021-11-17 20:31:45 1068 loki-ingester-1-1637162924597410047-1637172900.gz
2021-11-17 21:01:49 1650 loki-ingester-1-1637174853937532883-1637175126.gz
2021-11-17 21:01:49 2218 loki-ingester-1-1637174853937532883-1637175600.gz
After the querier has been running for a while, the downloads complete very quickly:
loki-querier-0 loki level=info ts=2021-11-18T07:10:12.219373707Z caller=table_manager.go:181 msg="downloading all files for table index_18946"
loki-querier-0 loki level=info ts=2021-11-18T07:10:12.332443982Z caller=util.go:109 msg="downloaded file loki-ingester-1-1636987151194890520-1637023500.gz from table index_18946"
loki-querier-0 loki level=info ts=2021-11-18T07:10:12.368661753Z caller=util.go:109 msg="downloaded file loki-ingester-0-1636987229036284040-1637023500.gz from table index_18946"
loki-querier-0 loki level=info ts=2021-11-18T07:10:12.407732991Z caller=util.go:109 msg="downloaded file compactor-1637023954.gz from table index_18946"
Here are the sizes of those files:
# aws s3 ls $BUCKET/index/index_18946/
2021-11-16 02:52:36 102126 compactor-1637023954.gz
2021-11-16 03:01:30 1070 loki-ingester-0-1636987229036284040-1637023500.gz
2021-11-16 03:01:12 959 loki-ingester-1-1636987151194890520-1637023500.gz
Environment:
- Infrastructure: Kubernetes
- Deployment tool: Helm, based on the loki-distributed chart.
Screenshots, Promtail config, or terminal output
Here's the logcli output:
# logcli query --stats --since 5m '{log_type="orch"} |= "error"'
...
Error response from server: <html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.19.10</center>
</body>
</html>
(<nil>) attempts remaining: 0
Query failed: Run out of attempts while querying the server
Here's my config:
auth_enabled: false
server:
  http_listen_port: 3100
  log_level: info
distributor:
  ring:
    kvstore:
      store: memberlist
memberlist:
  join_members:
    - loki-memberlist
ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 1
  chunk_encoding: snappy
  chunk_retain_period: 1m
  # Disable deprecated handoff feature, use wal instead
  max_transfer_retries: 0
  wal:
    enabled: true
    dir: /var/loki-wal
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  max_cache_freshness_per_query: 10m
  max_query_parallelism: 2
schema_config:
  configs:
    - from: "2021-10-08"
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h
query_range:
  align_queries_with_step: true
  max_retries: 5
  split_queries_by_interval: 15m
frontend_worker:
  frontend_address: loki-query-frontend:9095
  parallelism: 2
frontend:
  log_queries_longer_than: 5s
  compress_responses: true
  tail_proxy_url: http://loki-querier:3100
We tried adding storage_config.aws.http_config.response_header_timeout: 5s, which eliminates the issue when queriers restart.
However, the issue still happens every once in a while. It's hard to pin down exactly when, but I observe it when querying for the first time after a long "hiatus".
If the issue happened randomly, we could chalk it up to S3 hiccups. The reason I don't think so is that during normal usage nothing ever goes wrong: I can query random ranges from within the past month and Loki always responds within 5s.
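For context, our config above omits the storage_config section; a rough sketch of where that option goes in a boltdb-shipper/S3 setup looks like this (the bucket URL and local paths below are placeholders, not our real values):
storage_config:
  aws:
    # placeholder bucket URL, substitute your own
    s3: s3://<region>/<bucket>
    http_config:
      # fail fast on slow S3 responses instead of waiting out the query timeout
      response_header_timeout: 5s
  boltdb_shipper:
    # illustrative local paths
    active_index_directory: /var/loki/index
    cache_location: /var/loki/boltdb-cache
    shared_store: s3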
Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.
We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.
Stalebots are also emotionless and cruel and can close issues which are still very relevant.
If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.
We regularly sort closed issues that have a stale label by thumbs-up.
We may also:
- Mark issues as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
- Add a keepalive label to silence the stalebot if the issue is very common/popular/important.
We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task, our sincere apologies if you find yourself at the mercy of the stalebot.
Update on this: We have since found and fixed other issues, our Loki has been working stably for a while now. Even so, I'm keeping this because we still haven't found the root cause.
Sorry, misclicked 😄
What was your fix for this error finally?
This config: storage_config.aws.http_config.response_header_timeout: 5s
How do you apply the same configuration in Kubernetes? Where exactly did you make this change? I am using Helm and can't figure out where I need to update it. Please help.
@sumit480 I'm using Helm, and I make the changes in the ConfigMap created in the namespace.
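Roughly, the override looks like this in the chart values (a sketch only: it assumes your chart version renders the whole Loki config from the loki.config value, which is replaced rather than merged, so the fragment has to sit inside your existing full config; the bucket URL is a placeholder):
loki:
  config: |
    # ... keep the rest of your existing Loki config unchanged ...
    storage_config:
      aws:
        s3: s3://<region>/<bucket>   # placeholder
        http_config:
          response_header_timeout: 5s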
I have the same issue.
We tried adding storage_config.aws.http_config.response_header_timeout: 5s, which eliminates the issue when queriers restart. However, the issue still happens every once in a while. It's hard to pin down exactly when, but I observe it when querying for the first time after a long "hiatus". If the issue happened randomly, we could chalk it up to S3 hiccups. The reason I don't think so is that during normal usage nothing ever goes wrong: I can query random ranges from within the past month and Loki always responds within 5s.
I bumped it to 15s, but I still run into this issue quite quickly.
Using the following config:
config:
  schema_config:
    configs:
      - from: 2020-10-24
        store: aws
        object_store: s3
        schema: v11
        index:
          prefix: index_
          period: 24h
          tags: {}
  limits_config:
    enforce_metric_name: false
    reject_old_samples: true
    reject_old_samples_max_age: 720h
  chunk_store_config:
    max_look_back_period: 672h
  table_manager:
    chunk_tables_provisioning:
      inactive_read_throughput: 1
      inactive_write_throughput: 1
      provisioned_read_throughput: 5
      provisioned_write_throughput: 5
    index_tables_provisioning:
      inactive_read_throughput: 1
      inactive_write_throughput: 1
      provisioned_read_throughput: 5
      provisioned_write_throughput: 5
    retention_deletes_enabled: true
    retention_period: 672h
  storage_config:
    aws:
      s3: s3://region/[...]
      http_config:
        response_header_timeout: 15s
      dynamodb:
        dynamodb_url: dynamodb://region
Hi, any updates?
I have the same issue, any updates?
The same issue occurs on version 2.6.1. Any updates?
I have the same issue, any updates?
A little clarification: setting max_query_parallelism to 2 in limits_config is a bad idea:
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  max_cache_freshness_per_query: 10m
  max_query_parallelism: 2
The default is 32; parallelism is the main feature of Loki, and higher values give faster processing.
I have experienced better performance using max_query_parallelism: 32 than with 256, for example.
https://github.com/grafana/loki/blob/5e134f3b6d33b2f9b485cde7a70f160704eae5ba/production/ksonnet/loki/config.libsonnet#L212-L214
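For reference, a minimal sketch of the suggested change, reusing the limits_config quoted above (treat it as an illustration, not a tuned value):
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  max_cache_freshness_per_query: 10m
  # the default is 32; keep it there (or omit the line) rather than capping it at 2
  max_query_parallelism: 32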
Hi, I am facing the same error as well. Any updates?
Has anyone been able to figure out the issue regarding this?
Facing the same issue, any update on this?
Any update on this?
Same issue for me
Have you tried tweaking the value of storage_config.aws.http_config.response_header_timeout? I think it was around 15s. It worked for me some time back.
@manish-nference is that also applicable to GCS?
@ahmedzidan: GCS has a different option for the same purpose. In the GCS config, request_timeout should work.
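A minimal sketch of where that option sits for GCS (the bucket name is a placeholder; check the gcs_storage_config reference for your Loki version):
storage_config:
  gcs:
    bucket_name: <your-bucket>   # placeholder
    # give up on slow GCS requests instead of letting the query hang
    request_timeout: 5s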
I have this value set to 300s but still have the same issue.
Hi, I am not on the cloud (AWS/GCP/...) but on self-hosted Kubernetes. Is there a different "timeout" to set?
Thanks
Same issue with Loki running on EKS in AWS. We sometimes get 504s when querying data over a large interval.
At the same time we have seen the querier and memcached pods restarting.
Do we have any fix for this?
Hello @ahmedzidan, I have set the request_timeout but it did not work. Can anyone help me resolve the issue?
Let me describe the issue.
In the Grafana dashboard, when we click on the "labels filter" we get the error "failed to call resources". I have attached a screen capture of the same. Whenever I face the issue, the loki-querier pods log the following:
level=error ts=2023-09-23T15:31:44.947636692Z caller=series_index_store.go:527 org_id=fake msg="error querying storage" err="context canceled"
ts=2023-09-23T15:31:44.947715472Z caller=spanlogger.go:80 user=fake method=SeriesStore.LabelNamesForMetricName level=error msg=lookupLabelNamesBySeries err="context canceled"
ts=2023-09-23T15:31:44.947800948Z caller=spanlogger.go:80 user=fake method=query.Label level=info org_id=fake latency=slow query_type=labels length=1h0m0s duration=30.498016859s status=499 label= throughput=0B total_bytes=0B total_entries=0
level=error ts=2023-09-23T15:31:44.94793431Z caller=frontend_processor.go:145 msg="error processing requests" err=EOF
We have also checked the bucket quota limits and the CPU and memory limits; all look fine while the errors occur.
In non-peak hours we can check logs in the Grafana dashboard without any issue; we only face the problem during peak hours.