
When a querier restarts, it always gets the error `msg="error querying storage" err="context canceled"` on the first query

Open lackhoa opened this issue 3 years ago • 18 comments

Describe the bug I run Loki on Kubernetes. When a querier pod restarts, it always takes extremely long to process even the simplest query (resulting in a 502 gateway timeout). I observe this happening regardless of the number of querier pods or resource limits.

If the queriers have been running for a while, then there is no big issue. However, I often have to restart Loki to apply new configuration (there is no hot reload), so this is a concern.

To Reproduce Steps to reproduce the behavior:

  1. Start Loki 2.4.1.
  2. Restart one querier pod.
  3. Run any query: the content of the query doesn't matter, even with the time range reduced to 5m.

Expected behavior The query should return quickly. Instead, here are some logs:

loki-querier-0 loki level=info ts=2021-11-17T18:47:26.295571342Z caller=memberlist_client.go:506 msg="joined memberlist cluster" reached_nodes=2
loki-querier-0 loki level=info ts=2021-11-17T18:47:30.961963639Z caller=table_manager.go:181 msg="downloading all files for table index_18948"
loki-querier-0 loki level=error ts=2021-11-17T18:48:30.956499851Z caller=frontend_processor.go:69 msg="error processing requests" address=100.64.57.240:9095 err="rpc error: code = Unknown desc = context canceled"
loki-querier-0 loki level=error ts=2021-11-17T18:48:30.956751555Z caller=chunk_store.go:524 org_id=fake msg="error querying storage" err="context canceled"
loki-querier-0 loki level=info ts=2021-11-17T18:48:30.957981931Z caller=metrics.go:92 org_id=fake latency=slow query="{log_type=\"orch\"} |= \"error\"" query_type=filter range_type=range length=5m0s step=1s duration=59.997894785s status=400 limit=30 returned_lines=0 throughput=253B total_bytes=15kB
loki-querier-0 loki level=error ts=2021-11-17T18:48:30.95859867Z caller=frontend_processor.go:145 msg="error processing requests" err=EOF
loki-querier-0 loki level=info ts=2021-11-17T18:48:41.276386955Z caller=pool.go:171 msg="removing stale client" addr=100.64.57.62:9095
loki-querier-0 loki level=error ts=2021-11-17T18:49:30.9608462Z caller=frontend_processor.go:69 msg="error processing requests" address=100.64.64.139:9095 err="rpc error: code = Unknown desc = context canceled"
loki-querier-0 loki level=error ts=2021-11-17T18:49:30.960897854Z caller=chunk_store.go:524 org_id=fake msg="error querying storage" err="context canceled"
loki-querier-0 loki level=info ts=2021-11-17T18:49:30.961277887Z caller=metrics.go:92 org_id=fake latency=slow query="{log_type=\"orch\"} |= \"error\"" query_type=filter range_type=range length=5m0s step=1s duration=59.99984329s status=400 limit=30 returned_lines=0 throughput=0B total_bytes=0B
loki-querier-0 loki level=error ts=2021-11-17T18:49:30.961486822Z caller=frontend_processor.go:145 msg="error processing requests" err=EOF
loki-querier-0 loki level=info ts=2021-11-17T18:49:35.223591683Z caller=util.go:109 msg="downloaded file loki-ingester-0-1637163268267231335-1637172900.gz from table index_18948"
loki-querier-0 loki level=info ts=2021-11-17T18:49:35.2292538Z caller=util.go:109 msg="downloaded file loki-ingester-1-1637162924597410047-1637172900.gz from table index_18948"
loki-querier-0 loki level=info ts=2021-11-17T18:49:35.244990554Z caller=util.go:109 msg="downloaded file compactor-1637173543.gz from table index_18948"

It seems from the logs that downloading all files for table index_18948 takes too long, so the queries all time out. But those files are tiny and should download in under a second.

# aws s3 ls $BUCKET/index/index_18948/
2021-11-17 20:25:44      72085 compactor-1637173543.gz
2021-11-17 20:30:00       1662 loki-ingester-0-1637163268267231335-1637172900.gz
2021-11-17 21:03:49       1073 loki-ingester-0-1637174924055016748-1637175076.gz
2021-11-17 20:31:45       1068 loki-ingester-1-1637162924597410047-1637172900.gz
2021-11-17 21:01:49       1650 loki-ingester-1-1637174853937532883-1637175126.gz
2021-11-17 21:01:49       2218 loki-ingester-1-1637174853937532883-1637175600.gz

After the querier has been running for a while, the downloads complete very quickly:

loki-querier-0 loki level=info ts=2021-11-18T07:10:12.219373707Z caller=table_manager.go:181 msg="downloading all files for table index_18946"
loki-querier-0 loki level=info ts=2021-11-18T07:10:12.332443982Z caller=util.go:109 msg="downloaded file loki-ingester-1-1636987151194890520-1637023500.gz from table index_18946"
loki-querier-0 loki level=info ts=2021-11-18T07:10:12.368661753Z caller=util.go:109 msg="downloaded file loki-ingester-0-1636987229036284040-1637023500.gz from table index_18946"
loki-querier-0 loki level=info ts=2021-11-18T07:10:12.407732991Z caller=util.go:109 msg="downloaded file compactor-1637023954.gz from table index_18946"

Here are the sizes of those files:

# aws s3 ls $BUCKET/index/index_18946/
2021-11-16 02:52:36     102126 compactor-1637023954.gz
2021-11-16 03:01:30       1070 loki-ingester-0-1636987229036284040-1637023500.gz
2021-11-16 03:01:12        959 loki-ingester-1-1636987151194890520-1637023500.gz

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: Helm, based on the loki-distributed chart.

Screenshots, Promtail config, or terminal output Here's the logcli output:

# logcli query --stats --since 5m '{log_type="orch"} |= "error"'
...
Error response from server: <html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.19.10</center>
</body>
</html>
 (<nil>) attempts remaining: 0
Query failed: Run out of attempts while querying the server

Here's my config:

auth_enabled: false

server:
  http_listen_port: 3100
  log_level: info

distributor:
  ring:
    kvstore:
      store: memberlist

memberlist:
  join_members:
    - loki-memberlist

ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 1
  chunk_encoding: snappy
  chunk_retain_period: 1m
  # Disable deprecated handoff feature, use wal instead
  max_transfer_retries: 0
  wal:
    enabled: true
    dir: /var/loki-wal

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  max_cache_freshness_per_query: 10m
  max_query_parallelism: 2

schema_config:
  configs:
  - from: "2021-10-08"
    store: boltdb-shipper
    object_store: s3
    schema: v11
    index:
      prefix: index_
      period: 24h

query_range:
  align_queries_with_step: true
  max_retries: 5
  split_queries_by_interval: 15m

frontend_worker:
  frontend_address: loki-query-frontend:9095
  parallelism: 2

frontend:
  log_queries_longer_than: 5s
  compress_responses: true
  tail_proxy_url: http://loki-querier:3100

lackhoa avatar Nov 17 '21 19:11 lackhoa

We tried adding storage_config.aws.http_config.response_header_timeout: 5s, which eliminates the issue when queriers restart. However, the issue still happens every once in a while. It's hard to say exactly when, but I observe it when querying for the first time after a long "hiatus". If the issue happened randomly, we could chalk it up to S3 hiccups. The reason I don't think so is that nothing ever goes wrong during normal usage: I can query randomly from within a month and Loki always responds within 5s.
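For anyone landing here, this is roughly where that setting sits in the Loki config. A minimal sketch, assuming S3 via the aws storage client; the 5s value is copied from the comment above, not a recommendation:

storage_config:
  aws:
    http_config:
      # Fail an S3 request if response headers haven't arrived within 5s,
      # instead of waiting for the query context (~60s in the logs above)
      # to be canceled.
      response_header_timeout: 5s

With boltdb-shipper this applies to both index and chunk requests against S3; the index-file downloads are what stall the first query after a restart in the logs above.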

lackhoa avatar Nov 29 '21 08:11 lackhoa

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task, our sincere apologies if you find yourself at the mercy of the stalebot.

stale[bot] avatar Jan 09 '22 05:01 stale[bot]

Update on this: We have since found and fixed other issues, our Loki has been working stably for a while now. Even so, I'm keeping this because we still haven't found the root cause.

lackhoa avatar Jan 10 '22 13:01 lackhoa

Sorry, misclicked 😄

lackhoa avatar Jan 10 '22 13:01 lackhoa

Update on this: We have since found and fixed other issues, our Loki has been working stably for a while now. Even so, I'm keeping this because we still haven't found the root cause.

What was your fix for this error finally?

martensson avatar Jan 21 '22 15:01 martensson

Update on this: We have since found and fixed other issues, our Loki has been working stably for a while now. Even so, I'm keeping this because we still haven't found the root cause.

What was your fix for this error finally?

This config: storage_config.aws.http_config.response_header_timeout: 5s

lackhoa avatar Jan 21 '22 15:01 lackhoa

How did you apply the same configuration in Kubernetes? Where exactly did you make this change? I am using Helm and I can't figure out where I need to update it. Please help.

sumit480 avatar Feb 23 '22 07:02 sumit480

@sumit480 I'm using Helm, and I make the changes in the ConfigMap created in the namespace.
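For example, instead of editing the ConfigMap by hand, the loki-distributed chart mentioned in this issue can render the setting from values.yaml. A rough sketch, assuming a recent chart version with loki.structuredConfig support (older releases need the whole loki.config string overridden instead); the pods still have to be restarted to pick up the new ConfigMap:

# values.yaml for the loki-distributed chart (hypothetical sketch)
loki:
  structuredConfig:
    storage_config:
      aws:
        http_config:
          response_header_timeout: 5s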

manu756 avatar Apr 02 '22 19:04 manu756

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

stale[bot] avatar May 02 '22 21:05 stale[bot]

I have the same issue.

wkshare avatar Jun 08 '22 02:06 wkshare

We tried adding storage_config.aws.http_config.response_header_timeout: 5s, which eliminates the issue when queriers restart. However, the issue still happens every once in a while. It's hard to say exactly when, but I observe it when querying for the first time after a long "hiatus". If the issue happened randomly, we could chalk it up to S3 hiccups. The reason I don't think so is that nothing ever goes wrong during normal usage: I can query randomly from within a month and Loki always responds within 5s.

I bumped it to 15s, but I still run into this issue quite quickly.

Using the following config:

config:
  schema_config:
    configs:
    - from: 2020-10-24
      store: aws
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h
        tags: {}
  limits_config:
    enforce_metric_name: false
    reject_old_samples: true
    reject_old_samples_max_age: 720h
  chunk_store_config:
    max_look_back_period: 672h
  table_manager:
    chunk_tables_provisioning:
      inactive_read_throughput: 1
      inactive_write_throughput: 1
      provisioned_read_throughput: 5
      provisioned_write_throughput: 5
    index_tables_provisioning:
      inactive_read_throughput: 1
      inactive_write_throughput: 1
      provisioned_read_throughput: 5
      provisioned_write_throughput: 5
    retention_deletes_enabled: true
    retention_period: 672h
  storage_config:
    aws:
      s3: s3://region/[...]
      http_config: 
        response_header_timeout: 15s
      dynamodb:
        dynamodb_url: dynamodb://region

sbkg0002 avatar Jul 06 '22 13:07 sbkg0002

Hi, any updates?

LinTechSo avatar Jul 12 '22 13:07 LinTechSo

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

stale[bot] avatar Aug 13 '22 12:08 stale[bot]

I have the same issue, any updates?

R-Studio avatar Sep 06 '22 14:09 R-Studio

The same issue occurs on version 2.6.1. Any updates?

PRIHLOP avatar Sep 15 '22 14:09 PRIHLOP

I have the same issue, any updates?

effy-coding avatar Sep 16 '22 05:09 effy-coding

A little clarification.

Setting max_query_parallelism to 2 in limits_config is a bad idea:

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  max_cache_freshness_per_query: 10m
  max_query_parallelism: 2

By default it is 32; parallelism is central to how Loki processes queries, and higher values yield faster processing.
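Concretely, leaving it at (or near) the default would look like the sketch below; the right ceiling also depends on how many queriers you run and on frontend_worker.parallelism, so treat 32 as a starting point rather than a tuned value:

limits_config:
  # Default is 32. Sub-queries produced by split_queries_by_interval run
  # concurrently up to this cap per query, so a value of 2 serializes
  # most of the work.
  max_query_parallelism: 32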

PRIHLOP avatar Sep 19 '22 10:09 PRIHLOP

I have experienced better performance using max_query_parallelism: 32 than with 256, for example. https://github.com/grafana/loki/blob/5e134f3b6d33b2f9b485cde7a70f160704eae5ba/production/ksonnet/loki/config.libsonnet#L212-L214

alexandre1984rj avatar Sep 21 '22 23:09 alexandre1984rj

Hi, I am facing the same error as well. Any updates?

danielq987 avatar Sep 26 '22 14:09 danielq987

Has anyone been able to figure out this issue?

manish-nference avatar Dec 08 '22 07:12 manish-nference

Facing the same issue, any update on this?

chavan-suraj avatar Jan 25 '23 08:01 chavan-suraj

any update on this?

autokilla47 avatar Mar 07 '23 14:03 autokilla47

Same issue for me

ahmedzidan avatar Mar 08 '23 05:03 ahmedzidan

Have you tried tweaking storage_config.aws.http_config.response_header_timeout? I think I had it at around 15s. It worked for me some time back.

manish-nference avatar Mar 08 '23 05:03 manish-nference

@manish-nference is that also applicable for gcs?

ahmedzidan avatar Mar 08 '23 05:03 ahmedzidan

@ahmedzidan: GCS has a different variable for the same thing. In the gcs config, request_timeout should work.
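If it helps, a rough sketch of where that option goes; the 5s value just mirrors the S3 example above, so check the gcs_storage_config reference for your Loki version:

storage_config:
  gcs:
    # Upper bound on a single GCS request; the default of 0 means no timeout.
    request_timeout: 5s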

manish-nference avatar Mar 08 '23 05:03 manish-nference

I have this value set to 300s but still have the same issue

ahmedzidan avatar Mar 08 '23 05:03 ahmedzidan

Hi, I am not on a cloud provider (aws/gcp/...) but on self-hosted Kubernetes. Is there a different "timeout" to set?

Thanks

yellowhat avatar Mar 28 '23 18:03 yellowhat

Same issue with Loki running on EKS in AWS. We get 504s at times when querying data over a large interval.

At the same time, we have seen the querier and memcached pods restarting.

Is there any fix for this?

ganesh-kumarmt avatar Mar 29 '23 19:03 ganesh-kumarmt

Hello @ahmedzidan

I have set request_timeout but it did not work. Can anyone help me resolve the issue?

Let me describe the issue.

In the Grafana dashboard, when we click on "labels filter" we get the error "failed to call resources". I have attached a screen capture. Whenever I face the issue, the loki-querier pods log the following:

level=error ts=2023-09-23T15:31:44.947636692Z caller=series_index_store.go:527 org_id=fake msg="error querying storage" err="context canceled"
ts=2023-09-23T15:31:44.947715472Z caller=spanlogger.go:80 user=fake method=SeriesStore.LabelNamesForMetricName level=error msg=lookupLabelNamesBySeries err="context canceled"
ts=2023-09-23T15:31:44.947800948Z caller=spanlogger.go:80 user=fake method=query.Label level=info org_id=fake latency=slow query_type=labels length=1h0m0s duration=30.498016859s status=499 label= throughput=0B total_bytes=0B total_entries=0
level=error ts=2023-09-23T15:31:44.94793431Z caller=frontend_processor.go:145 msg="error processing requests" err=EOF

Also, we have checked the bucket quota limits and the CPU and memory limits; all look fine when the issue occurs.

During off-peak hours we can check the logs in the Grafana dashboard without any issue; we only face the issue during peak hours.

(screenshot attached: MicrosoftTeams-image)

Sagarbud avatar Sep 25 '23 10:09 Sagarbud