
Query from Grafana to Loki returning 502 when using 7 days range (too much data to parse?)

Open some-random-dude123 opened this issue 2 years ago • 38 comments

Hello everyone

Describe the bug
Loki Version: 2.2.0
Grafana Version: 7.5.6

I'm running a query from Grafana to Loki that looks like this (investigating a neo4j timeout, but that's beside the point): {namespace="neo4j", container_name="neo4j-v1-neo4j"} |~ "timed out" The query works well with a 2-day range, but if I go to a 7-day range I hit a 502 error.


I'm not sure whether I'm facing a resource limitation or something else. Considering it fails at 49s, I suspect this is not a timeout issue (with the load balancer or anything like it). Our Loki machine is a t3.xlarge (4 vCPU, 16 GB of RAM); it does ingest a lot of logs, but the CPU is pretty sleepy and memory is around 27%. Loki is running as a container on it (with no resource limits). Looking at the metrics of the underlying server, it doesn't really work hard while processing the query.

I've found this error in the container log when the query fails:

level=info ts=2021-07-20T14:35:43.738263796Z caller=metrics.go:91 org_id=fake traceID=5cf702b9223409e latency=slow query="{namespace=\"neo4j\", container_name=\"neo4j-v1-neo4j\"} |~ \"timed out\"" query_type=filter range_type=range length=168h0m1s step=5m0s duration=49.731710927s status=200 limit=1000 returned_lines=34 throughput=348MB total_bytes=17GB
2021-07-20 14:35:43.738667 I | http: superfluous response.WriteHeader call from github.com/opentracing-contrib/go-stdlib/nethttp.(*statusCodeTracker).WriteHeader (status-code-tracker.go:17)
level=warn ts=2021-07-20T14:35:43.738720096Z caller=logging.go:60 traceID=5cf702b9223409e msg="GET /loki/api/v1/query_range?direction=BACKWARD&limit=1000&query=%7Bnamespace%3D%22neo4j%22%2C%20container_name%3D%22neo4j-v1-neo4j%22%7D%20%7C~%20%22timed%20out%22&start=1626186893000000000&end=1626791694000000000&step=300 49.73240024s, error: write tcp hidden:hidden->hidden:hidden: i/o timeout ws: false; Accept: application/json, text/plain, */*; Accept-Encoding: gzip, deflate, br; Accept-Language: en-US,en;q=0.9,fr;q=0.8; Cache-Control: no-cache; Pragma: no-cache; Sec-Ch-Ua: \" Not;A Brand\";v=\"99\", \"Google Chrome\";v=\"91\", \"Chromium\";v=\"91\"; Sec-Ch-Ua-Mobile: ?0; Sec-Fetch-Dest: empty; Sec-Fetch-Mode: cors; Sec-Fetch-Site: same-origin; User-Agent: Grafana/7.5.6; X-Amzn-Trace-Id: Root=1-60f6df0d-42c5460655f1981f39875589; X-Forwarded-For: hidden; X-Grafana-Org-Id: 1; "

To Reproduce
Steps to reproduce the behavior: there are no exact reproduction steps; have a Loki instance with a large amount of data and run a query that is too big for it?

Expected behavior
The query should return results properly over a larger time range and not fail.

Environment:

  • Infrastructure: AWS EKS
  • Deployment tool: Kubectl / terraform

Any hints on what I could investigate to resolve this issue? I may end up boosting the machine even though I see no sign of a resource limitation, but I would like other opinions on the matter; maybe it's something else?

Thanks !

some-random-dude123 avatar Jul 20 '21 14:07 some-random-dude123

I faced the same issue. I tested Loki in a distributed environment in a Microk8s cluster in GCP and deployed Loki in microservice mode.

  • Loki distributor: 3 pods
  • Loki ingester: 3 pods
  • Loki gateway: 1 pod
  • Loki querier: 3 pods
  • Loki query frontend: 1 pod
  • MinIO: 3 pods

4 GCP machines with these specs:

  • Machine type: e2-standard-2 (2 vCPUs, 8 GB memory)
  • Image: ubuntu-2004-focal-v20210702
  • Zone: us-east4-c
  • Boot disk:
      • Size: 20 GB
      • Type: Balanced persistent disk

The ingestion rate was 50,000 log lines per second. After at least 10 minutes of log ingestion, the Log rate dashboard shows a 502 error code when retrieving data.


Abuelodelanada avatar Jul 21 '21 16:07 Abuelodelanada

I have the exact same problem. At first I thought the problem was with the Grafana timeouts and increased them. After that I started getting a 504 error.

mrgreyves avatar Jul 23 '21 09:07 mrgreyves

I encountered the same problem here. I deployed a distributed version of Loki using the loki-distributed helm chart on an 8-node (64 processors, 128 GB memory, 12 TB NVMe each) k3s cluster. Loki Version: 2.2.1

  • Loki distributor: 4 pods
  • Loki ingester: 4 pods
  • Loki gateway: 1 pod
  • Loki compactor: 1 pod
  • Loki querier: 8 pods
  • Loki query frontend: 2 pods
  • Loki memcached-chunks: 4 pods
  • Loki memcached-frontend: 2 pods
  • Loki memcached-index-queries: 1 pod
  • Loki memcached-index-writes: 1 pod
  • Loki ruler: 1 pod
  • MinIO: 8 pods

The index rate is around 20k lines/s. Every stream has 4 to 6 labels, e.g. namespace, container, pod, hostname. When the query range is >= 7 days, Grafana returns a 502 error like the one below at exactly 60 seconds.

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> <html><head> <title>502 Proxy Error</title> </head><body> <h1>Proxy Error</h1> <p>The proxy server received an invalid response from an upstream server.<br /> The proxy server could not handle the request<p>Reason: <strong>Error reading from remote server</strong></p></p> </body></html>

Here are some logs from the Nginx gateway

2021/07/21 17:23:47 [error] 9#9: *12753467 upstream timed out (110: Operation timed out) while reading response header from upstream, client: <DELETE_IP_HERE>, server: , request: "GET /loki/api/v1/query_range?direction=BACKWARD&limit=1000&query=%7Bdata_type%3D%22logs-syslog%22%7D&start=1626283367000000000&end=1626888168000000000&step=600 HTTP/1.1", upstream: "http://<DELETE_IP_HERE>:3100/loki/api/v1/query_range?direction=BACKWARD&limit=1000&query=%7Bdata_type%3D%22logs-syslog%22%7D&start=1626283367000000000&end=1626888168000000000&step=600", host: "loki.<DELETE_DOMAIN_HERE>:80"

2021/07/24 18:47:44 [error] 9#9: *26329 upstream prematurely closed connection while reading response header from upstream, client: <DELETE_IP_HERE>, server: , request: "GET /loki/api/v1/query_range?direction=BACKWARD&limit=1000&query=%7Bdata_type%3D%22logs-containers%22%2Ccontainer%3D%22pg-persister%22%7D&start=1625942821000000000&end=1627152422000000000&step=600 HTTP/1.1", upstream: "http://<DELETE_IP_HERE>:3100/loki/api/v1/query_range?direction=BACKWARD&limit=1000&query=%7Bdata_type%3D%22logs-containers%22%2Ccontainer%3D%22pg-persister%22%7D&start=1625942821000000000&end=1627152422000000000&step=600", host: "loki.<DELETE_DOMAIN_HERE>:80"

Then I found logs from the loki-distributed-query-frontend pod

level=warn ts=2021-07-24T18:47:44.676008137Z caller=logging.go:71 traceID=42e37adcf1ef9caa msg="GET /loki/api/v1/query_range?direction=BACKWARD&limit=1000&query=%7Bdata_type%3D%22logs-containers%22%2Ccontainer%3D%22pg-persister%22%7D&start=1625942821000000000&end=1627152422000000000&step=600 (500) 42.72289471s Response: \"write /var/loki/boltdb-cache/loki_index_18820/compactor-1626190919.gz: no space left on device\\n\" ws: false; Accept: application/json, text/plain, */*; Accept-Encoding: gzip, deflate, br; Accept-Language: en-US,en;q=0.5; Connection: close; Sec-Fetch-Dest: empty; Sec-Fetch-Mode: cors; Sec-Fetch-Site: same-origin; User-Agent: Grafana/7.5.5; X-Forwarded-For: 192.168.87.133; X-Forwarded-Host: loki-perlmutter.nersc.gov:80; X-Forwarded-Port: 80; X-Forwarded-Proto: http; X-Forwarded-Server: traefik-744fb8464c-c529l; X-Grafana-Org-Id: 6; X-Real-Ip: <DELETE_IP_HERE>; "

And I found out that the filesystem of querier-0 is full of boltdb-shipper cache:

[root@shasta-k3s-1 ~]# kubectl exec -it -n perlmutter-loki perlmutter-loki-distributed-querier-0 -- df -h |grep /var/loki
                         50.0G     46.7G      3.3G  93% /var/loki
[root@shasta-k3s-1 ~]# 

How can I clear the boltdb-shipper cache safely, and prevent this from happening again?

Meanwhile, I noticed that only one querier used a huge amount of CPU and memory while the query was being executed. I suspect the query was not distributed to multiple queriers through the query frontend.

Below is some of the config I changed in values.yaml:

      boltdb_shipper:
        shared_store: s3
        active_index_directory: /var/loki/index
        cache_location: /var/loki/boltdb-cache
        cache_ttl: 12h

    querier:
      query_timeout: 10m
      query_ingesters_within: 1h
      engine:
        timeout: 10m

    query_range:
      align_queries_with_step: true
      max_retries: 5
      split_queries_by_interval: 15m
      cache_results: true
      parallelise_shardable_queries: true
      results_cache:
        cache:
          memcached_client:
            consistent_hash: true
            host: {{ include "loki.memcachedFrontendFullname" . }}
            max_idle_conns: 16
            service: http
            timeout: 500ms
            update_interval: 1m


    frontend_worker:
      frontend_address: {{ include "loki.queryFrontendFullname" . }}:9095
      grpc_client_config:
        max_recv_msg_size: 104857600
        max_send_msg_size: 104857600

    frontend:
      log_queries_longer_than: 5s
      compress_responses: true
      tail_proxy_url: http://{{ include "loki.querierFullname" . }}:3100

deng47 avatar Jul 25 '21 06:07 deng47

Are you running on GKE behind a GLB? For me, configuring the GCP backend via a BackendConfig and annotations helped:

apiVersion: cloud.google.com/v1beta1
kind: BackendConfig
metadata:
  name: backend-config
spec:
  timeoutSec: 900

and this annotation on the Ingress or Service:

cloud.google.com/backend-config: '{"ports": {"80":"backend-config"}}'

Also, recreating the Service is a must.
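For reference, a sketch of a Service carrying that annotation; the name, selector, and ports here are placeholders, not values from this thread:

apiVersion: v1
kind: Service
metadata:
  name: loki-gateway          # placeholder name
  annotations:
    cloud.google.com/backend-config: '{"ports": {"80":"backend-config"}}'
spec:
  selector:
    app: loki-gateway         # placeholder selector
  ports:
    - name: http
      port: 80
      targetPort: http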

Hope that helps!

DovilasK avatar Jul 30 '21 23:07 DovilasK

Hello @DovilasK

I'm running it in GCP, and locally using microk8s.

Abuelodelanada avatar Aug 16 '21 18:08 Abuelodelanada

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task, our sincere apologies if you find yourself at the mercy of the stalebot.

stale[bot] avatar Sep 19 '21 19:09 stale[bot]

This issue is still present.

Abuelodelanada avatar Sep 20 '21 01:09 Abuelodelanada

Facing the same issue in an AWS-based setup (k8s on EC2, provisioned by kops, using the loki-distributed helm chart). Quite often I even face this with small queries (1-hour time frame), where the querier could get pretty much everything directly from the ingesters, since we are using a 2h chunk duration.

artem-timchenko avatar Nov 15 '21 11:11 artem-timchenko

We are experiencing the same issue: a GKE cluster 1.19.15-gke.1801 with the loki-distributed helm chart (the same issue with 2.2 and the latest 2.4.1 with the changes described here applied: https://grafana.com/docs/loki/latest/upgrading/#240).

I initially assumed it could be the https://github.com/k8spin/loki-multi-tenant-proxy component which we use between our Grafana and Loki, but I've removed this component from our deployments and still experience the same issue.

I've also tried increasing the loki-querier resources (multiple replicas with 4GB memory each, to exclude memory issues) but still get that 502, and the pods don't show any memory issues.

I also tried direct queries from a debug pod to the Loki API (to rule out Grafana), but hit the same problem. With quite long range queries (usually spanning more than 7 days) I get the 502 error.

I've reviewed the issues and a couple of people have the same problem without any clue what it could be:

https://github.com/grafana/loki/issues/4591 https://github.com/grafana/loki/issues/4582

Maybe related also (but not sure):

https://github.com/grafana/loki/issues/4754 https://github.com/grafana/loki/issues/4779 https://github.com/grafana/loki/issues/4721

@cyriltovena @slim-bean maybe, when you find a moment, you could give a hint on where to look to solve this issue (or what the potential problem could really be). I assume this is related to some specific misconfiguration (I'm experiencing this issue on some quite small clusters, like 6x n1-standard-8 nodes, without a lot of traffic). If somebody could help here, I will provide all configuration details (currently GCS with boltdb-shipper) and try different possible solutions. It would be nice to find out what this is about.

Thx for any input.

nowyp avatar Nov 29 '21 18:11 nowyp

There could be a lot of reasons.

If only a few queriers are doing work, then I'd look at increasing or decreasing the worker config for the querier.

https://grafana.com/docs/loki/latest/configuration/#frontend_worker_config

Try increasing parallelism. There was also a bug where this config would not be set correctly, in which case you need to set match_max_concurrent to false; see https://github.com/grafana/loki/pull/4761

The max_query_parallelism limit might also need to be increased.
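Concretely, a minimal sketch of where those knobs live, using the setting names that appear elsewhere in this thread; the numbers are illustrative, not recommendations:

frontend_worker:
  parallelism: 4              # roughly match the CPUs available per querier
  match_max_concurrent: false # workaround for the bug referenced in PR #4761
querier:
  max_concurrent: 4
limits_config:
  max_query_parallelism: 32   # upper bound on sub-queries executed in parallel per query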

cyriltovena avatar Nov 30 '21 10:11 cyriltovena

@cyriltovena thanks for the hints. I'm running these tests on a separate cluster where there are no other queries (a dedicated environment for this issue, so to speak).

The problem happens when my query looks like this: {namespace="mynamespace"} |= "something something" (I know, of course, how Loki is designed and that labels should mostly be used, but try explaining that to developers who have worked with ES for years for that kind of thing 😄).

But the issue disappears when I have a query like this: {namespace="mynamespace"} |= "something" (so single-word expressions); then I can query without problems even for a month and more. Probably this is related to scanning.

I've made the suggested changes (Loki 2.4.1 based on loki-distributed chart 0.39.3) and currently have a config like this (using a GCS bucket as storage with boltdb-shipper):

      auth_enabled: true
      server:
        http_listen_port: 3100
        grpc_server_max_recv_msg_size: 20388078
        grpc_server_max_send_msg_size: 20388078
        log_level: warn
      distributor:
        ring:
          kvstore:
            store: memberlist
      memberlist:
        join_members:
          - {{ include "loki.fullname" . }}-memberlist
      ingester:
        lifecycler:
          ring:
            kvstore:
              store: memberlist
            replication_factor: 1
        chunk_idle_period: 30m
        chunk_block_size: 262144
        chunk_encoding: snappy
        chunk_retain_period: 1m
        max_transfer_retries: 0
        wal:
          dir: /var/loki/wal        
      limits_config:
        enforce_metric_name: false
        reject_old_samples: true
        reject_old_samples_max_age: 168h
        max_cache_freshness_per_query: 10m
        retention_period: {{ .Values.global.loki.retention.period }}
      schema_config:
        configs:
          - from: 2021-02-01
            store: boltdb-shipper
            object_store: gcs
            schema: v11
            index:
              prefix: index_
              period: 24h
      storage_config:
        gcs:
          bucket_name: mybucket-name
        boltdb_shipper:
          shared_store: gcs
          active_index_directory: /var/loki/index
          cache_location: /var/loki/cache
          cache_ttl: 24h # Can be increased for faster performance over longer query periods, uses more disk space
      chunk_store_config:
        max_look_back_period: 0s
      table_manager:
        retention_deletes_enabled: false
        retention_period: 0s
      query_range:
        align_queries_with_step: true
        max_retries: 5
        split_queries_by_interval: 15m
        cache_results: true
        results_cache:
          cache:
            enable_fifocache: true
            fifocache:
              max_size_items: 1024
              validity: 24h
      querier:
        max_concurrent: 25
      frontend_worker:
        frontend_address: {{ include "loki.queryFrontendFullname" . }}:9095
        parallelism: 25
      frontend:
        log_queries_longer_than: 5s
        compress_responses: true
        tail_proxy_url: http://{{ include "loki.querierFullname" . }}:3100
      compactor:
        shared_store: gcs
        retention_enabled: {{ .Values.global.loki.retention.enabled }}
      ruler:
        storage:
          type: local
          local:
            directory: /etc/loki/rules
        ring:
          kvstore:
            store: memberlist
        rule_path: /tmp/loki/scratch
        alertmanager_url: http://alertmanager-proxy.monitoring:8080
        enable_alertmanager_v2: true
        enable_api: true

nowyp avatar Dec 01 '21 20:12 nowyp

What are the query stats? A higher max_concurrent should come with more CPU allocated per querier.

In query_range you should activate sharding.
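Concretely, sharding is toggled via the parallelise_shardable_queries flag already present in an earlier config in this thread; a minimal sketch:

query_range:
  align_queries_with_step: true
  cache_results: true
  parallelise_shardable_queries: true  # shard filter/metric queries across queriers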

cyriltovena avatar Dec 01 '21 21:12 cyriltovena

@cyriltovena where should match_max_concurrent: false be added in the Loki config?

gmavadiy avatar Dec 01 '21 22:12 gmavadiy

@demandbase-gmavadiya looks like in frontend worker https://github.com/grafana/loki/blob/a0eb507fead4fab828330ab6ac54bc9f48f8a209/production/ksonnet/loki/config.libsonnet#L157

magiejen avatar Dec 02 '21 00:12 magiejen

So I tried it with match_max_concurrent: false, selected a range greater than 24h, and still see a 502 response on the query_range call.

gmavadiy avatar Dec 02 '21 03:12 gmavadiy

@cyriltovena I've adjusted max_concurrent to 4 in this case, as I have 4 CPUs assigned to the querier component pod.

So I have the following for testing:

  querier:
    replicas: 6
    resources:
      limits:
        cpu: 4000m
        memory: 4Gi
      requests:
        cpu: 4000m
        memory: 4Gi

And adjusted settings:

      querier:
        max_concurrent: 4

      frontend_worker:
        frontend_address: {{ include "loki.queryFrontendFullname" . }}:9095
        parallelism: 24 # tried different values here like 4, 16, 24 etc. (finally set 24 for 6 querier replicas with 4 vCPU each)
        match_max_concurrent: false

Regarding the stats (two samples):

caller=spanlogger.go:87 org_id=myorgid Summary.BytesProcessedPerSecond="310 MB" Summary.LinesProcessedPerSecond=1205210 Summary.TotalBytesProcessed="28 GB" Summary.TotalLinesProcessed=107451110 Summary.ExecTime=1m29.155461166s

caller=spanlogger.go:87 org_id=myorgid Ingester.TotalReached=1346 Ingester.TotalChunksMatched=20 Ingester.TotalBatches=0 Ingester.TotalLinesSent=0 Ingester.HeadChunkBytes="115 kB" Ingester.HeadChunkLines=479 Ingester.DecompressedBytes="41 MB" Ingester.DecompressedLines=158975 Ingester.CompressedBytes="5.7 MB" Ingester.TotalDuplicates=0 Store.TotalChunksRef=8151 Store.TotalChunksDownloaded=8151 Store.ChunksDownloadTime=34m27.574269346s Store.HeadChunkBytes="0 B" Store.HeadChunkLines=0 Store.DecompressedBytes="28 GB" Store.DecompressedLines=107291656 Store.CompressedBytes="3.8 GB" Store.TotalDuplicates=0

---

ts=2021-12-02T11:10:12.524935501Z caller=spanlogger.go:87 org_id=myorgid Summary.BytesProcessedPerSecond="304 MB" Summary.LinesProcessedPerSecond=1178664 Summary.TotalBytesProcessed="28 GB" Summary.TotalLinesProcessed=107325013 Summary.ExecTime=1m31.056454484s

caller=spanlogger.go:87 org_id=myorgid Ingester.TotalReached=1346 Ingester.TotalChunksMatched=25 Ingester.TotalBatches=0 Ingester.TotalLinesSent=0 Ingester.HeadChunkBytes="34 kB" Ingester.HeadChunkLines=99 Ingester.DecompressedBytes="54 MB" Ingester.DecompressedLines=161983 Ingester.CompressedBytes="7.4 MB" Ingester.TotalDuplicates=0 Store.TotalChunksRef=8124 Store.TotalChunksDownloaded=8124 Store.ChunksDownloadTime=33m30.270536145s Store.HeadChunkBytes="0 B" Store.HeadChunkLines=0 Store.DecompressedBytes="28 GB" Store.DecompressedLines=107162931 Store.CompressedBytes="3.8 GB" Store.TotalDuplicates=0

I'm just wondering whether it is some kind of problem with the query-frontend proxy, because from the stats, as I understand them, the querier components processed and returned the result.

nowyp avatar Dec 02 '21 11:12 nowyp

I'm getting 502 errors as well with varying time ranges depending on the query. I use the simple scalable deployment mode with 2 read StatefulSets and 2 writes.

From the logs I gather that the query might have actually finished but timed out? The response comes back with an error after 30s. Grafana uses Service DNS for the datasource, so it's not a load balancer timeout.

level=error ts=2021-12-02T11:43:07.433295431Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=info ts=2021-12-02T11:43:07.433499948Z caller=metrics.go:92 org_id=fake latency=fast query="{job=\"fluentbit\"} |= \"error\"" query_type=filter range_type=range length=71h59m59s step=2m0s duration=0s status=499 limit=1000 returned_lines=0 throughput=0B total_bytes=0B
level=info ts=2021-12-02T11:43:08.209164687Z caller=metrics.go:92 org_id=fake latency=slow query="{job=\"fluentbit\"} |= \"error\"" query_type=filter range_type=range length=71h59m59s step=2m0s duration=30.774608718s status=400 limit=1000 returned_lines=282 throughput=147MB total_bytes=4.5GB
level=error ts=2021-12-02T11:43:08.20935126Z caller=scheduler_processor.go:199 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=172.22.180.179:9095
level=error ts=2021-12-02T11:43:08.209389218Z caller=scheduler_processor.go:154 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=172.22.180.179:9095

Additionally, the query pods would often get OOMKilled, and I would love to know if there is a setting that controls how much memory they're allowed to use. It happens regardless of enable_fifocache; to be sure, at one point I had all 4 of them disabled. Are there any other in-memory settings?

mrliptontea avatar Dec 02 '21 13:12 mrliptontea

Another observation is that sometimes logs will show up if I don't apply any pipe operations. For example, {namespace="foo",pod="bar"} will work but {namespace="foo",pod="bar"} |= "blah" will return a 502. The issue is also reproducible when searching for strings that do not exist in the logs.

gmavadiy avatar Dec 02 '21 18:12 gmavadiy

Another observation is that sometimes logs will show up if I don't apply any pipe operations. For example, {namespace="foo",pod="bar"} will work but {namespace="foo",pod="bar"} |= "blah" will return a 502. Issue is also reproducible when searching for strings that do not exist in the logs.

This actually makes sense, because {namespace="foo",pod="bar"} is indexed, so the querier knows how to finish the query early, since the default limit (in Grafana) is 1000 lines. With a pipe, especially if you query for something that doesn't occur often, it will need to download lots of chunks and scan them.

@cyriltovena any ideas how to extend the timeout and reduce memory usage? Can it be done with current settings?

mrliptontea avatar Dec 08 '21 16:12 mrliptontea

@mrliptontea I was getting the same 502 error with a metric query, and I was able to fix it for my specific case. I have a theory (Loki devs, please correct me): for metric and pipe/filter queries, the query has to return ALL the logs before it can perform the metric/filter.

For log queries, if the query hits an error (a Loki server limit), it returns with whatever it has so far. However, unlike a log query, a metric/filter query must return all the logs matching the selector before performing an operation. If the selector query hits an error, it gets cancelled, and the metric query fails as well.

I had this metric query that was consistently hitting 502/504 (Context cancelled)

sum(count_over_time({job="my_job",log_level="error"}[10m])) by (host)

When I looked into just the log query, I saw that it was actually hitting the Loki server limit and returning only 5000 log lines:

{job="server1",log_level="error"}

I increased max_entries_limit_per_query until the log query was returning all the logs rather than just returning the Loki server limit. (Note that if you are querying in Grafana and want to verify via Explore, you would need to increase the Grafana log limit as well.) Once I did this, my metric query returned completely fine on my dashboards.

I had another metric query that was returning 502/504 context cancelled. I did the same thing of only running the log query (with new server limits). Turns out, for that query, I was hitting a grpc limit. After I adjusted the grpc limit, THAT metric query also returned.

TL;DR: If your metric/filter queries are getting cancelled, check whether your log query is FINISHING successfully (i.e. not hitting a Loki server limit or grpc limit).
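A sketch of the two kinds of limits described above, using setting names that appear elsewhere in this thread; the numbers are only examples of raised values, not recommendations:

limits_config:
  max_entries_limit_per_query: 50000      # default is 5000 lines per query
server:
  grpc_server_max_recv_msg_size: 104857600  # 100 MiB
  grpc_server_max_send_msg_size: 104857600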

magiejen avatar Dec 10 '21 18:12 magiejen

It seems like this issue is also related to https://github.com/grafana/loki/issues/3524. As I described in https://github.com/grafana/loki/issues/3524#issuecomment-976602363, the main thing I don't quite understand is why the queriers and query-frontend report no errors whatsoever, the query-frontend logs that it responds with a 200, but for some reason the client (the nginx proxy or Grafana itself) still receives a 502 or EOF error. I tried to look at the code and in fact it seems to be very simple: there is just an io.Copy of the response directly to the client - https://github.com/grafana/loki/blob/e8d14157256918e623c1ebe232b8db7fb67b9284/pkg/lokifrontend/frontend/transport/handler.go#L140. It is really frustrating, as we currently cannot understand what is going wrong.

igorcoding avatar Dec 15 '21 22:12 igorcoding

I have investigated the problem a little more recently, and it turns out that it is reproducible even against the querier directly:

bash-5.1# curl -v -H 'X-Scope-OrgId: main-prod' 'http://10.233.73.111:3100/loki/api/v1/query_range?direction=BACKWARD&limit=500&query=%7Benv%3D%22prod%22%7D%20%7C%3D%20%22111111111%22&start=1643707663535000000&end=1643794063536000000&step=60'
*   Trying 10.233.73.111:3100...
* Connected to 10.233.73.111 (10.233.73.111) port 3100 (#0)
> GET /loki/api/v1/query_range?direction=BACKWARD&limit=500&query=%7Benv%3D%22prod%22%7D%20%7C%3D%20%22111111111%22&start=1643707663535000000&end=1643794063536000000&step=60 HTTP/1.1
> Host: 10.233.73.111:3100
> User-Agent: curl/7.80.0
> Accept: */*
> X-Scope-OrgId: main-prod
>
* Empty reply from server
* Closing connection 0
curl: (52) Empty reply from server

So apparently the querier just closes the socket without any response. In the log output there is a 200 response as usual (I have also added an extra log line printing the response right before it is sent to the client, at https://github.com/grafana/loki/blob/525040a3265715ca8cfb52cd39108ccfac9f8af6/pkg/querier/http.go#L68-L71):

level=info ts=2022-02-02T10:26:43.468724117Z caller=metrics.go:92 org_id=main-prod latency=slow query="{env=\"prod\"} |= \"111111111\"" query_type=filter range_type=range length=24h0m0.001s step=1m0s duration=1m12.569265783s status=200 limit=500 returned_lines=0 throughput=99MB total_bytes=7.2GB 
level=info ts=2022-02-02T12:41:36.824077284Z caller=marshal.go:41 msg="(kts) json response" data="{\"status\":\"success\",\"data\":{\"resultType\":\"streams\",\"result\":[],\"stats\":{\"summary\":{\"bytesProcessedPerSecond\":114428195,\"linesProcessedPerSecond\":160407,\"totalBytesProcessed\":7169210276,\"totalLinesProcessed\":10049907,\"execTime\":62.652480609},\"store\":{\"totalChunksRef\":21948,\"totalChunksDownloaded\":21948,\"chunksDownloadTime\":61.528777027,\"headChunkBytes\":0,\"headChunkLines\":0,\"decompressedBytes\":7169210276,\"decompressedLines\":10049907,\"compressedBytes\":1079183588,\"totalDuplicates\":0},\"ingester\":{\"totalReached\":0,\"totalChunksMatched\":0,\"totalBatches\":0,\"totalLinesSent\":0,\"headChunkBytes\":0,\"headChunkLines\":0,\"decompressedBytes\":0,\"decompressedLines\":0,\"compressedBytes\":0,\"totalDuplicates\":0}}}}"
level=debug ts=2022-02-02T10:26:43.496900653Z caller=logging.go:67 traceID=3272e60924e597df orgID=main-prod msg="GET /loki/api/v1/query_range?direction=BACKWARD&limit=500&query=%7Benv%3D%22prod%22%7D%20%7C%3D%20%22111111111%22&start=1643707663535000000&end=1643794063536000000&step=60 (200) 1m12.598655661s"

As you can see, the querier is ready to respond correctly, but all we get is an empty reply from the server.

querier.query_timeout and querier.engine.timeout are both set to 5m.

Unfortunately, the code investigation didn't give me any insight into what could have gone wrong. I am guessing there is still a timeout of some kind in one of the middlewares that causes the socket to close. But it's really weird that it closes exactly when the querier reports a 200 response code. I would appreciate any help from the Loki devs to understand the root cause of the issue.

igorcoding avatar Feb 02 '22 12:02 igorcoding

@igorcoding can you try doing a Jaeger trace? That might help narrow down the issue.

gmavadiy avatar Feb 02 '22 18:02 gmavadiy

OK, it seems like I figured out which timeout value was causing the issue. There is an http_server_write_timeout setting which defaults to 30s. Apparently, if nothing is written to the response within 30s the connection gets closed, and in my case it closes just when the w.Write call occurs. From the docstring in http/server.go:

// WriteTimeout is the maximum duration before timing out
// writes of the response. It is reset whenever a new
// request's header is read. Like ReadTimeout, it does not
// let Handlers make decisions on a per-request basis.
// A zero or negative value means there will be no timeout.
WriteTimeout time.Duration

So, increasing http_server_write_timeout to a more appropriate value (in my case 2m) solved the issue. Hope this helps someone else in the future.
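For anyone searching later, this is where the setting sits in the Loki config; 2m is just the value that worked for me, so adjust it to your longest expected query:

server:
  http_server_write_timeout: 2m  # default is 30s
  http_server_read_timeout: 2m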

igorcoding avatar Feb 03 '22 13:02 igorcoding

In my case, queries fail quicker than 30s, and the bigger problem seems to be the OOM killer.

This is definitely related to the amount of log lines in storage (S3): in another cluster I have, with the exact same setup but much lower traffic and thus fewer log lines, I can run the same query against Loki and eventually it will load the results. But on the cluster with high traffic and lots of logs, it crashes the read Pod(s) almost instantly when I run the query.

Edit: After some more tweaking, extending max_query_length, and running a query over a ridiculously long period, I can get a 502 error from my low-traffic cluster's Loki after exactly 30 seconds, even though -print-config-stderr confirms there isn't a single setting left at 30s because I bumped them all. This is with Grafana talking to Loki within the cluster, directly via a Service, with no NGINX or load balancer in front of it.

mrliptontea avatar Feb 08 '22 22:02 mrliptontea

I have the exact same problem. In my case, my configuration uses a GCS backend.

loki-linux-amd64[14413]: level=error ts=2022-02-10T18:57:03.226397424Z caller=retry.go:73 org_id=fake traceID=5ff7c24474ace112 msg="error processing request" try=0 err="context canceled"

marcusteixeira avatar Feb 10 '22 19:02 marcusteixeira

For all of you getting 30s timeouts and 502s, please be sure to check the data proxy timeout in Grafana; its default is 30s, and the result it returns is, I believe, a 504.
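If Grafana runs from its Helm chart, the data proxy timeout can be raised via values like the sketch below (in a plain install it is the timeout setting under [dataproxy] in grafana.ini); 300 is only an example:

grafana.ini:
  dataproxy:
    timeout: 300     # seconds; Grafana's default is 30
    logging: true    # optional: log proxied requests to help debug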

Also make sure in your Loki config you have these timeouts set high enough:

server:
  http_server_read_timeout: 300s
  http_server_write_timeout: 300s

For those of you seeing OOM crashes, you can reduce the amount of parallelism a single process will do by reducing

querier:
  max_concurrent: 

The default value is 10; you can lower this to process less data at one time (slower results but less memory).

Note there is a bug in the 2.4.x releases where the match_max_concurrent setting doesn't work properly; unfortunately the fix missed the last backport, so it's not in 2.4.2 either. For the time being you should also set

query_frontend:
  parallelism: 

to the same value as max_concurrent.
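Putting that together (the parallelism setting appears earlier in this thread under the frontend_worker block), a hedged sketch of the workaround; 10 simply mirrors the default max_concurrent:

querier:
  max_concurrent: 10           # default; lower it to trade speed for memory
frontend_worker:
  parallelism: 10              # keep equal to max_concurrent
  match_max_concurrent: false  # because of the 2.4.x bug described above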

slim-bean avatar Feb 10 '22 20:02 slim-bean

Good shout re: the Grafana dataproxy. I have now successfully run a query that took 48s. :tada:

However, I set the timeout to 300, but after exactly 1 minute I get a 504 in Grafana. With dataproxy logging enabled I found this:

# Grafana
t=2022-02-10T22:15:32+0000 lvl=eror msg="Data proxy error" logger=data-proxy-log userId=0 orgId=1 uname= path=/api/datasources/proxy/4/loki/api/v1/query_range remote_addr=<redacted> referer="<redacted>" error="http: proxy error: context canceled"
t=2022-02-10T22:15:32+0000 lvl=eror msg="Request Completed" logger=context userId=0 orgId=1 uname= method=GET path=/api/datasources/proxy/4/loki/api/v1/query_range status=502 remote_addr=<redacted> time_ms=60000 size=0 referer="<redacted>"

# Loki
level=error ts=2022-02-10T22:15:34.63861022Z caller=scheduler_processor.go:199 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=172.16.216.17:9095
level=error ts=2022-02-10T22:15:34.638634499Z caller=scheduler_processor.go:154 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=172.16.216.17:9095
level=error ts=2022-02-10T22:15:39.443394053Z caller=scheduler_processor.go:199 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=172.16.216.17:9095
level=error ts=2022-02-10T22:15:39.443430442Z caller=scheduler_processor.go:154 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=172.16.64.191:9095

Notice how Grafana saw a 502 status a good 2-7 seconds before the Loki errors; other logs indicate the queries continued to run until the errors. I am really puzzled by this :confused:

mrliptontea avatar Feb 10 '22 22:02 mrliptontea

@mrliptontea I use the loki-distributed helm chart for Loki, and there is a gateway component in it (it's nginx, basically). I was seeing timeouts and hence modified these values (the default for both is 60s):

  • proxy_read_timeout (http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_read_timeout)
  • proxy_send_timeout (http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_send_timeout)

So if you are using the gateway, try modifying the values I mentioned and it should help.

Also, the error I saw on the gateway pod was the following:

upstream timed out (110: Operation timed out) while reading response header from upstream

mehta-ankit avatar Feb 10 '22 23:02 mehta-ankit

(Unfortunately) I don't use the helm chart and I don't have the NGINX gateway; Grafana proxies the requests and they go straight to a Kubernetes Service that selects Loki's read instances.

mrliptontea avatar Feb 11 '22 00:02 mrliptontea