
loki failing when querying huge data

Open isshwar opened this issue 3 years ago • 42 comments

Hello,

The Loki querier is failing with the error below when trying to pull logs (for example, the last 30 days):

504 Gateway Time-out

nginx/1.19.10

We are running a distributed setup with 4 querier pods, each with 4Gi of RAM. When I try to pull log data for 30 days, I see an error message in Grafana. I have increased the timeouts on the nginx proxy that sits between Loki and Grafana, but it didn't help.

On the other hand, when I pull shorter periods and gradually work up to 30 days, I am able to see the logs: for example, I start with 7 days, then 12-14 days, then 20-25 days, and finally 30 days. This way I can pull the logs; I'm not sure if the results are being cached in the querier.

Our intention in testing Loki is to replace ELK, but unless I can at least pull the data without timeouts, we cannot completely switch from ELK.

Do you have any ideas/suggestions on fixing this?

Thanks Eswar

isshwar avatar Sep 07 '22 11:09 isshwar

:wave: it is not uncommon to hit a timeout. What does your query look like? Could you narrow the search with more labels?
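For example, a narrower stream selector plus an early line filter cuts the amount of data each querier has to scan; a sketch with hypothetical label values:

{env="prod", app="reading-metadata", instance="reading-metadata-0"} |= "error"

Every label you can pin down shrinks the set of streams Loki has to open before any filtering happens.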

jeschkies avatar Sep 07 '22 11:09 jeschkies

Hi team, we have the same issue with loki and gw timeout.

We have a self-hosted S3 (Ceph) and a rather huge volume of logs (for example, 40-50GiB per day). We try to look at 3h with the simplest request (i.e. {pod_labels_app=~"api-green|api-blue"} |= 'site.com') and get a timeout after 3-4 minutes.

<html> <head><title>504 Gateway Time-out</title></head> <body> <center><h1>504 Gateway Time-out</h1></center> <hr><center>nginx</center> </body> </html>

Trying to understand what we are doing wrong...

Our loki-distributed values file looks like:

# Values: https://github.com/grafana/helm-charts/tree/main/charts/loki-distributed
---
loki:
  structuredConfig:
    auth_enabled: false
    analytics:
      reporting_enabled: false
    chunk_store_config:
      max_look_back_period: 0s
    compactor:
      shared_store: filesystem
      # shared_store: aws
      working_directory: /var/loki/boltdb-shipper-compactor
    distributor:
      ring:
        kvstore:
          store: memberlist
    frontend:
      compress_responses: true
      log_queries_longer_than: 5s
      tail_proxy_url: http://loki-loki-distributed-querier:3100
    frontend_worker:
      frontend_address: loki-loki-distributed-query-frontend:9095
    ingester:
      chunk_block_size: 262144
      chunk_encoding: snappy
      chunk_idle_period: 30m
      chunk_retain_period: 1m
      lifecycler:
        ring:
          kvstore:
            store: memberlist
          replication_factor: 1
      max_transfer_retries: 0
      wal:
        dir: /var/loki/wal
    limits_config:
      enforce_metric_name: false
      max_cache_freshness_per_query: 10m
      reject_old_samples: true
      reject_old_samples_max_age: 168h
      split_queries_by_interval: 15m
    memberlist:
      join_members:
      - loki-loki-distributed-memberlist
    query_range:
      align_queries_with_step: true
      cache_results: true
      max_retries: 5
      results_cache:
        cache:
          enable_fifocache: true
          fifocache:
            max_size_items: 1024
            validity: 24h
    ruler:
      alertmanager_url: https://alertmanager.xx
      external_url: https://alertmanager.xx
      ring:
        kvstore:
          store: memberlist
      rule_path: /tmp/loki/scratch
      storage:
        local:
          directory: /etc/loki/rules
        type: local
    schema_config:
      configs:
      - from: "2022-07-11"
        store: boltdb-shipper
        object_store: s3
        schema: v11
        index:
          prefix: loki_index_
          period: 24h
        # object_store: filesystem
    server:
      http_listen_port: 3100
      grpc_server_max_recv_msg_size: 24194304
      grpc_server_max_send_msg_size: 24194304 

    storage_config:
      boltdb_shipper:
        active_index_directory: /var/loki/boltdb-shipper-active
        cache_location: /var/loki/boltdb-shipper-cache
        cache_ttl: 24h
        shared_store: s3
      aws:
        insecure: true
        s3: https://[path]/storage-prod
        s3forcepathstyle: true
        http_config:
          idle_conn_timeout: 90s
          response_header_timeout: 0s
          insecure_skip_verify: true
    
      # boltdb_shipper:
      #   active_index_directory: /var/loki/index
      #   cache_location: /var/loki/cache
      #   cache_ttl: 168h
      #   shared_store: filesystem
      # filesystem:
      #   directory: /var/loki/chunks
    table_manager:
      retention_deletes_enabled: false
      retention_period: 0s


queryFrontend:
  replicas: 3
compactor:
  persistence:
    enabled: true
    size: 100Gi
indexGateway:
  enabled: true
  persistence:
    enabled: true
    size: 100Gi
ingester:
  persistence:
    enabled: true
    size: 100Gi
querier:
  persistence:
    enabled: true
    size: 100Gi
ruler:
  persistence:
    enabled: true
    size: 100Gi

yaroslavkasatikov avatar Sep 07 '22 13:09 yaroslavkasatikov

Thanks for your reply.

I have only three labels in my config:

labels instance=$kubernetes['pod_name'],env=$env,app=$app

My query looks like:

{env="prod",app="reading-metadata"} | json |= "log=error"

What do I want to achieve? While analysing issues, I sometimes have to search for certain text (metadata related to the project) and may have to go back as far as 30 days.

Added to this, we ship ~20GB of data daily to the S3 storage.

isshwar avatar Sep 07 '22 13:09 isshwar

Hi, I spent a year debugging slow LogQL queries through a tracing system, so I have some experience you can refer to.

The '| json' expression is slow; you can pre-filter some logs with a line filter ('|=') first:

{env="prod",app="reading-metadata"} |= "error"| json |= "log=error" If the performance is still too slow, you can also speed up the query by replacing 'json' with 'regex'

{env="prod",app="reading-metadata"} |= "error"| | regexp "log":(?P.*?)(}|,)|= "log=error"

If it's still too slow, then in addition to adding more machines, you can also try my branch, which is designed to solve the problem of grep being too slow: https://github.com/grafana/loki/pull/5455

liguozhong avatar Sep 08 '22 08:09 liguozhong

split_queries_by_interval: 15m — another supplementary point: if you do not have enough querier instances, or do not configure a large enough max_queriers_per_tenant value, this configuration will make 7-30 day LogQL queries very slow (a sketch of the related knobs follows below).
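A minimal sketch of those knobs with illustrative values: with a 15m split interval, a 30-day query fans out into ~2880 subqueries, so the parallelism settings have to keep up:

limits_config:
  split_queries_by_interval: 15m  # each 15m slice of the range becomes one subquery
  max_query_parallelism: 32       # subqueries one query may run in parallel
  max_queriers_per_tenant: 0      # 0 = the tenant may use every querier
querier:
  max_concurrent: 8               # subqueries a single querier works on at once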

liguozhong avatar Sep 08 '22 08:09 liguozhong

Hi team, we have the same issue with loki and gw timeout. […] (my earlier comment and values file, quoted in full above)
@liguozhong @jeschkies could you please take a look at my config and give some advice? I don't want to create a new issue, because the problem is the same: queries over S3-backed logs take too long. I scaled the queriers up to 12 pods (they use ~2GB of RAM each while searching), split them from the index gateway, and added 3 query frontends. No luck: a 3-hour query still goes down with a 504 Gateway Time-out :( Yes, we have rather a lot of logs (~30GB per day); they are plain lines, not JSON or anything like that, and sometimes quite long. I have no ideas left for how to improve it. The S3 is Ceph-backed with SSDs, so the disks are rather fast, and I haven't seen any network issues (there is a load-balancer pool in front of S3). There are no errors in the logs, except this on the frontend:

level=info ts=2022-09-07T13:39:50.815911581Z caller=handler.go:174 org_id=fake msg="slow query detected" method=GET host=loki-loki-distributed-query-frontend.services.svc.cluster.local:3100 path=/loki/api/v1/query_range time_taken=37.116422835s param_end=1662425999000000000 param_limit=1000 param_query="{pod_labels_app=~\"api-green|api-blue\"} |~ `.*OK`" param_start=1662411600000000000 param_step=10000ms param_direction=backward

yaroslavkasatikov avatar Sep 09 '22 08:09 yaroslavkasatikov

Want to add that if you search logs without a filter, i.e. a query like {pod_labels_app=~"api.*"}, it works for 24h ranges but returns 504 for larger volumes (screenshot: photo_2022-09-09_12-43-09).

But if we add something like {pod_labels_app=~"api.*"} |= "text", it leads to a full timeout.

yaroslavkasatikov avatar Sep 09 '22 10:09 yaroslavkasatikov

@yaroslavkasatikov may I know if you are running this on Kubernetes? Could you let me know the replicas and resources you have provisioned for the stack?

@liguozhong still no luck. I increased resources and am still seeing the same error. While it is true that this cannot be compared directly with ELK, we want to replace ELK with Loki, so at a minimum the query has to run to completion (the search criteria vary). Unless it stops throwing 50* errors, we cannot use it to fully replace ELK.

Also, for a Kubernetes distributed setup, do you have any suggestions on replicas/resources?

Below is the Loki config file. Anything to improve?

auth_enabled: false
compactor:
  shared_store: s3
  working_directory: /data/compactor
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
  compactor_ring:
    kvstore:
      store: memberlist
distributor:
  ring:
    kvstore:
      store: memberlist
frontend:
  compress_responses: true
  log_queries_longer_than: 27s
frontend_worker:
  frontend_address: loki-query-frontend:9095
  grpc_client_config:
    max_send_msg_size: 17179869184 # 16GiB
querier:
  engine:
    timeout: 3m
    max_look_back_period: 40s
  query_timeout: 2m
  query_ingesters_within: 1s
ingester:
  flush_check_period: 30s
  flush_op_timeout: 3m
  concurrent_flushes: 64
  chunk_block_size: 15728640
  chunk_encoding: lz4
  chunk_idle_period: 1m
  chunk_retain_period: 0m
  max_chunk_age: 5m
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 1
  max_transfer_retries: 0
  wal:
    dir: /data/loki/wal
limits_config:
  max_entries_limit_per_query: 0
  ingestion_rate_strategy: global
  reject_old_samples: true
  ingestion_rate_mb: 100
  ingestion_burst_size_mb: 100
  max_cache_freshness_per_query: 1m
  creation_grace_period: 1m
  enforce_metric_name: false
  split_queries_by_interval: 15m
memberlist:
  join_members:
    - loki-memberlist
query_range:
  align_queries_with_step: true
  parallelise_shardable_queries: true
  cache_results: true
  max_retries: 5
  results_cache:
    cache:
      enable_fifocache: true
      fifocache:
        max_size_items: 1024
        validity: 24h
      memcached_client:
        host: ${PROJECT_NAME}-memcached-frontend.${ENVIRONMENT}.svc.k3s-test.local
        service: http
ruler:
  alertmanager_url: ${ALERTMANAGER_URL}
  ring:
    kvstore:
      store: memberlist
  rule_path: /var/loki/rules-tmp
  storage:
    local:
      directory: /loki/rules
    type: local
  enable_api: true
  enable_alertmanager_v2: true
  enable_sharding: true

  alertmanager_client:
    basic_auth_username: "${ALERTMANAGER_USERNAME}"
    basic_auth_password: "$${q}{ALERTMANAGER_PASSWORD}"
schema_config:
  configs:
    - index:
        period: 24h
        prefix: loki_index_
      object_store: s3
      schema: v11
      store: boltdb-shipper
server:
  http_listen_port: 3100
  log_level: info
  grpc_server_max_recv_msg_size: 10485760000
  grpc_server_max_send_msg_size: 10485760000
  http_server_write_timeout: 10m
  http_server_read_timeout: 10m
  http_server_idle_timeout: 10m
  graceful_shutdown_timeout: 2m
tracing:
  enabled: false
storage_config:
  boltdb_shipper:
    active_index_directory: /data/loki/index
    cache_location: /data/loki/cache
    index_gateway_client:
      server_address: dns:///loki-index-gateway:9095
    shared_store: s3
  aws:
    access_key_id: ${S3_ACCESS_KEY_ID}
    bucketnames: ${S3_BUCKET_NAME}
    endpoint: https://pure-storage.pageplace.de
    s3forcepathstyle: true
    secret_access_key: $${q}{S3_SECRET_ACCESS_KEY}
    http_config:
      response_header_timeout: 5s
      insecure_skip_verify: true
  index_queries_cache_config:
    memcached:
      batch_size: 1000
      parallelism: 1000
    memcached_client:
      host: ${PROJECT_NAME}-memcached-index-queries.${ENVIRONMENT}.svc.k3s-test.local
      service: http
chunk_store_config:
  chunk_cache_config:
      memcached:
        batch_size: 1000
        parallelism: 1000
      memcached_client:
        host: ${PROJECT_NAME}-memcached-chunks.${ENVIRONMENT}.svc.k3s-test.local
        service: http
  max_look_back_period: 0s
  write_dedupe_cache_config:
      memcached:
        batch_size: 1000
        parallelism: 1000
      memcached_client:
        host: ${PROJECT_NAME}-memcached-index-writes.${ENVIRONMENT}.svc.k3s-test.local
        service: http


isshwar avatar Sep 09 '22 10:09 isshwar


Increase ‘max_query_parallelism’ and ‘max_concurrent’, and add ‘split_queries_by_interval’ (e.g. split_queries_by_interval: 15m). With that split, your current configuration turns one long query into a huge number of subqueries, and it needs enough parallelism to handle that load.

You need to show more of the error log. If the error is a 504 timeout, it may also be the proxy timeout of Grafana's Loki data source: the default of only 30s is a very short proxy timeout, and this Grafana data source HTTP proxy is what forwards the LogQL queries.
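If the 504 is indeed coming from Grafana's own proxy, the global proxy timeout can be raised in grafana.ini (value illustrative):

[dataproxy]
timeout = 300  # seconds; the default is 30

Newer Grafana versions also expose a per-data-source timeout in the data source settings, which overrides the global value.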

liguozhong avatar Sep 09 '22 14:09 liguozhong

I have successfully migrated our ELK to Loki. It can completely replace ELK's original functionality, but Loki needs to be carefully tuned and scaled out.

1: grafana BI dashboard
2: alert rule
3: explore grep
4: api
5: explore OLAP

In the early stage, it took about 2 years to study the Loki source code and deploy the Loki cluster, and then 9 months to migrate off and decommission ELK.

liguozhong avatar Sep 09 '22 15:09 liguozhong

For OLAP-style BI reports over large 30-day queries, it is recommended that you store the data in Prometheus through Loki's recording rule feature; Loki is not very good at OLAP.
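A hedged sketch of what that can look like, assuming a Loki version with recording-rule support (≥ 2.3) and a Prometheus remote-write endpoint; the rule name, selector, and URL are illustrative. A rule file loaded by the ruler:

groups:
  - name: rollups
    interval: 1m
    rules:
      - record: app:log_lines:rate1m
        expr: sum by (app) (rate({env="prod"}[1m]))  # LogQL metric query evaluated on a schedule

And the ruler pushing the resulting series to Prometheus:

ruler:
  remote_write:
    enabled: true
    client:
      url: http://prometheus.example:9090/api/v1/write  # hypothetical endpoint

The 30-day BI query then becomes a cheap PromQL query over pre-computed series instead of a scan over raw chunks.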

liguozhong avatar Sep 09 '22 15:09 liguozhong

If you need to watch performance issues over a long period, I suggest you trace Loki's queries with a tracing system and read the traces alongside Loki's source code to see where it is slow.

liguozhong avatar Sep 10 '22 00:09 liguozhong

Thank you for sharing your work on BI-like workloads on Loki. I have a somewhat similar use case: I would like to store an enriched network-flow data model to provide a backend for network observability. In a netflow we find, among other things:

  • SrcAddr: 2^32 values (~4Billions)
  • DstAddr: 2^32 values (~4Billions)
  • SrcAS: 65K values
  • DstAS: 65K values
  • SrcPort: ~50K values (even if only few are wide used)
  • DstPort: ~50K values (even if only few are wide used)

So even with only those 6 of the 50+ fields, we have a potential cardinality of roughly ~2e+38. We can assume that the cardinality associated with netflow data is effectively unbounded.

I'm wondering if Loki is the best tool for that from the query perspective. Ingestion is doing really great; the query side from Grafana, however, is not. I'm unable to execute large queries such as rate({app="goflow2", way="egress"} | json | unwrap bytes | __error__="" [1s]). I have plenty of resources dedicated to each target, but they seem to go unused:

  • 2 * query-scheduler 2 core/7GB
  • 2 * query-frontend 2 core/7GB
  • 20 * querier 16 core/60GB
  • 4 * distributor 4 core/15GB
  • 4 * ingester 16 core/60GB

I've been poking around the settings to get a working query; this is where I am now. Basically, I've increased each limit that prevented the execution of the query above. I reckon this is not the best way to use the compute resources at my disposal efficiently.

auth_enabled: False
server:
  grpc_listen_address: 10.0.3.41
  grpc_server_max_concurrent_streams: 1000
  grpc_server_max_recv_msg_size: 80000000
  grpc_server_max_send_msg_size: 80000000
  http_listen_address: 10.0.3.41
  http_listen_port: 3100
  http_server_idle_timeout: 40m
  http_server_read_timeout: 20m
  http_server_write_timeout: 20m
  log_format: json
  log_level: debug

distributor:
  ring:
    instance_interface_names:
    - ens4
    kvstore:
      prefix: loki/collectors/
      store: consul

querier:
  engine:
    timeout: 15m
  max_concurrent: 512
  query_timeout: 5m

ingester_client:
  grpc_client_config:
    grpc_compression: snappy
    max_recv_msg_size: 1048576000
    max_send_msg_size: 1048576000

ingester:
  lifecycler:
    interface_names:
    - ens4
    num_tokens: 128
    ring:
      kvstore:
        prefix: loki/collectors/
        store: consul
      replication_factor: 1
      zone_awareness_enabled: false

storage_config:
  boltdb:
    directory: /var/lib/loki/index
  filesystem:
    directory: /var/lib/loki/chunks

schema_config:
  configs:
  - from: 2020-05-15
    index:
      period: 168h
      prefix: index_
    object_store: filesystem
    schema: v11
    store: boltdb

limits_config:
  ingestion_burst_size_mb: 100
  ingestion_rate_mb: 50
  max_entries_limit_per_query: 20000
  max_queriers_per_tenant: 40
  max_query_parallelism: 5000
  max_query_series: 100000000
  per_stream_rate_limit: 30MB
  per_stream_rate_limit_burst: 60MB
  split_queries_by_interval: 5m

frontend_worker:
  grpc_client_config:
    grpc_compression: snappy
    max_recv_msg_size: 1048576000
    max_send_msg_size: 1048576000
  parallelism: 64
  scheduler_address: query-scheduler.loki-grpc.service.consul:9095

frontend:
  compress_responses: true
  grpc_client_config:
    grpc_compression: snappy
    max_recv_msg_size: 1048576000
    max_send_msg_size: 1048576000
  instance_interface_names:
  - ens4
  scheduler_address: query-scheduler.loki-grpc.service.consul:9095
  scheduler_worker_concurrency: 16

Neither my queriers nor my query-frontends seem to be busy. Am I missing something obvious in my configuration? The queriers seem to have a hard time reaching the scheduler, as I get the following error message: error notifying scheduler about finished query. The queriers also send huge responses (response larger than max message size), followed by rpc error: code = ResourceExhausted desc = grpc: received message larger than max (100113955 vs. 80000000), despite the fact that the queriers manage to complete the query, and I then see error notifying frontend about finished query (a sketch of raising these gRPC caps follows below).

For context, I'm running Loki 2.6.1 on bare metal (one target per host) with Consul for the rings. The ingestion rate is stable at 220k/sec. Once I manage to solve the querying part, I will back Loki with an S3-like object storage.
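On the ResourceExhausted error quoted above (100113955 vs. 80000000): the response has outgrown the 80 MB gRPC cap set in the server block. A sketch of raising the caps consistently on both the server and client side (128 MiB is illustrative; responses this large usually also mean the query should be split or sharded further):

server:
  grpc_server_max_recv_msg_size: 134217728  # 128 MiB, up from 80000000
  grpc_server_max_send_msg_size: 134217728
frontend_worker:
  grpc_client_config:
    max_recv_msg_size: 134217728
    max_send_msg_size: 134217728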

wilfriedroset avatar Sep 10 '22 06:09 wilfriedroset

If you are using eBPF to monitor network requests, I recommend pre-aggregating on the agent side and pushing the pre-aggregated metrics to Prometheus. Loki should only be responsible for searching the detailed logs, e.g. {app="bar"} |= "404"; metrics should be handled by Prometheus.

Loki is not very good at OLAP right now.

liguozhong avatar Sep 10 '22 08:09 liguozhong

If you have to use Loki to store eBPF network monitoring data, then you should try to improve the performance of the LogQL itself. E.g., replace the old query rate({app="goflow2", way="egress"} | json | unwrap bytes | __error__="" [1s]) with the new one:

sum(
  rate(
    {app="goflow2", way="egress"}
      | regexp `"bytes":(?P<bytes>.*?)(,|\{|\[)`
      | unwrap bytes
    [1m]
  )
) by (app, way)

liguozhong avatar Sep 10 '22 08:09 liguozhong

If you are using eBPF to monitor network requests, I recommend pre-aggregating on the agent side and pushing the pre-aggregated metrics to Prometheus.

I cannot pre-aggregate the data, as my netops team's queries depend heavily on the situation at hand. As such, Prometheus is not suited for this, since the cardinality is almost unlimited. We are investigating tools such as:

  • https://github.com/netobserv/flowlogs-pipeline
  • https://github.com/netobserv/goflow2-loki-exporter
  • https://github.com/netsampler/goflow2

In fact, I'm trying to use Loki to mimic Kentik's features.

In your rewritten query, are you using the regexp parser instead of json due to the performance issue? If so, do you have experience with the logfmt parser from a performance perspective?

wilfriedroset avatar Sep 10 '22 09:09 wilfriedroset

I cannot pre-aggregate the data, as my netops team's queries depend heavily on the situation at hand. […] (previous comment quoted in full above)

Hi, you can pre-aggregate the network data through promtail's 'metrics' pipeline stage; LogQL 'metric queries' alone will not achieve the performance you expect. https://grafana.com/docs/loki/latest/clients/promtail/stages/metrics/
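A minimal sketch of such a pipeline stage, assuming JSON netflow lines with a bytes field (the job name, labels, and file path are hypothetical):

scrape_configs:
  - job_name: netflow
    static_configs:
      - targets: [localhost]
        labels:
          app: goflow2
          __path__: /var/log/goflow2/*.log  # hypothetical path
    pipeline_stages:
      - json:
          expressions:
            bytes: bytes          # extract the bytes field from each JSON line
      - metrics:
          netflow_bytes_total:
            type: Counter
            description: "total bytes seen in netflow records"
            source: bytes         # use the extracted value
            config:
              action: add         # add the value to the counter

Prometheus then scrapes promtail's /metrics endpoint, and the heavy aggregation never touches Loki's query path.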

liguozhong avatar Sep 13 '22 07:09 liguozhong

logfmt parser

The 'logfmt parser' has no performance issues as far as I know, but the 'json parser' does have lower performance.
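So, if you control the log format, emitting logfmt instead of JSON keeps a structured parser without the json-parser cost; a sketch, assuming hypothetical logfmt-encoded lines with a level key:

{env="prod", app="reading-metadata"} |= "error" | logfmt | level="error"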

liguozhong avatar Sep 13 '22 08:09 liguozhong

@liguozhong along with applying the suggestions you pointed out and increasing the system resources, I now see gateway errors less often than earlier. For the time being I am stopping there, as I have run into another issue.

We have ELK running in parallel with Loki. When I compare log lines for the same query between Loki and ELK, I see that Loki is missing/not reporting some log lines that I can see in Kibana. I wanted to know whether Loki might be dropping some log lines at ingestion, whether the querier might not be returning all the log lines, or whether this is some other config issue.

Thanks Eswar

isshwar avatar Sep 19 '22 13:09 isshwar

sum by (tenant,reason) (rate(loki_discarded_bytes_total{}[1m]))>0

This likely means Loki's rate-limit behavior is being triggered. You can check which reason applies with the PromQL above, and then adjust the corresponding configuration threshold for that reason (a sketch of common mappings follows below).

more info: https://github.com/grafana/loki/pull/7145/files
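Once the reason label is known, the matching thresholds live mostly in limits_config. A sketch of common mappings (values illustrative; raise only the limit that matches the observed reason):

limits_config:
  ingestion_rate_mb: 20             # if the reason points at the global rate limit
  ingestion_burst_size_mb: 40
  per_stream_rate_limit: 10MB       # if a single stream is being limited
  per_stream_rate_limit_burst: 20MB
  reject_old_samples_max_age: 336h  # if samples are rejected for being too old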


liguozhong avatar Sep 21 '22 05:09 liguozhong

@liguozhong

Unfortunately, at least in the last 24 hours, I don't see any discarded bytes after making some changes, including reject_old_samples: false and max_chunk_age: 120m. I have also reduced the number of streams by pausing the logs from the prod cluster, to rule out network congestion as the cause. Do you have any other suggestions?

isshwar avatar Sep 21 '22 11:09 isshwar

Early Loki versions may have had some bugs that returned inaccurate data. What is your current Loki version? In addition, I would need to look at more detailed data (such as the exact LogQL and how long the query range is). I think it's hard to share such comprehensive data here, so this problem will be difficult to solve within this issue.

liguozhong avatar Sep 28 '22 05:09 liguozhong

Hi, same issue here. Loki version 2.6.1 with default configuration, using the distributed Helm chart. I increased almost every timeout setting in the nginx gateway and in Loki itself. My test query: {cluster="test"} |= "test". My traffic is 6k log lines per second and 50GiB of logs per day, using a MinIO S3 backend. Error: 504 from the nginx gateway, sometimes 502 (on 3-6 hour queries).

LinTechSo avatar Sep 28 '22 08:09 LinTechSo

@liguozhong

I am running 2.6.1. I didn't completely follow you. Are you saying that the problem is difficult to solve? While looking for more details on the issue, I found other users struggling with the same thing, so this is not an issue with my config alone. I only became aware of it when I started to compare the log data between ELK and Loki.

Without fixing this, the migration would stall in my case. Do you think that this can be fixed?

isshwar avatar Sep 28 '22 08:09 isshwar

I think this issue is quite similar to https://github.com/grafana/loki/issues/4015. I applied the suggestions from that issue too, but nothing helped and the problem still persists. Also, in my opinion, increasing the timeout configuration is not a solution; it doesn't get to the core of the problem, which is slow query performance.

LinTechSo avatar Sep 28 '22 09:09 LinTechSo

Hi, I have an update on my test with the above traffic rate and the configuration below. Any suggestion/advice would help, thanks @slim-bean. My resources:

Querier:
    Memory:  6Gi
    CPU:      
          Request: 3
          Limit:    6
    Min Replica:    3
Ingester:
      Memory: 7Gi
      CPU:
            Request: 500m
            Limit:     1
      Min Replica:    2
Query FrontEnd:
        Memory: 1Gi
        CPU:
              Request: 1
              Limit: 2
        Min Replica: 2
Query Scheduler:
        Memory: 1Gi
         CPU:
                Request: 1
                Limit: 2
        Min Replica: 2

and my configuration

      max_recv/send_msg_size:     536870912000
      http_server_read/write/idle_timeout:     600
      proxy_read/send_timeout     300;
      proxy_connect_timeout     300;
      client_max_body_size     100M;
      Querier timeout:  10m
      max_query_parallelism:     32
      split_queries_by_interval:     15m
      Max_concurrent:     512
      Cache enabled
      Querier replica set to 3 with hpa

And the result (3-6 hour query):

(screenshot: Screenshot_2022-09-30_09-58-10)

I also increased the querier replicas to 32, but nothing changed. I'd appreciate it if you could help me figure this out: how can I increase performance, or at least get rid of the 504 errors?

LinTechSo avatar Oct 02 '22 12:10 LinTechSo

The 504 error is coming from nginx; increasing the timeouts in nginx should remove it. proxy_read_timeout, iirc, is the timeout to change (a sketch follows below).
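A hedged sketch of the relevant directives (upstream name and values illustrative):

location / {
    proxy_pass http://loki-query-frontend:3100;  # hypothetical upstream
    proxy_connect_timeout 30s;
    proxy_send_timeout    300s;
    proxy_read_timeout    300s;  # how long nginx waits for Loki's response; the usual 504 culprit
}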

slim-bean avatar Oct 02 '22 12:10 slim-bean

I just typed out a longer response but I'm on my phone and bumped a link and lost it 😬

The short version is that we are able to get query throughput out of Loki at rates of 50-100GB/s; we've had good success making Loki queries fast, although many factors, from config to resources to storage, affect performance.

But not all storage is created equal... We use GCS on Google and S3 on Amazon which both work very well for highly parallelized workloads.

One of the examples in this thread is using a filesystem store, which won't parallelize at all, and I saw a mention of Ceph, which can be fast but which we've also seen be slow; it depends on how well it's built.

I'll try to follow up next week with some benchmark examples using the minio warp tool
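In the meantime, a warp run against an S3 endpoint looks roughly like this (flags recalled from the warp README and possibly different between versions, so treat the exact invocation as an assumption):

warp get --host=minio.example:9000 --access-key=ACCESS --secret-key=SECRET --duration=1m --obj.size=1MiB --concurrent=32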

slim-bean avatar Oct 02 '22 13:10 slim-bean

The short version is that we are able to get query throughput out of Loki at rates of 50-100GB/s; we've had good success making Loki queries fast, although many factors, from config to resources to storage, affect performance.

I've been testing Loki for a fair amount of time without great success. I cannot get above 1GB/s despite scaling the cluster vertically and horizontally. I've poked around in the configuration to see what impacts performance, and I've tried with and without memcached or redis, which ended in all ingesters going OOM. My current conclusion is that tuning Loki is not straightforward, and I'm a bit lost. I understand it can be hard since, as you pointed out, not all storage is created equal. On my end I'm using an S3-compatible store, the same one I'm using with Mimir, and my Mimir clusters are working like a charm.

@slim-bean would you be able to recommend a production-ready configuration with sensible settings for read-path performance, along with capacity-planning guidance, similar to what exists for Mimir?

wilfriedroset avatar Oct 02 '22 13:10 wilfriedroset

I am hitting the same issue.

I have ~1TB of logs per day, about 8,000,000 lines per minute.

My cluster info: 3 × EC2 (16 cores + 64GB each); storage: AWS S3.

  • ingester: 3
  • querier: 3
  • query-frontend: 3
  • query-scheduler: 3
  • gateway: 3

When I query about 3 hours of logs, my ingesters get OOM-killed. I don't know how much memory I need. Could someone share your cluster info and configuration? Thanks.

markx1916 avatar Oct 29 '22 06:10 markx1916