
Missing logs in Loki

Open · pvlltvk opened this issue 3 years ago • 10 comments

Describe the bug
I use Fluent Bit to send logs from my Kubernetes nodes to both Loki and New Relic. I found that some logs I could find in New Relic were missing in Loki. I can also find those missing logs on my Kubernetes nodes or via kubectl logs.

To Reproduce
Steps to reproduce the behavior:

  1. Started Loki Distributed v2.4.1 with S3 and BoltDB Shipper backend
  2. Started Fluent Bit v1.8.6

Expected behavior
I expect all logs to be available in Loki.

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: official Helm charts for Fluent Bit (chart version 0.17.0) and Loki Distributed (chart version 0.39.2)

Screenshots, Promtail config, or terminal output
My Loki config:

  auth_enabled: false
  server:
    http_listen_port: 3100
    grpc_server_max_recv_msg_size: 104857600
    grpc_server_max_send_msg_size: 104857600
  distributor:
    ring:
      kvstore:
        store: memberlist
  memberlist:
    join_members:
      - loki-loki-distributed-memberlist
  ingester:
    lifecycler:
      ring:
        kvstore:
          store: memberlist
        replication_factor: 3
    chunk_idle_period: 1h
    chunk_target_size: 1536000
    max_chunk_age: 1h
    max_transfer_retries: 0
    wal:
      enabled: true
      dir: /var/loki/wal
      replay_memory_ceiling: 2GB
  limits_config:
    enforce_metric_name: false
    reject_old_samples: true
    reject_old_samples_max_age: 168h
    max_cache_freshness_per_query: 10m
    retention_period: 2232h
    retention_stream:
      - selector: '{cluster_name="staging"}'
        priority: 1
        period: 168h
  schema_config:
    configs:
      - from: 2021-11-18
        store: boltdb-shipper
        object_store: aws
        schema: v11
        index:
          prefix: loki_index_
          period: 24h
  storage_config:
    aws:
      bucketnames: loki-logging-data
      endpoint: https://storage.endpoint.net
      region: eu-central-1
      access_key_id: access_key_id
      secret_access_key: secret_access_key
      insecure: false
    boltdb_shipper:
      active_index_directory: /var/loki/index
      shared_store: s3
      cache_location: /var/loki/cache
      index_gateway_client:
        server_address: dns:///loki-loki-distributed-index-gateway:9095
    index_queries_cache_config:
      redis:
        endpoint: loki-redis-node-0.loki-redis-headless:26379,loki-redis-node-1.loki-redis-headless:26379,loki-redis-node-2.loki-redis-headless:26379
        master_name: loki
        expiration: 6h
        db: 0
        password: password
        timeout: 1000ms
  chunk_store_config:
    chunk_cache_config:
      redis:
        endpoint: loki-redis-node-0.loki-redis-headless:26379,loki-redis-node-1.loki-redis-headless:26379,loki-redis-node-2.loki-redis-headless:26379
        master_name: loki
        expiration: 6h
        db: 1
        password: password
        timeout: 1000ms
  querier:
    query_timeout: 5m
    max_concurrent: 48
  query_range:
    # make queries more cache-able by aligning them with their step intervals
    align_queries_with_step: true
    max_retries: 5
    # parallelize queries in 15min intervals
    split_queries_by_interval: 15m
    cache_results: true
    results_cache:
      cache:
        redis:
          endpoint: loki-redis-node-0.loki-redis-headless:26379,loki-redis-node-1.loki-redis-headless:26379,loki-redis-node-2.loki-redis-headless:26379
          master_name: loki
          expiration: 6h
          db: 2
          password: password
          timeout: 1000ms
  frontend_worker:
    frontend_address: loki-loki-distributed-query-frontend-grpclb:9095
    parallelism: 12
  frontend:
    log_queries_longer_than: 5s
    compress_responses: true
    tail_proxy_url: loki-loki-distributed-querier:3100
  compactor:
    retention_enabled: true

My Fluent Bit config:

  [SERVICE]
      Flush 1
      Daemon Off
      Log_Level warn
      Parsers_File parsers.conf
      Parsers_File custom_parsers.conf
      HTTP_Server On
      HTTP_Listen 0.0.0.0
      HTTP_Port 2020
      storage.path /var/log/fluent-storage/
      storage.sync normal
      storage.checksum off
      storage.backlog.mem_limit 16M
      storage.metrics on
  [INPUT]
      Name tail
      Path /var/log/containers/*.log
      Parser docker
      Tag kube.*
      Skip_Long_Lines On
      Mem_Buf_Limit 64M
      storage.type  filesystem
  [INPUT]
      Name systemd
      Tag host.*
      Read_From_Tail On
      Mem_Buf_Limit 16M
      storage.type  filesystem
  [FILTER]
      Name record_modifier
      Match *
      Record cluster_name infra
      Record environment infra
  [FILTER]
      Name record_modifier
      Match host.*
      Record log_type system
  [FILTER]
      Name record_modifier
      Match kube.*
      Record log_type kubernetes
  [FILTER]
      Name kubernetes
      Match kube.*
      Merge_Log On
      Keep_Log Off
      K8S-Logging.Parser On
      K8S-Logging.Exclude On
  [OUTPUT]
      Name loki
      Match kube.*
      host gateway-loki.foo.com
      port 443
      tls  on
      tls.verify on
      labels $cluster_name, $environment, $log_type, $kubernetes['namespace_name'], $kubernetes['container_name']
      storage.total_limit_size 512M
      Retry_Limit False
      workers 1
  [OUTPUT]
      Name loki
      Match host.*
      host gateway-loki.foo.com
      port 443
      tls  on
      tls.verify on
      labels $cluster_name, $environment, $log_type
      storage.total_limit_size 512M
      Retry_Limit False
      workers 1
  [OUTPUT]
      Name nrlogs
      Match *
      license_key ${API_KEY}
      storage.total_limit_size 512M
      Retry_Limit False
      workers 1
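
One detail here that can matter for missing logs: both Loki outputs combine filesystem buffering, storage.total_limit_size 512M, and Retry_Limit False (unlimited retries). If the Loki endpoint backs up, Fluent Bit keeps retrying, the filesystem buffer for that output grows, and once the 512M cap is reached the oldest buffered chunks are discarded, which shows up as logs missing only on the Loki side. A hedged sketch of one output with more buffer headroom (the value is illustrative, not a verified fix):

  [OUTPUT]
      Name loki
      Match kube.*
      host gateway-loki.foo.com
      port 443
      tls  on
      tls.verify on
      labels $cluster_name, $environment, $log_type, $kubernetes['namespace_name'], $kubernetes['container_name']
      # More headroom before the oldest buffered chunks are dropped (illustrative value)
      storage.total_limit_size 4G
      Retry_Limit False
      workers 1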

There are also some errors from the Loki ingester service:

msg="failed to flush user" err="RequestError: send request failed\ncaused by: Put \"https://loki-logging-data.storage.yandexcloud.net/fake/8c86bbf8e29cca2b%3A17dbfea5a9d%3A17dbfed6b14%3Ae86971ff\": http: server closed idle connection"
msg="failed to flush user" err="RequestCanceled: request context canceled\ncaused by: context deadline exceeded"
msg="failed to flush user" err="RequestError: send request failed\ncaused by: Put \"https://loki-logging-data.storage.yandexcloud.net/fake/81718d22e0c8b67d%3A17dc0c3a00c%3A17dc0fadcbb%3Aaa9ab989\": http: server closed idle connection"

pvlltvk · Dec 16 '21

using minio by any chance?

rlex · Feb 04 '22

Same problem, using minio.

DrissiReda · Feb 11 '22

@rlex @DrissiReda Sorry, I forgot to give an update. I don't use Minio. In my case the problem was solved by replacing Fluent Bit with Promtail.
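
For comparison, a minimal sketch of the sort of Promtail client config that takes over this role; the push URL reuses the gateway host from the Fluent Bit outputs above, and the scrape section follows the standard Kubernetes pod pattern, so treat it as an illustration rather than the exact config used:

  clients:
    - url: https://gateway-loki.foo.com/loki/api/v1/push
      external_labels:
        cluster_name: infra
        environment: infra
  positions:
    filename: /run/promtail/positions.yaml
  scrape_configs:
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        - source_labels: [__meta_kubernetes_namespace]
          target_label: namespace_name
        - source_labels: [__meta_kubernetes_pod_container_name]
          target_label: container_name
        # Point the tailer at the pod log files on the node
        - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
          separator: /
          replacement: /var/log/pods/*$1/*.log
          target_label: __path__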

pvlltvk · Feb 11 '22

Didn't you figure out the problem with Fluent Bit? Because that's what I'm using, and I can't change it since I use it for other stuff.

DrissiReda · Feb 11 '22

@DrissiReda No, I didn't. Maybe I'll test it again when I have more time.

pvlltvk · Feb 20 '22

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly review closed issues that have a stale label, sorted by thumbs-up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues, but it can be a challenging task; our sincere apologies if you find yourself at the mercy of the stalebot.

stale[bot] · Apr 17 '22

Seeing the same with the latest Loki and latest Fluent Bit, using the documented 'loki output' config, and lots and lots of missing logs.

korenlev · Apr 19 '22

We have the same situation. With Promtail we see all logs; with fluentbit-loki we are missing some. It seems to me that some streams break after some time.

Ruppsn · Apr 25 '22

see https://github.com/grafana/loki/issues/4221

irizzant · May 11 '22

I'm running Loki 2.6.1 and Fluent Bit 1.9.5 and I'm missing logs. There are no error messages; sometimes the logs are there and sometimes they aren't. I guess the workaround is to use Promtail. Unfortunately, Promtail uses a lot of CPU at times. Oh well.

My workaround is to use the fluent-bit-loki-plugin.
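
For anyone trying the same route, a hedged sketch of what that output section can look like with the Grafana-maintained Loki plugin for Fluent Bit (the plugin is loaded separately from the built-in loki output; the URL reuses the gateway from earlier in the thread and all values are illustrative):

  [OUTPUT]
      Name grafana-loki
      Match kube.*
      Url https://gateway-loki.foo.com/loki/api/v1/push
      Labels {cluster_name="infra", environment="infra", log_type="kubernetes"}
      # Promote kubernetes metadata to labels, then drop the raw key from the line
      AutoKubernetesLabels true
      RemoveKeys kubernetes
      LineFormat json
      LogLevel warn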

data-dude · Jul 26 '22

Have the same situation.

nlnjnj · Aug 18 '22

Seeing the same. Fluent Bit -> CloudWatch/Loki; CloudWatch has everything, Loki does not.

chadgeary · Oct 03 '22

Same or similar: we run the latest Loki and Fluent Bit (also Promtail) with MinIO as storage... But in our case different Loki instances return different data. Consistently the same data from the same Loki instance, though.

maxramqvist · Mar 09 '23

Seeing a similar issue. Vector --> Elasticsearch/Loki. Elasticsearch has full logs, Loki does not.

khanh96le · May 08 '23

Similar problem. I use EFK and Grafana Loki with filesystem storage. Elasticsearch has full logs, but Grafana Loki displays only 2h-old logs. Loki version: 2.6.1.

tkblack · May 25 '23

I came across this issue having a similar problem with Vector.

We have a set of hosts sending logs from just a few services, and two of those hosts are test hosts where not much is logged. It seems like the two quiet hosts disappear entirely after a while (a query with {host="foo"} has no recent hits), while the ones that generate more frequent logs remain query-able.

Looking at the errors in the original report, it seems like Loki closes idle connections after a while so that clients need to reestablish the connection after a certain period of idleness? Maybe there's a client-side setting that will periodically wake up the connection.
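
For the Fluent Bit senders in this thread, the closest client-side knobs would be the generic per-output network options; Vector would need its own equivalents. A hedged sketch (values are illustrative):

  [OUTPUT]
      Name loki
      Match kube.*
      host gateway-loki.foo.com
      port 443
      tls  on
      # Recycle keepalive connections before the server closes them as idle (illustrative values)
      net.keepalive on
      net.keepalive_idle_timeout 10
      net.keepalive_max_recycle 100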

suckatrash · Jan 04 '24

Similar problem. We use Fluent Bit to send logs to both Fluentd and Loki. Fluentd has all the logs but Loki is missing some. There are no related error logs in either Loki or Fluent Bit. Has anybody found the cause or fixed this without switching to Promtail for log collection?

satyamsundaram · Mar 04 '24