Missing logs in Loki
Describe the bug
I use Fluent Bit to ship logs from my Kubernetes nodes to both Loki and New Relic simultaneously. I found that some logs I could find in New Relic were missing in Loki. I can also find those missing logs directly on my Kubernetes nodes or via kubectl logs.
To Reproduce
Steps to reproduce the behavior:
- Started Loki Distributed v2.4.1 with S3 and BoltDB Shipper backend
- Started Fluent Bit v1.8.6
Expected behavior
I expect all logs to be available in Loki.
Environment:
- Infrastructure: Kubernetes
- Deployment tool: official Helm charts for Fluent Bit (chart version 0.17.0) and Loki Distributed (chart version 0.39.2)
Screenshots, Promtail config, or terminal output
My Loki config:
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600

distributor:
  ring:
    kvstore:
      store: memberlist

memberlist:
  join_members:
    - loki-loki-distributed-memberlist

ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 3
  chunk_idle_period: 1h
  chunk_target_size: 1536000
  max_chunk_age: 1h
  max_transfer_retries: 0
  wal:
    enabled: true
    dir: /var/loki/wal
    replay_memory_ceiling: 2GB

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  max_cache_freshness_per_query: 10m
  retention_period: 2232h
  retention_stream:
    - selector: '{cluster_name="staging"}'
      priority: 1
      period: 168h

schema_config:
  configs:
    - from: 2021-11-18
      store: boltdb-shipper
      object_store: aws
      schema: v11
      index:
        prefix: loki_index_
        period: 24h

storage_config:
  aws:
    bucketnames: loki-logging-data
    endpoint: https://storage.endpoint.net
    region: eu-central-1
    access_key_id: access_key_id
    secret_access_key: secret_access_key
    insecure: false
  boltdb_shipper:
    active_index_directory: /var/loki/index
    shared_store: s3
    cache_location: /var/loki/cache
    index_gateway_client:
      server_address: dns:///loki-loki-distributed-index-gateway:9095
  index_queries_cache_config:
    redis:
      endpoint: loki-redis-node-0.loki-redis-headless:26379,loki-redis-node-1.loki-redis-headless:26379,loki-redis-node-2.loki-redis-headless:26379
      master_name: loki
      expiration: 6h
      db: 0
      password: password
      timeout: 1000ms

chunk_store_config:
  chunk_cache_config:
    redis:
      endpoint: loki-redis-node-0.loki-redis-headless:26379,loki-redis-node-1.loki-redis-headless:26379,loki-redis-node-2.loki-redis-headless:26379
      master_name: loki
      expiration: 6h
      db: 1
      password: password
      timeout: 1000ms

querier:
  query_timeout: 5m
  max_concurrent: 48

query_range:
  # make queries more cache-able by aligning them with their step intervals
  align_queries_with_step: true
  max_retries: 5
  # parallelize queries in 15min intervals
  split_queries_by_interval: 15m
  cache_results: true
  results_cache:
    cache:
      redis:
        endpoint: loki-redis-node-0.loki-redis-headless:26379,loki-redis-node-1.loki-redis-headless:26379,loki-redis-node-2.loki-redis-headless:26379
        master_name: loki
        expiration: 6h
        db: 2
        password: password
        timeout: 1000ms

frontend_worker:
  frontend_address: loki-loki-distributed-query-frontend-grpclb:9095
  parallelism: 12

frontend:
  log_queries_longer_than: 5s
  compress_responses: true
  tail_proxy_url: loki-loki-distributed-querier:3100

compactor:
  retention_enabled: true
My fluent-bit config:
[SERVICE]
    Flush                     1
    Daemon                    Off
    Log_Level                 warn
    Parsers_File              parsers.conf
    Parsers_File              custom_parsers.conf
    HTTP_Server               On
    HTTP_Listen               0.0.0.0
    HTTP_Port                 2020
    storage.path              /var/log/fluent-storage/
    storage.sync              normal
    storage.checksum          off
    storage.backlog.mem_limit 16M
    storage.metrics           on

[INPUT]
    Name            tail
    Path            /var/log/containers/*.log
    Parser          docker
    Tag             kube.*
    Skip_Long_Lines On
    Mem_Buf_Limit   64M
    storage.type    filesystem

[INPUT]
    Name           systemd
    Tag            host.*
    Read_From_Tail On
    Mem_Buf_Limit  16M
    storage.type   filesystem

[FILTER]
    Name   record_modifier
    Match  *
    Record cluster_name infra
    Record environment infra

[FILTER]
    Name   record_modifier
    Match  host.*
    Record log_type system

[FILTER]
    Name   record_modifier
    Match  kube.*
    Record log_type kubernetes

[FILTER]
    Name                kubernetes
    Match               kube.*
    Merge_Log           On
    Keep_Log            Off
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On

[OUTPUT]
    Name                     loki
    Match                    kube.*
    host                     gateway-loki.foo.com
    port                     443
    tls                      on
    tls.verify               on
    labels                   $cluster_name, $environment, $log_type, $kubernetes['namespace_name'], $kubernetes['container_name']
    storage.total_limit_size 512M
    Retry_Limit              False
    workers                  1

[OUTPUT]
    Name                     loki
    Match                    host.*
    host                     gateway-loki.foo.com
    port                     443
    tls                      on
    tls.verify               on
    labels                   $cluster_name, $environment, $log_type
    storage.total_limit_size 512M
    Retry_Limit              False
    workers                  1

[OUTPUT]
    Name                     nrlogs
    Match                    *
    license_key              ${API_KEY}
    storage.total_limit_size 512M
    Retry_Limit              False
    workers                  1
There are also some errors from the Loki ingester service:
msg="failed to flush user" err="RequestError: send request failed\ncaused by: Put \"https://loki-logging-data.storage.yandexcloud.net/fake/8c86bbf8e29cca2b%3A17dbfea5a9d%3A17dbfed6b14%3Ae86971ff\": http: server closed idle connection"
msg="failed to flush user" err="RequestCanceled: request context canceled\ncaused by: context deadline exceeded"
msg="failed to flush user" err="RequestError: send request failed\ncaused by: Put \"https://loki-logging-data.storage.yandexcloud.net/fake/81718d22e0c8b67d%3A17dc0c3a00c%3A17dc0fadcbb%3Aaa9ab989\": http: server closed idle connection"
using minio by any chance?
Same problem, using minio.
@rlex @DrissiReda Sorry, I forgot to give an update. I don't use Minio. In my case the problem was solved by replacing Fluent Bit with Promtail.
Didn't you figure out the problem with Fluent Bit? That's what I'm using, and I can't change it since I use it for other stuff too.
@DrissiReda No, I didn't. Maybe I'll test it again when I have more time.
Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.
We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.
Stalebots are also emotionless and cruel and can close issues which are still very relevant.
If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.
We regularly sort closed issues which have a stale label by thumbs-up.
We may also:
- Mark issues as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
- Add a keepalive label to silence the stalebot if the issue is very common/popular/important.
We are doing our best to respond, organize, and prioritize all issues, but it can be a challenging task; our sincere apologies if you find yourself at the mercy of the stalebot.
Seeing the same with the latest Loki and the latest Fluent Bit, configured per the docs' 'loki output' example, and lots and lots of missing logs.
We have the same situation. With Promtail we see all logs; with fluent-bit's loki output we are missing some. It seems to me that some streams break after some time.
see https://github.com/grafana/loki/issues/4221
I'm running Loki 2.6.1 and Fluent Bit 1.9.5 and I'm missing logs. There are no error messages. Sometimes the logs are there and sometimes they aren't. I guess the workaround is to use Promtail. Unfortunately Promtail uses a lot of CPU at times. Oh well.
My workaround is to use the fluent-bit-loki-plugin
Have the same situation.
seeing the same. fluentbit->cloudwatch/loki; cloudwatch has everything, loki does not.
Same or similar: we run the latest Loki and fluent-bit (also promtail) with MinIO as storage... But in our case different Loki instances return different data, though consistently the same data from the same Loki instance.
Seeing a similar issue. Vector --> ElasticSearch/Loki. ElasticSearch has full logs, Loki does not
Similar problem. I use EFK and Grafana Loki with the filesystem as storage. Elasticsearch has all the logs, but Grafana Loki displays only 2h-old logs. Loki version: 2.6.1
I came across this issue having a similar problem with Vector.
We have a set of hosts sending logs from just a few services, and two of those hosts are test hosts where not much is logged. It seems like the two quiet hosts disappear entirely after a while (a query for {host="foo"} returns no recent hits), while the hosts that generate logs more frequently remain queryable.
Looking at the errors in the original report, it seems like Loki closes idle connections after a while, so clients need to re-establish the connection after a period of idleness. Maybe there's a client-side setting that will periodically wake the connection up.
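For the Fluent Bit users here: Fluent Bit does expose generic per-output networking options that control keepalive reuse. A sketch of what that could look like on the loki output from the original report, assuming the net.* options available in recent Fluent Bit releases; the values are illustrative, not tested against this issue:

[OUTPUT]
    Name                       loki
    Match                      kube.*
    host                       gateway-loki.foo.com
    port                       443
    tls                        on
    tls.verify                 on
    # Illustrative: stop reusing connections that have been idle
    # longer than 10s, and recycle each keepalive connection
    # after 100 requests.
    net.keepalive              on
    net.keepalive_idle_timeout 10
    net.keepalive_max_recycle  100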
Similar problem. We use fluent-bit to send logs to both Fluentd and Loki. Fluentd has all the logs but Loki is missing some. There are no error logs related to this in either Loki or fluent-bit. Has anybody found the cause or fixed this without switching to Promtail for log collection?
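Not a fix, but one way to narrow down where the loss happens without swapping collectors (a sketch; the gateway hostname and label selector are taken from the original report's config and are assumptions for your setup) is to compare Fluent Bit's own output counters against what Loki actually ingested for the same window:

# Fluent Bit's per-output processed/retried/errored record counters
# (HTTP_Server is already enabled on port 2020 in the config above)
curl -s http://<fluent-bit-pod-ip>:2020/api/v1/metrics

# Count the lines Loki has for the last hour via an instant metric query
curl -sG https://gateway-loki.foo.com/loki/api/v1/query \
  --data-urlencode 'query=sum(count_over_time({cluster_name="infra"}[1h]))'

If Fluent Bit's counters show everything delivered with no retries while Loki's count comes up short, the problem is on the Loki write path; if Fluent Bit shows retries or errors, it's on the client side.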