Sending logs via systemd-run/logger/systemd-cat drops a small percentage of the messages
To thoroughly test our Loki deployment we wrote a script that does an end-to-end test through this path (a rough sketch follows the list):
- Write a unique message to the Journal.
- Promtail scrapes the Journal and ships it to Loki.
- Loki receives the log and stores it.
- A query to Loki returns the unique message written to the Journal.
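A rough sketch of what such a check looks like (the sleep interval and the logcli setup are placeholders, not our exact script):

MSG="e2e-test-$(date +%s)-$RANDOM"         # unique marker
echo "$MSG" | systemd-cat -t e2e-test      # 1. write to the Journal
sleep 30                                   # 2./3. give Promtail and Loki time to ship and store it
logcli query -o raw "{host=\"$(hostname)\"} |= \"$MSG\"" \
  | grep -q "$MSG" && echo "OK" || echo "MISSING: $MSG"   # 4. query Loki for the unique message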
We initially wrote it with the following approaches (trying them out one by one):
- logger --journal <message-file>
- systemd-run echo <message>
- echo <message> | systemd-cat
All of them turn out to be unreliable: they work most of the time, but roughly one in ten messages is dropped. When the messages are written by a long-lived systemd service instead, they arrive 100% of the time.
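For comparison, the kind of long-lived unit that never drops messages for us looks roughly like this (the unit name and interval are arbitrary); its stdout goes to the Journal and takes the same Promtail/Loki path as the commands above:

cat > /etc/systemd/system/e2e-logger.service <<'EOF'
[Unit]
Description=Long-lived test logger
[Service]
ExecStart=/bin/sh -c 'while true; do echo "foo from long-lived service"; sleep 60; done'
EOF
systemctl daemon-reload && systemctl start e2e-logger.service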
Interestingly, the messages sent through logger and systemd-cat do not show up in promtail -log.level debug, while the systemd-run ones do. All of them, however, show up in promtail -dry-run. And of course, most of them do make it to Loki.
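For reference, this is how we compared the two modes (the config path is a placeholder):

promtail -config.file /etc/promtail/config.yaml -dry-run            # prints every entry Promtail would ship
promtail -config.file /etc/promtail/config.yaml -log.level=debug    # normal run; the logger/systemd-cat messages never appear in this log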
To Reproduce
Steps to reproduce the behavior (see configs further down):
- Run Loki.
- Run Promtail on another host.
- Run the following query on Loki:
logcli query -o raw '{host="hostname"} |= "foo"' --tail
- Run any of the following commands on the host:
echo "MESSAGE=foo" > msg; logger --journal msg
echo foo | systemd-cat
systemd-run echo foo
- Notice that roughly 1 in 10 messages never reaches the query output (the loop sketched below these steps quantifies the drop rate).
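A loop like the following makes the drop rate easy to measure (the counts, pauses, and host label are placeholders):

for i in $(seq 1 100); do echo "drop-test $i" | systemd-cat; sleep 1; done   # send 100 unique messages
sleep 60                                                                     # wait past the flush/backoff windows
logcli query -o raw --limit 200 '{host="hostname"} |= "drop-test"' | wc -l   # expect 100; with the bug, roughly 1 in 10 is missing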
Expected behavior
No messages are dropped.
Environment:
- Infrastructure: VMs (Ubuntu), Promtail 2.3, Loki 2.4 (from main-04ab08d).
- Deployment tool: Ansible
Promtail Config
server:
  http_listen_port: 9080
  http_listen_address: 127.0.0.1
  grpc_listen_port: 9085
  grpc_listen_address: 127.0.0.1
  log_level: info
positions:
  filename: /var/lib/promtail/positions.yaml
clients:
  - url: https://1.2.3.4/distributor/loki/api/v1/push
    # 5 10 20 40 80 160 320 320 ... 320
    # For a total time of 11195s ~185m before logs are lost
    # 5 + 10 + 20 + 40 + 80 + 160 + 320 + 33 * 320
    backoff_config:
      min_period: 5s
      max_period: 320s
      max_retries: 40
    tls_config:
      ca_file: /etc/pki/realms/host/CA.crt
      cert_file: /etc/pki/realms/host/default.crt
      key_file: /etc/pki/realms/host/default.key
      server_name: 1.2.3.4
scrape_configs:
  - job_name: journal
    journal:
      json: true
      max_age: 12h
      labels:
        env: test
        component: internal
      path: /var/log/journal
    relabel_configs:
      - source_labels: ['__journal__hostname']
        target_label: 'host'
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'
    pipeline_stages:
      - match:
          selector: '{unit!~""}'
          stages:
            - json:
                expressions:
                  {'cmdline': '_CMDLINE', 'facility': 'SYSLOG_FACILITY', 'message': 'MESSAGE', 'priority': 'PRIORITY', 'unit': '_SYSTEMD_UNIT'}
            - pack:
                labels:
                  ['cmdline', 'facility', 'message', 'priority', 'unit']
            - labeldrop:
                - unit
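For context on the pipeline above: the pack stage stores each entry as a JSON object with the listed fields as keys and the original line under _entry, which is why the queries further down unpack it with | json. Roughly (field values here are illustrative, not captured output):

logcli query -o raw '{host="lab-test1-lpg1"} |= "foo"'
# -> {"cmdline":"...","facility":"...","message":"foo","priority":"6","unit":"...","_entry":"..."}
logcli query -o raw '{host="lab-test1-lpg1"} |= "foo" | json | line_format "{{ .message }}"'
# -> foo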
Loki Config
auth_enabled: false
server:
  log_level: info
distributor:
  ring:
    kvstore:
      store: consul
      consul:
        host: localhost:8500
ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: consul
        consul:
          host: localhost:8500
      replication_factor: 1
    final_sleep: 0s
  max_transfer_retries: 0
  flush_op_timeout: 60s
  flush_check_period: 30s
  concurrent_flushes: 8
  chunk_target_size: 4194304
  max_chunk_age: 2h
  chunk_idle_period: 2h
  wal:
    enabled: true
    # dir: configured in the service file
schema_config:
  configs:
    - from: 2021-09-03
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
storage_config:
  boltdb_shipper:
    active_index_directory: /srv/logs/loki/index
    cache_location: /srv/logs/loki/index_cache
    shared_store: filesystem
  filesystem:
    directory: /srv/loki/chunks
compactor:
  working_directory: /srv/loki/compactor
  shared_store: filesystem
  compaction_interval: 5m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  unordered_writes: true
  retention_period: "1d"
  retention_stream:
    - selector: '{host="test"}'
      priority: 1
      period: 12h
chunk_store_config:
  max_look_back_period: 0s
table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
frontend_worker:
  frontend_address: '127.0.0.1:9091'
frontend:
  tail_proxy_url: http://127.0.0.1:3112
querier:
  query_timeout: 2m
  query_ingesters_within: 168h
query_range:
  # make queries more cache-able by aligning them with their step intervals
  align_queries_with_step: true
  max_retries: 5
  # parallelize queries in 15min intervals
  split_queries_by_interval: 15m
  cache_results: true
  parallelise_shardable_queries: true
  results_cache:
    cache:
      # We're going to use the in-process "FIFO" cache
      enable_fifocache: true
      fifocache:
        size: 1024
        validity: 24h
Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.
We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.
Stalebots are also emotionless and cruel and can close issues which are still very relevant.
If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.
We regularly sort for closed issues which have a stale label, sorted by thumbs up.
We may also:
- Mark issues as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
- Add a keepalive label to silence the stalebot if the issue is very common/popular/important.
We are doing our best to respond, organize, and prioritize all issues, but it can be a challenging task; our sincere apologies if you find yourself at the mercy of the stalebot.
This bug still exists, though we work around it by not using systemd-run/logger/systemd-cat.
We can still reproduce this bug.
@href is this still an issue in the current release?
Kind of, but now I'm not sure if it's the same issue we've seen before, or if I just discovered a lead:
I generated the messages as described above, and they all show up in Grafana, but never together: across multiple refreshes of the query I usually get 13 out of 15 messages, and sometimes only the other 2 out of 15. From the timestamps I can tell that, taken as a whole, all of the messages are in Loki.
However, they never show up together. In Grafana I see either the 13 or the 2 messages, depending on the refresh. In the Loki CLI I only ever see the 13 messages.
We run two querier processes, so maybe there's some kind of disagreement on what they are seeing? Could this be a configuration issue? We didn't have Grafana at the time, so it could still be the same issue, as Loki always seems to show the same set of messages (so my original issue holds).
But in Grafana I can now tell that the messages do make it to Loki. It does not seem to be a Promtail issue.
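To check the two-querier hypothesis, something like this could compare what each querier returns when hit directly (addresses and ports are placeholders; requires jq):

for q in http://querier-1:3100 http://querier-2:3100; do
  printf '%s: ' "$q"
  curl -sG "$q/loki/api/v1/query_range" \
    --data-urlencode 'query={host="lab-test1-lpg1"} |= "foo"' \
    --data-urlencode 'limit=100' \
    | jq '[.data.result[].values | length] | add'    # number of matching log lines per querier
done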
So via curl or logcli you always see all the expected log results? Could you share the exact query you're running?
We may need to file a Grafana issue.
I saw the issue in logcli as well:
- In Grafana I would usually see 13/15 messages, then the other 2/15 messages (mostly 13/15).
- In Loki I would only see 13/15 messages.
Today, when I look for the messages from yesterday, only 13/15 show up, even in Grafana. The 2/15 messages that complement the set seem to be gone.
I ran another test just now, and I saw the following behavior in both logcli and Grafana. This time it's either 25/25 or 0/25:
# After having sent `echo foo | systemd-cat` on `lab-test1-lpg1` 25 times (spaced out).
root@lab-log1-lpg1:~# for i in {1..20}; do logcli query -qo raw '{host="lab-test1-lpg1"} |= "foo" | json | line_format `{{ .message }}`' | wc -l; done
25
25
25
0
0
0
0
0
0
0
0
25
25
25
25
25
25
25
25
25
In Grafana it's the same.
May or may not be relevant, but I see similar message-dropping behavior in a NixOS systemd script scheduled by a timer that uses systemd-cat or logger (I use them to set a priority on the messages). A plain echo ... never fails.
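For reference, the priority is set roughly like this (the tag and priority values are just examples):

echo "timer finished" | systemd-cat -t my-timer -p warning
logger -p user.warning -t my-timer "timer finished"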