python-systemd
python-systemd copied to clipboard
When monitoring /var/log/journal/remote events are missed when two remote hosts send messages at the same time (updates)
Context
- 1 x log server (VM on Apple M2)
- 2 x log client (VM on Apple M2)
- Using systemd-journal-remote on server (http)
- Using systemd-journal-upload on clients (http)
Testing Method We first cleanup the server with:
rm /var/log/journal/remote/*
On each client we run (side by side / simultaneously:
for x in {1..10000}; do echo "EVENT $x" | logger ; done
So we expect 20,000 records on the server, which we validate:
root@log-server:/home/ansible# journalctl -D /var/log/journal/remote/ --no-pager | grep -v Boot | wc -l
20000
I've tested this many times to confirm that the systemd-journal-remote/upload mechanism is OK.
The failing code
import select
from systemd import journal
j = journal.Reader(path="/var/log/journal/remote")
j.seek_tail()
j.get_previous() # prevent wrap around
p = select.poll()
p.register(j, j.get_events())
print("Starting....")
event_counter = 0
while p.poll():
for x in j:
event_counter += 1
if j.process() != journal.APPEND:
print("...WAITING....")
else:
print(f"Event counter: {event_counter}")
First of all, I'm not sure if this code is right. If I've made a glaring mistake, I apologise in advance for this report.
That said, I get correct results if I run my '10K log generator command' on one client at a time.
Event counter: 9915
Event counter: 9933
Event counter: 9945
Event counter: 9950
Event counter: 9987
Event counter: 10000
If I run the command at the same time on each client at once, I would expect 20K records, but it gets nowhere near that:
Event counter: 10228
Event counter: 10228
Event counter: 10228
Event counter: 10228
Event counter: 10228
Event counter: 10228
Event counter: 10228
In this example I'm missing almost half of all expected events. Yet I can confirm on the log server with the 'journalctl' command that there are actually 20K log records stored on the server.
Even if I test with just 1K events per client and run that simultaneously it's also clearly visible:
Event counter: 1096
Event counter: 1141
Event counter: 1215
Event counter: 1430
Happy to run any additional tests if that helps. If I made a huge mistake, please let me know.
Update 1 I've been testing this issue on virtual machines on my mac. I've also a few physical x86 micro pc's and the issue is the same. Rate-limiting the logging with sleep .01 doesn't impact the result.
Question, is python-systemd built to handle remote logging from multiple clients through polling?