python-systemd icon indicating copy to clipboard operation
python-systemd copied to clipboard

When monitoring /var/log/journal/remote events are missed when two remote hosts send messages at the same time (updates)

Open louwrentius opened this issue 7 months ago • 0 comments

Context

  • 1 x log server (VM on Apple M2)
  • 2 x log client (VM on Apple M2)
  • Using systemd-journal-remote on server (http)
  • Using systemd-journal-upload on clients (http)

Testing Method We first cleanup the server with:

rm /var/log/journal/remote/*

On each client we run (side by side / simultaneously:

for x in {1..10000}; do echo "EVENT $x" | logger ; done

So we expect 20,000 records on the server, which we validate:

root@log-server:/home/ansible# journalctl -D /var/log/journal/remote/ --no-pager | grep -v Boot | wc -l
20000

I've tested this many times to confirm that the systemd-journal-remote/upload mechanism is OK.

The failing code

import select
from systemd import journal

 j = journal.Reader(path="/var/log/journal/remote")
 j.seek_tail()
 j.get_previous() # prevent wrap around

 p = select.poll()
 p.register(j, j.get_events())

 print("Starting....")
 event_counter = 0
 while p.poll():
      for x in j:
          event_counter += 1
      if j.process() != journal.APPEND:
          print("...WAITING....")
      else:
          print(f"Event counter: {event_counter}")

First of all, I'm not sure if this code is right. If I've made a glaring mistake, I apologise in advance for this report.

That said, I get correct results if I run my '10K log generator command' on one client at a time.

Event counter: 9915
Event counter: 9933
Event counter: 9945
Event counter: 9950
Event counter: 9987
Event counter: 10000

If I run the command at the same time on each client at once, I would expect 20K records, but it gets nowhere near that:

Event counter: 10228
Event counter: 10228
Event counter: 10228
Event counter: 10228
Event counter: 10228
Event counter: 10228
Event counter: 10228

In this example I'm missing almost half of all expected events. Yet I can confirm on the log server with the 'journalctl' command that there are actually 20K log records stored on the server.

Even if I test with just 1K events per client and run that simultaneously it's also clearly visible:

Event counter: 1096
Event counter: 1141
Event counter: 1215
Event counter: 1430

Happy to run any additional tests if that helps. If I made a huge mistake, please let me know.

Update 1 I've been testing this issue on virtual machines on my mac. I've also a few physical x86 micro pc's and the issue is the same. Rate-limiting the logging with sleep .01 doesn't impact the result.

Question, is python-systemd built to handle remote logging from multiple clients through polling?

louwrentius avatar Jul 16 '24 22:07 louwrentius