
Replace `github.com/coreos/go-systemd/v22/sdjournal` by `journalctl`

Open belimawr opened this issue 7 months ago • 9 comments

Proposed commit message

github.com/coreos/go-systemd/v22/sdjournal is removed and Filebeat now calls journalctl directly to read journald entries.

sdjournal relies on libsystemd to read journal files and the active system journal. However, due to a bug in systemd (https://github.com/systemd/systemd/pull/29456), libsystemd can crash during journal rotation. If the host has an affected libsystemd, Filebeat will crash with a SIGBUS during a journal rotation (this usually happens only under high load). There is no way to prevent or recover from this crash: it happens outside of our codebase, and the Go runtime turns the SIGBUS into a panic that we cannot recover from.

The bug has been fixed in systemd v255, which is not widely used yet, so on most systems Filebeat might crash when reading journal logs.

Because there is no way for Filebeat to avoid the crash, we decided to replace github.com/coreos/go-systemd/v22/sdjournal with direct calls to journalctl, reading its stdout.

On hosts where Filebeat crashes when reading from journald, journalctl can successfully read all journal files. The OpenTelemetry Collector also calls journalctl and has no issues reading the journal during rotation.

Because the reading backend has changed, some configuration options have been removed and behaviours adapted to match journalctl.

Breaking changes

Changes that will prevent the journald input from starting:

  • include_matches.match does not accept the `and` and `or` keys any more.

Changes in the journald input behaviour:

  • backoff, max_backoff, cursor_seek_fallback have been removed
  • seek now has only 3 modes: since, head and tail.
  • If there is a cursor in the registry, it will always be used and the seek option will be ignored.

Checklist

  • [x] My code follows the style guidelines of this project
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [x] I have made corresponding changes to the documentation
  • [x] I have made corresponding change to the default configuration files
  • [x] I have added tests that prove my fix is effective or that my feature works
  • [x] I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

Even though the journald input is not GA yet, which makes breaking changes acceptable, this PR introduces breaking changes that will make certain configurations not work as expected or not work at all.

Changes that will prevent the journald input from starting:

  • include_matches.match does not accept the `and` and `or` keys any more.

Changes in the journald input behaviour:

  • backoff, max_backoff, cursor_seek_fallback have been removed
  • seek now has only 3 modes: since, head and tail.
  • If there is a cursor in the registry, it will always be used and the seek option will be ignored.
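The seek and cursor rules above can be sketched as a small function that picks journalctl arguments. This is only an illustration: the function name `buildArgs`, its parameters, and the specific flag chosen for each mode (`--no-tail` for head, `--since=now` for tail, `--since=<value>` for since, `--after-cursor=<cursor>` for a stored cursor) are assumptions, not necessarily what the input does internally.

```go
package main

import "fmt"

// buildArgs returns journalctl arguments for a seek mode.
// seek is one of "since", "head" or "tail"; since is the start time used
// by the "since" mode; cursor is the cursor stored in the registry, if any.
// (Hypothetical helper; the flag mapping is an assumption.)
func buildArgs(seek, since, cursor string) ([]string, error) {
	args := []string{"--follow", "--output=json"}

	// A stored cursor always wins and the seek option is ignored.
	if cursor != "" {
		return append(args, "--after-cursor="+cursor), nil
	}

	switch seek {
	case "head":
		// Read the journal from the oldest available entry.
		return append(args, "--no-tail"), nil
	case "tail":
		// Only read entries written from now on.
		return append(args, "--since=now"), nil
	case "since":
		if since == "" {
			return nil, fmt.Errorf("seek mode %q needs a start time", seek)
		}
		return append(args, "--since="+since), nil
	default:
		return nil, fmt.Errorf("unsupported seek mode %q", seek)
	}
}

func main() {
	for _, mode := range []string{"head", "tail", "since", "cursor"} {
		args, err := buildArgs(mode, "2024-06-28 00:00:00", "")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(mode, args)
	}
}
```

In this sketch, a configuration that still uses a removed seek mode is rejected up front instead of silently falling back to another mode.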

Author's Checklist

  • [ ] Stress test the new input
  • [ ] Manual test to ensure all related issues are actually closed by this PR

How to test this PR locally

Using the following input configuration:

filebeat.inputs:
  - type: journald
    id: PR-testing

Start Filebeat and verify that the journald messages are sent to the configured output.

To manually see the journald messages and compare with what you see in Filebeat's output, you can use:

journalctl --follow -o json | jq -c --sort-keys

This will print out all fields Filebeat can read.

Related issues

  • Closes #34077
  • Closes #32782
  • Closes #30398
  • Closes #39352
  • Closes https://github.com/elastic/elastic-agent/issues/4250
  • Closes https://github.com/elastic/beats/issues/39820


belimawr, Jun 28 '24 19:06