Reporting errors in Argus itself

Open hmpf opened this issue 4 years ago • 6 comments

Errors that argus can report about itself, for instance a failure to send a notification because the notification endpoint isn't answering (email server down, say), should be reported as incidents.

This means we need a SourceType "argus" and a SourceSystem representing the host argus is running on. Named "self" maybe? "me"? I suspect hostname would be tricky. We also need a function/method argus can use to write to the incidents table, with SourceType/SourceSystem locked.

(This is very nice, because we can dogfood the system using itself, triggering errors in argus in order to have incidents turn up in argus :) )

hmpf avatar Aug 03 '20 12:08 hmpf

These objects should be get_or_created early and easily and often. Maybe a management command named "setup" or "verify" or something, or get_or_created on each use of the function.

The first use of the function could be when setting up the system for the first time, a "Hello, World" incident, low severity!
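A minimal sketch of the idea so far: a reporting function whose source is locked, with the source objects get-or-created on every use. It uses an in-memory `Store` as a stand-in for the real tables. `Store`, `get_or_create_self_source`, `report_self_incident`, and the `level` field are hypothetical names for illustration, not existing Argus code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical constants; the thread only suggests "argus" and "self" as names.
SELF_SOURCE_TYPE = "argus"
SELF_SOURCE_SYSTEM = "self"


@dataclass
class Store:
    """In-memory stand-in for the source and incident tables."""
    source_types: Dict[str, str] = field(default_factory=dict)
    source_systems: Dict[Tuple[str, str], str] = field(default_factory=dict)
    incidents: List[dict] = field(default_factory=list)


def get_or_create_self_source(store: Store) -> Tuple[Tuple[str, str], bool]:
    """Idempotently ensure the self source exists; return (key, created)."""
    key = (SELF_SOURCE_TYPE, SELF_SOURCE_SYSTEM)
    created = key not in store.source_systems
    store.source_types.setdefault(SELF_SOURCE_TYPE, SELF_SOURCE_TYPE)
    store.source_systems.setdefault(key, SELF_SOURCE_SYSTEM)
    return key, created


def report_self_incident(store: Store, description: str, level: int) -> dict:
    """Record an incident; the caller cannot choose the source."""
    key, _ = get_or_create_self_source(store)
    incident = {"source": key, "description": description, "level": level}
    store.incidents.append(incident)
    return incident


store = Store()
# First use: the "Hello, World" incident. The numeric level is an assumption.
hello = report_self_incident(store, "Hello, World", level=5)
```

Because the source is looked up inside `report_self_incident`, a separate "setup" or "verify" command becomes optional rather than required.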

hmpf avatar Aug 03 '20 12:08 hmpf

There now is a way to auto-create an argus user/source/source type. What's left is to create a suitable incident every time argus complains about something in its logs.
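One way the remaining step might look, sketched with the standard `logging` module: a handler that passes every record at `ERROR` or above to a reporting callable. `SelfIncidentHandler` and the `report` callable are hypothetical; here `report` just appends to a list, where a real version would create the incident.

```python
import logging
from typing import Callable, List


class SelfIncidentHandler(logging.Handler):
    """Forward log records at or above `level` to a reporting callable."""

    def __init__(self, report: Callable[[str], None], level: int = logging.ERROR):
        super().__init__(level=level)
        self._report = report

    def emit(self, record: logging.LogRecord) -> None:
        try:
            self._report(self.format(record))
        except Exception:
            # Never let a failed self-report propagate into the caller.
            self.handleError(record)


reported: List[str] = []
logger = logging.getLogger("argus.selftest")  # hypothetical logger name
logger.propagate = False
logger.setLevel(logging.INFO)
logger.addHandler(SelfIncidentHandler(reported.append))

logger.info("routine message")                # below ERROR, not reported
logger.error("email server not answering")    # reported
```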

hmpf avatar Oct 26 '20 11:10 hmpf

This feature sounds very useful!

Still, I am a bit worried that, in certain cases, Argus might overload itself with its own error messages if this were implemented naively: an error triggers an incident, which matches a filter and is sent out by mail, which causes another error that triggers another incident, ad infinitum. Congrats on DoSing yourself and/or taking a whole Argus instance down.

So, two requirements should be met before implementation:

  1. Needs a clear, exhaustive, written spec of which errors can cause incidents and which cannot.
  2. A mechanism to prevent choking on its own incidents. Some filtering message queue, or another mechanism for rate limiting.
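As a rough illustration of the second requirement, here is a sketch of a guard that combines a sliding-window rate limit with a re-entrancy check, so an error raised while reporting an incident cannot trigger a further report. `SelfReportGuard` and its parameters are hypothetical, the sketch is not thread-safe, and it is only one of several possible mechanisms (a filtering message queue would be another).

```python
import time
from collections import deque
from typing import Callable, Deque


class SelfReportGuard:
    """Allow at most `max_reports` self-reports per `window_seconds`."""

    def __init__(self, max_reports: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._max = max_reports
        self._window = window_seconds
        self._clock = clock
        self._stamps: Deque[float] = deque()
        self._busy = False

    def report(self, create_incident: Callable[[], None]) -> bool:
        """Run `create_incident` if allowed; return whether it ran."""
        if self._busy:
            # Already inside a self-report: drop it to break the feedback loop.
            return False
        now = self._clock()
        # Forget reports that have fallen out of the window.
        while self._stamps and now - self._stamps[0] >= self._window:
            self._stamps.popleft()
        if len(self._stamps) >= self._max:
            return False
        self._stamps.append(now)
        self._busy = True
        try:
            create_incident()
        finally:
            self._busy = False
        return True
```

Dropped reports could still be written to the ordinary logs, so nothing is lost for debugging.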

katsel avatar Jan 18 '21 17:01 katsel

Removing the "good first issue" tag because of the aforementioned issues. The actual code change may be easy to make, but it seems wise to reduce the risk of a feedback loop a bit before tackling an implementation.

katsel avatar Jan 18 '21 17:01 katsel

  • Audit logs are another source of debugging information, so this one is low-priority.

Another approach would be sending a notification through Argus without creating an incident. Details to be discussed later.

katsel avatar Jan 19 '21 08:01 katsel

This came up again in #760.

johannaengland avatar Apr 18 '24 06:04 johannaengland