
Logs slipping through alert timing

preritdas opened this issue 1 year ago • 8 comments

Description

TLDR: It seems to me that because 1-minute alerts actually run every ~66 seconds, there's a ~6-second window wherein logs are uncaught.

Setup

My goal was to create an alert that runs once, and only once, when a matching log arises (i.e. anything above INFO). Semantically, it makes sense to configure this as an alert that runs every minute, has one minute of history, and only alerts when it has matches.

[Screenshot of the alert configuration described above]

Observation

This alert actually runs every ~66 seconds, not every 60 seconds, which got me worried. Because the alert has 60 seconds of history at runtime, what happens if an error arises [0, 6) seconds after the previous run?

Example (a runnable sketch of this timeline follows the list):

  • Alert check runs at 1:00:00 ✅
  • Alert check runs at 1:01:06 (observability window of 1:00:06 - 1:01:06) ✅
  • Error occurs at 1:01:07 ❌
  • Alert check runs at 1:02:12 (observability window of 1:01:12 - 1:02:12) ✅
    • The error should be caught here, but isn't. 🤔
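
To make the timeline concrete, here is a minimal sketch in plain Python (using the hypothetical timestamps from the list above, not real data) showing that no check's window ever contains the error:

from datetime import datetime, timedelta

# Each check runs ~66 seconds after the previous one but only looks back 60 seconds.
checks = [
    datetime(2024, 8, 21, 1, 0, 0),
    datetime(2024, 8, 21, 1, 1, 6),
    datetime(2024, 8, 21, 1, 2, 12),
]
history = timedelta(seconds=60)
error_time = datetime(2024, 8, 21, 1, 1, 7)  # fired 1 second after the second check

for run in checks:
    window_start = run - history
    caught = window_start <= error_time <= run
    print(f"check at {run:%H:%M:%S} covers [{window_start:%H:%M:%S}, {run:%H:%M:%S}]: "
          + ("caught" if caught else "missed"))

# All three checks print "missed": the error lands after the second window ends
# (01:01:06) and before the third window starts (01:01:12).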

An example in real life: here are two alert check runs. Note the offset: the first happens at 18:03:39, the second at 18:04:46, which is 67 seconds later, i.e. 7 seconds more than the expected minute.

[Screenshot of the two alert check runs]

You probably see where I'm going with this. Both checks were empty, and no alert was triggered. But... I sent a Logfire error from client code 5 seconds after the first check ran, at 18:03:44.

[Screenshot of the error logged at 18:03:44]

Questions

  1. Am I missing something here? Please let me know if I'm ignorantly overlooking the answer.
  2. Is there a better way to achieve the logging alert strategy I'm after? (Explained in Setup)
  3. Why do the alert runs have this ~6 second offset?

Currently, as a workaround, I'm alerting when the query has matches, every minute, with a 2-minute lookback, to ensure no logs are missed in the manner this issue describes. The downside is that I'm alerted about most matches twice.

Happy to elaborate on anything further!

Python, Logfire & OS Versions, related packages (not required)

No response

preritdas avatar Aug 22 '24 01:08 preritdas

Hey team, is there any update here? @alexmojaki @dmontagu

preritdas avatar Sep 25 '24 04:09 preritdas

Sorry for the lack of reply on this. We've discussed it at length internally, and it's a priority for us.

We'll get back to you later today with more details on our plan.

samuelcolvin avatar Sep 25 '24 08:09 samuelcolvin

Thank you! Please let me know if I can be helpful in any way.

preritdas avatar Sep 25 '24 16:09 preritdas

Hi @preritdas, thanks for your patience. I'm going to write up a more detailed summary of our current alerting system and how we are thinking about evolving that, but in the interest of addressing your issue in the short term, I just want to point out some functionality you may not be aware of:

When you configure an alert, you have three options for when you want to be notified:

[Screenshot of the "Notify when" options]

  1. "Query has matches" — this means a notification will be sent any time the query has matches, and is probably what you are currently seeing
  2. "Query matches change" — this means a notification will be sent any time the output of running the query changes. So if the query returns the same results for multiple runs in a row, no new notification is sent.
  3. "Query-has-matches changes" — this means a notification will be sent any time the query goes from returning a nonzero number of rows to returning zero rows, or vice versa. Changes to the specific contents of the rows will not trigger new notifications, you'll just be notified once the query starts returning zero rows.

Choices 2 and 3 correspond somewhat more closely to an incident-based notification scheme, where you are not constantly notified on an ongoing basis, just when something changes.
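
To make the three modes concrete, here is a minimal sketch that replays a sequence of query results and shows which runs would notify under each setting. This is an assumed model of the behaviour described above, not Logfire's actual implementation; the first run is compared against an empty previous result.

# Each entry is the set of matching row IDs returned by one alert run.
runs = [set(), {"err-1"}, {"err-1"}, {"err-1", "err-2"}, set(), set()]

def notifications(results, mode):
    notify, previous = [], set()
    for current in results:
        if mode == "query has matches":
            notify.append(bool(current))                    # any run with rows
        elif mode == "query matches change":
            notify.append(current != previous)              # the returned rows differ
        elif mode == "query-has-matches changes":
            notify.append(bool(current) != bool(previous))  # zero <-> nonzero flip
        previous = current
    return notify

for mode in ("query has matches", "query matches change", "query-has-matches changes"):
    print(f"{mode:27}", notifications(runs, mode))

# query has matches           [False, True, True, True, False, False]
# query matches change        [False, True, False, True, True, False]
# query-has-matches changes   [False, True, False, False, True, False]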

Based on what you've described, would it work for you to change to a 2-minute window (to ensure overlap) but then also change the "Notify when" setting to one of the other options suggested above? If you have issues with the way that handles things, feedback would definitely be appreciated.

One thing I think we need to do either way is constrain the execution frequency and window settings so that there is no option you can pick that would result in logs slipping through the cracks; that is an obvious footgun and there's no good reason to be able to configure an alert that way. But I am hoping to do that as part of bigger changes to alerts that may require more iteration/feedback on design. I'll share my current notes on the issue here shortly...

dmontagu avatar Sep 25 '24 20:09 dmontagu

Okay, so regarding the bigger picture of alerts, I think there are three common scenarios that we may want to address. I'll give them the following names (and describe them in more detail below):

  • "Incidents"
  • "Event notifications"
  • "Reporting"

Our current system for alerts has influences from each of these three scenarios, but each of these scenarios has different needs. As a result, our system needs a bit of a redesign to tease these three use cases apart.

  1. The "Incidents" use case is when you want to be notified due to a threshold being exceeded, e.g., if the average response time for an endpoint increases dramatically, or the percentage of requests to a given endpoint returning a 500 status code goes above 1%, or CPU usage on your server exceeds 80% of what is available. These conditions can be written as a SQL query, and you want to be notified once when the SQL query returns a value that exceeds the threshold, and not again until the query returns a value below the threshold, indicating the incident is resolved.
  2. The "Event notifications" use case is where you just want to be notified any time that a particular event happens. This could be good things like "A new user signed up", or bad things like "a line of code I thought was unreachable got reached". For some small or especially stable projects, the query might even be "select * from records where is_exception". This is generally not the right approach when you expect a high volume of occurrences of the event, but it can be a useful way to ensure that you are notified about rare events that you are especially concerned about and know you want to be notified about.
  3. The "Reporting" use case is where you want to define a query, and get a regularly scheduled notification with the results of that query. This could be something like "Show me the counts of daily active users over the last week" or "Show me the P50 and P99 latency for each of my endpoints over the past 24 hours".

Our current alerts configuration UI is implicitly trying to handle each of these scenarios, with different shortcomings.

  • For "Incidents", which our current implementation arguably handles best today, you still need to write a query where a threshold is exceeded if and only if the query returns any rows. Any query that returns a value as a constant can be converted to return 0 vs. more-than-zero by way of CTEs:
SELECT AVG(end_timestamp - start_timestamp) AS avg_duration
FROM records
WHERE attributes ? 'http.status_code'

becomes

WITH base_query AS (
	SELECT AVG(end_timestamp - start_timestamp) AS avg_duration
	FROM records
	WHERE attributes ? 'http.status_code'
)
SELECT avg_duration FROM base_query WHERE avg_duration > <threshold>

I think in an ideal world you'd just be able to write the base query rather than also defining the auxiliary CTEs, and select the threshold matching logic outside of the SQL itself. And ideally you would not need to specify a query execution frequency at all, just the size of the rolling aggregation window.
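
As a rough sketch of that idea (the helper below is hypothetical, not an existing Logfire API): the UI could take the base query plus a threshold comparison chosen outside the SQL, and generate the CTE wrapper itself.

def wrap_with_threshold(base_query: str, value_column: str, comparison: str) -> str:
    # Hypothetical helper: produce the zero-vs-more-than-zero-rows query the
    # alert runner would actually evaluate, from the user's base query plus
    # threshold settings chosen in the UI.
    return (
        "WITH base_query AS (\n"
        f"{base_query}\n"
        ")\n"
        f"SELECT {value_column} FROM base_query WHERE {value_column} {comparison}"
    )

print(wrap_with_threshold(
    base_query=(
        "    SELECT AVG(end_timestamp - start_timestamp) AS avg_duration\n"
        "    FROM records\n"
        "    WHERE attributes ? 'http.status_code'"
    ),
    value_column="avg_duration",
    comparison="> INTERVAL '2 seconds'",  # the threshold expression is dialect-dependent
))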

  • Under our current implementation, "Event notifications" has the shortcoming you've noted in this issue: individual rows from the records table may be considered multiple times, or not at all, depending on exactly when they are ingested. In an ideal world, we would guarantee that each row is considered exactly once. It seems reasonable to me that for this scenario we could restrict queries to a WHERE clause for SELECT * FROM records WHERE <filter>, similar to the live view, and possibly even use the live view as a way to create/edit/review these queries.

  • "Reporting" is probably least well supported by our current implementation given we have a non-editable execution frequency of one minute for all alert queries, and for reporting use cases, you probably: want to be notified about the query results very infrequently, e.g. once a week; and want a notification regardless of the number of "matching" rows, or whether they have changed. And in many cases you probably want a chart visualization, not just a table of results.

I'll note that in both the Incidents and Reporting use cases, it's probably desirable to design the alert with a chart in mind, and the ideal implementation might link to some version of a dashboard.


I think one of the more important decisions before us is to what extent we focus on the "Incidents" use case vs. the "Event notifications" use case.

I see a clear path to a technical implementation that allows us to handle "Event notifications" where we consider each row in the records table exactly once when determining whether to send a notification, so nothing slips through the cracks and we don't send two notifications with a single span as their source (due to multiple scheduled queries). However, given the fact that the "Event notifications"-style alert can generally be transformed (by using CTEs or otherwise) into a threshold-based "Incident"-style alert (using the threshold count(*) > 0 on the base query used for the event notifications), it may be enough to just make the Incident/threshold-based alerting better and rely on that for the other use cases.

Feedback about your priorities here is definitely welcome; I think eventually we'll share more detail in a dedicated issue about this, but I do want to put a bit more effort into planning before doing that.


I'll just add one final note about implementation details that may be relevant. Today, when running scheduled alert queries, we apply the time window filter to the start_timestamp column of the records table. For the "Incident" use case, it may be preferable for various reasons to use the end_timestamp rather than start_timestamp (especially since that will be the time that, e.g., an exception is recorded). For the "Event notifications" use case, it may be preferable to use the created_at timestamp, since we don't care about aggregation on the timescale that the data was produced and just want to ensure we consider each row exactly once.
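
For illustration, here is a small sketch of applying the same window filter to a different timestamp column per use case; the column names are the ones mentioned above, while the helper itself is hypothetical.

# Which timestamp column the window filter targets, per the note above.
TIMESTAMP_COLUMN = {
    "current behaviour": "start_timestamp",
    "incidents": "end_timestamp",         # e.g. the time an exception is recorded
    "event notifications": "created_at",  # ingestion time, so each row is seen exactly once
}

def windowed_query(use_case: str, where_clause: str, window_start: str, window_end: str) -> str:
    column = TIMESTAMP_COLUMN[use_case]
    return (
        f"SELECT * FROM records WHERE ({where_clause}) "
        f"AND {column} >= '{window_start}' AND {column} < '{window_end}'"
    )

print(windowed_query("event notifications", "is_exception",
                     "2024-09-25 21:00:00", "2024-09-25 21:01:00"))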

As noted, we'll provide more detail about the implementation as we make decisions going forward. If you have any opinions, or use cases that you want to ensure are handled well, please don't hesitate to share them!

dmontagu avatar Sep 25 '24 21:09 dmontagu

Thank you for the incredibly detailed response, @dmontagu. This makes a lot of sense. The only issue I have with your first suggestion (changing to "query matches change" or "query-has-matches changes") is that (correct me if I'm wrong) we'd still get multiple notifications for one event: first when the log occurs (matches go from zero to nonzero), and second when the log ages out of the window (matches go back to zero).

The most straightforward approach, to me, is ensuring that the regular schedule of log checks is either exactly 60 seconds, or, if any run is more than 60 seconds after the prior one, that the window is adjusted to include the offset (i.e. if a check runs 67 seconds after the prior one, the window covers n + 7 seconds of logs).
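
A minimal sketch of that suggestion, assuming the scheduler knows when the previous check ran (the scheduling internals here are assumed, not Logfire's actual code):

from datetime import datetime, timedelta
from typing import Optional, Tuple

def window_for_run(previous_run: Optional[datetime], now: datetime,
                   nominal: timedelta = timedelta(seconds=60)) -> Tuple[datetime, datetime]:
    # Start the window at the previous run, however long ago that was, so a
    # check firing 67 seconds after the prior one gets a 67-second window
    # instead of a fixed 60-second one, leaving no uncovered gap.
    start = previous_run if previous_run is not None else now - nominal
    return start, now

previous = datetime(2024, 8, 21, 18, 3, 39)
now = datetime(2024, 8, 21, 18, 4, 46)    # 67 seconds after the previous check
print(window_for_run(previous, now))      # the window spans the full 67 seconds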

Always excited to see updates, including today's general release!

preritdas avatar Oct 01 '24 21:10 preritdas

Still fleshing out the plan for improvements to alerts, but one of the shorter-term things I'm planning to do is make the time intervals more exact and tolerant to hiccups in ingest. I don't want to promise anything, but to give you a sense of the timeline I have in mind: I'm hoping to make some meaningful improvements here in the next 1-2 weeks. We definitely have plans for more ambitious improvements to alerting/notifications that will take longer than that, but the specific issues called out in this issue should be fixable on a shorter timeline.

dmontagu avatar Oct 02 '24 22:10 dmontagu

Awesome! Don't mean to apply any pressure; what we have is fantastic already. Looking forward to it.

preritdas avatar Oct 02 '24 23:10 preritdas

Hi @dmontagu, just checking to see if there's any update here? Thanks!

preritdas avatar Dec 02 '24 02:12 preritdas

Ah sorry, I forgot to update this, but things are now largely resolved. The alert runs now cover contiguous time ranges without gaps, and have for a few weeks. Please let me know if you are still noticing any issues with alert execution, but it should now be safe to use 1 minute as both the window and the frequency without risking gaps.

I’ll note that we have a saved-search alert implementation we are still working on, and now using internally, that should address some of the other points I described above. Hoping to release that feature more publicly soon!

I’m going to close the issue because the “falling through the gaps” thing is resolved, but please feel free to reopen if there are other related issues that aren’t resolved or if you have questions.

dmontagu avatar Dec 02 '24 02:12 dmontagu

Thanks for the update. That's excellent! Adjusted the alert settings and they're not duplicating anymore. Appreciate your support.

preritdas avatar Dec 02 '24 03:12 preritdas

Two additional implementation details I will just explicitly call out:

  • We modified the behavior so that records are included in buckets based on their end_timestamp rather than their start_timestamp. This increases the likelihood that long-lived spans are included in some window, since bucketing by start_timestamp meant a long span could close only after every window containing its start_timestamp had already been evaluated.
  • When running alert queries, we generate windows with a "watermark" duration of ten seconds, i.e. a window only includes end_timestamps that are at least 10 seconds old at the time the query runs. While this means there is a delay of at least ~10 seconds between when a match is received and when a notification is emitted, it helps ensure we don't miss records in a given window due to processing/sending/etc. delays (client-side or otherwise). (A sketch of this windowing follows the list.)
    • (Note that this is primarily important for proper handling of aggregations; our upcoming "saved search" alerts do not have the same requirements around bucket consistency, so we should be able to reduce this extra notification latency for cases where saved search notifications are adequate.)
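
Here is a minimal sketch of the windowing described in the two points above: contiguous windows filtered on end_timestamp, with a 10-second watermark. This is an assumed model of the described behaviour, not the actual implementation.

from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

WATERMARK = timedelta(seconds=10)

def next_window(previous_end: datetime, now: datetime) -> Optional[Tuple[datetime, datetime]]:
    # Windows are contiguous: each one starts where the previous one ended and
    # extends only up to `now - WATERMARK`, so records whose end_timestamp is
    # still in flight are picked up by the next run rather than missed.
    window_end = now - WATERMARK
    if window_end <= previous_end:
        return None  # nothing old enough to evaluate yet
    return previous_end, window_end

now = datetime.now(timezone.utc)
previous_end = now - timedelta(seconds=66)
print(next_window(previous_end, now))  # covers (now - 66s, now - 10s], filtered on end_timestamp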

dmontagu avatar Dec 02 '24 21:12 dmontagu