
Monitor intermittent failures on long-running tasks

anze3db opened this issue 2 years ago · 14 comments

How do you use Sentry?

Sentry Saas (sentry.io)

Version

1.31.0

Steps to Reproduce

  1. Have a job that runs for over 1 hour
  2. Set up a Cron Monitor with a max runtime threshold of 300 minutes (Screenshot 2023-10-08 at 09 29 01)
  3. Use the monitor in the code:
    # Uses `from sentry_sdk.crons import monitor` and `from django.core import management`.
    @monitor(monitor_slug="daily-sync")
    def job(self):
        self.console.print("Starting daily job")
        self.console.print("Running crawler")
        with monitor(monitor_slug="daily-sync-crawler"):
            management.call_command("crawler", skip_inactive_for=3, pre_filter=True)
        self.console.print("Running indexer")
        with monitor(monitor_slug="daily-sync-indexer"):
            management.call_command("indexer")
        self.console.print("Running optimizer")
        with monitor(monitor_slug="daily-sync-optimizer"):
            management.call_command("optimizer")
        self.console.print("Running statuser")
        with monitor(monitor_slug="daily-sync-statuser"):
            management.call_command("statuser")
        self.console.print("Running dailystats")
        with monitor(monitor_slug="daily-sync-stats"):
            management.call_command("dailystats")
        self.console.print("All done! 🎉")

Expected Result

No alerts when the job finishes without errors, as seen in the logs:

Oct 08 02:00:00 raspberrypi bash[405]: Starting daily job
Oct 08 02:00:00 raspberrypi bash[405]: Running crawler
Oct 08 02:07:00 raspberrypi bash[405]: Running optimizer
Oct 08 02:21:31 raspberrypi bash[405]: Running statuser
Oct 08 03:29:49 raspberrypi bash[405]: Running dailystats
Oct 08 03:29:49 raspberrypi bash[405]: Gathering daily stats
Oct 08 03:29:50 raspberrypi bash[405]: All done! 🎉

Actual Result

The monitor sometimes fails for the job that takes more than 1 hour:

Screenshot 2023-10-08 at 09 33 14

Even though the monitor after it (daily-sync-stats) worked ok:

Screenshot 2023-10-08 at 09 35 11

And interestingly, the daily-sync monitor which wraps the whole job function also passed without problems.

I initially only had the daily-sync monitor configured since I don't really need every step monitored, but when I only had a single monitor it failed consistently every day. With multiple monitors the failures are more intermittent.

I'm also not sure if it's relevant, but the job function is part of an always-running Django management command that looks like this:

    # The scheduler uses the `schedule` package (requires `import schedule` and `import time`).
    def handle(self, *args, run_now=False, **options):
        self.console.print("Starting scheduler 🕐")
        schedule.every().day.at("01:00").do(self.job)

        while True:
            schedule.run_pending()
            time.sleep(1)

anze3db avatar Oct 08 '23 08:10 anze3db

Hey @anze3db! Thanks for reporting this. Indeed, this should not happen. The fact that daily-sync passes is OK, because monitors are independent and we do not have a concept of child-monitors and thus do not check for status of child-monitors.

What are the grace and timeout settings of the daily-sync and the other sub-monitors?

antonpirker avatar Oct 09 '23 07:10 antonpirker

The fact that daily-sync passes is OK, because monitors are independent and we do not have a concept of child-monitors and thus do not check for status of child-monitors.

Yes, I agree. I just thought I'd mention that the wrapper monitor passed. I think this should rule out any errors in my function.

What are the grace and timeout settings of the daily-sync and the other sub-monitors?

All the monitors have a grace period of 60 minutes and a max runtime of 300 minutes:

Screenshot 2023-10-09 at 09 23 49.

I can try tightening the values and report back if the issue still reproduces.
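
For what it's worth, newer SDK versions also let these thresholds be declared from code by passing a monitor_config upsert to the monitor decorator. A rough sketch follows; the crontab schedule, the exact SDK version that supports monitor_config, and the field names are assumptions to verify against the Sentry crons docs, and the margin/runtime values are in minutes:

    from sentry_sdk.crons import monitor

    # Illustrative values only; check the Sentry crons docs for the full schema.
    daily_sync_config = {
        "schedule": {"type": "crontab", "value": "0 1 * * *"},  # assumed: 01:00 daily
        "timezone": "UTC",
        "checkin_margin": 60,   # grace period, minutes
        "max_runtime": 300,     # max runtime, minutes
    }

    @monitor(monitor_slug="daily-sync", monitor_config=daily_sync_config)
    def job(self):
        ...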

anze3db avatar Oct 09 '23 08:10 anze3db

Just an update that the job timed out today with the following threshold settings:

Screenshot 2023-10-10 at 14 24 54

Screenshot 2023-10-10 at 14 30 36

Let me know if there is anything I can do to help reproduce/resolve this.

anze3db avatar Oct 10 '23 13:10 anze3db

Thanks for the update. Hm, indeed very strange. Currently I do not have even a hunch as to why this is happening. All your daily tasks are Django management commands and look the same. Is there anything in the statuser task that is fundamentally different compared to the other tasks? Does it use async code, for example, or anything else you can think of?

antonpirker avatar Oct 11 '23 07:10 antonpirker

The only difference is that statuser runs the longest, at more than an hour, while the other tasks all take less than 20 minutes.

Statuser does use async, but in exactly the same way as the crawler command. I use @async_to_sync in both cases to get around the fact that Django management commands are sync-only:

    # Requires `from asgiref.sync import async_to_sync`.
    # Main entry point for a Django Command:
    def handle(
        self,
        *args,
        offset=0,
        instances=None,
        skip_inactive_for: int = 0,
        pre_filter: bool = False,
        **options,
    ):
        self.main(
            offset=offset,
            instances=instances,
            skip_inactive_for=skip_inactive_for,
            pre_filter=pre_filter,
        )

    # My Async method that does all the work:
    @async_to_sync
    async def main(
        self,
        offset: int,
        instances: str | None,
        skip_inactive_for: int = 0,
        pre_filter: bool = False,
    ):
         ...

I can try rewriting the code to not use async to see if async is the cause of this problem.

anze3db avatar Oct 11 '23 16:10 anze3db

Even without any async code in Statuser the monitor still failed today. 😔

anze3db avatar Oct 12 '23 06:10 anze3db

Dang. And the monitor is marked as "Timed out", right? One thing you could check is whether the status of the monitor, a bit over an hour into processing, is set to "in progress" or "timed out", because Sentry only sets it to "timed out" after the full 5 hours. If it is set to "timed out" before then, something strange is going on.
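
If it helps, the check-in history can also be pulled programmatically while the job is running. Here is a rough sketch using the organization monitors check-ins API; the org slug and token are placeholders, and the exact response fields are assumptions worth checking against the API docs:

    import requests

    ORG = "your-org-slug"          # placeholder
    MONITOR = "daily-sync-statuser"
    TOKEN = "..."                  # placeholder: an auth token with read access

    resp = requests.get(
        f"https://sentry.io/api/0/organizations/{ORG}/monitors/{MONITOR}/checkins/",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    for checkin in resp.json():
        # Expecting fields along the lines of id / status / duration.
        print(checkin.get("id"), checkin.get("status"), checkin.get("duration"))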

antonpirker avatar Oct 12 '23 06:10 antonpirker

And the monitor is marked as "Timed out", right?

Correct.

One thing you could check is whether the status of the monitor, a bit over an hour into processing, is set to "in progress" or "timed out".

I've just done that. It is in the "in progress" state while it's processing and stays in the "in progress" state after the job finishes:

Screenshot 2023-10-12 at 13 22 26

anze3db avatar Oct 12 '23 12:10 anze3db

So it stays "in progress" for the whole 5 hours and then is set to "failed"? Is this correct?

antonpirker avatar Oct 16 '23 14:10 antonpirker

Yes, correct.

anze3db avatar Oct 17 '23 08:10 anze3db

OK, thanks for confirming. I still don't know what is going wrong. I have asked the team handling cron check-ins in our backend if they have any ideas about what could be going wrong...

antonpirker avatar Oct 20 '23 06:10 antonpirker

Just wanted to check in: if you try with the newest version of the SDK, does this still happen?

antonpirker avatar Dec 15 '23 09:12 antonpirker

Yesterday I re-enabled the monitor that was timing out before, and today it looks like it passed:

Screenshot 2023-12-16 at 10 40 32

Looks like the issue has been resolved, so I'm closing this, but I'll reopen if I see any more failures. Thank you for taking care of this!

anze3db avatar Dec 16 '23 10:12 anze3db

Oh no, I am reopening the issue because it has just happened again:

Screenshot 2023-12-18 at 13 19 04

I see from my logs that there were no errors and that the statuser job finished after about 30 minutes:

Dec 18 01:05:08 raspberrypi bash[758000]: Running optimizer
Dec 18 01:05:08 raspberrypi bash[758000]: Running statuser
Dec 18 01:05:15 raspberrypi bash[758000]: Batch 0 done, sleeping for 90s
Dec 18 01:06:55 raspberrypi bash[758000]: Batch 1 done, sleeping for 90s
Dec 18 01:08:31 raspberrypi bash[758000]: Batch 2 done, sleeping for 90s
Dec 18 01:10:47 raspberrypi bash[758000]: Batch 3 done, sleeping for 90s
Dec 18 01:12:26 raspberrypi bash[758000]: Batch 4 done, sleeping for 90s
Dec 18 01:14:08 raspberrypi bash[758000]: Batch 5 done, sleeping for 90s
Dec 18 01:15:45 raspberrypi bash[758000]: Batch 6 done, sleeping for 90s
Dec 18 01:17:22 raspberrypi bash[758000]: Batch 7 done, sleeping for 90s
Dec 18 01:18:58 raspberrypi bash[758000]: Batch 8 done, sleeping for 90s
Dec 18 01:20:35 raspberrypi bash[758000]: Batch 9 done, sleeping for 90s
Dec 18 01:22:12 raspberrypi bash[758000]: Batch 10 done, sleeping for 90s
Dec 18 01:23:48 raspberrypi bash[758000]: Batch 11 done, sleeping for 90s
Dec 18 01:25:25 raspberrypi bash[758000]: Batch 12 done, sleeping for 90s
Dec 18 01:27:02 raspberrypi bash[758000]: Batch 13 done, sleeping for 90s
Dec 18 01:28:39 raspberrypi bash[758000]: Batch 14 done, sleeping for 90s
Dec 18 01:30:16 raspberrypi bash[758000]: Batch 15 done, sleeping for 90s
Dec 18 01:31:53 raspberrypi bash[758000]: Batch 16 done, sleeping for 90s
Dec 18 01:33:30 raspberrypi bash[758000]: Batch 17 done, sleeping for 90s
Dec 18 01:35:07 raspberrypi bash[758000]: Batch 18 done, sleeping for 90s
Dec 18 01:36:41 raspberrypi bash[758000]: Batch 19 done, sleeping for 90s
Dec 18 01:38:11 raspberrypi bash[758000]: Running dailystats
Dec 18 01:38:11 raspberrypi bash[758000]: Gathering daily stats
Dec 18 01:38:22 raspberrypi bash[758000]: All done! 🎉

anze3db avatar Dec 18 '23 13:12 anze3db

Hey @anze3db, are you still seeing this issue? We've recently fixed some behavior where traces were marked as errored out even when there was no error in some circumstances, so I'm wondering if that could also apply here. Are you able to upgrade to the latest SDK version and try it out?

sentrivana avatar Mar 13 '25 12:03 sentrivana

@sentrivana hey, I removed the Sentry monitor integration a while ago, so I'm no longer able to test or reproduce this. I'll close the issue for now and reopen it if I see it happen again.

anze3db avatar Mar 13 '25 13:03 anze3db

Understood, sorry to hear that but thank you for following up.

sentrivana avatar Mar 13 '25 13:03 sentrivana