
Crons check-in at D+1

Open mdefeche opened this issue 11 months ago • 32 comments

Environment

SaaS (https://sentry.io/)

Steps to Reproduce

We are experiencing random missed check-in notifications because check-ins are being linked to D+1.

Here is the situation on January 31st. Two crons are logged, one for today (missed) and one for tomorrow (ok).

Image

Expected Result

We expected the check-in to be linked to the correct date.

Actual Result

The check-in is linked to the D+1 job.

Product Area

Crons

Link

No response

DSN

https://[email protected]/4503942203310080

Version

No response

mdefeche avatar Jan 31 '25 09:01 mdefeche

Assigning to @getsentry/support for routing ⏲️

getsantry[bot] avatar Jan 31 '25 09:01 getsantry[bot]

Routing to @getsentry/product-owners-crons for triage ⏲️

getsantry[bot] avatar Jan 31 '25 11:01 getsantry[bot]

Hey @mdefeche, we appreciate you reporting this issue to us. We can see something is off here -- our team is currently investigating what went wrong, and we'll get back to you on next steps.

gaprl avatar Feb 11 '25 18:02 gaprl

It looks like raising the grace period from the default to 5 min helped mitigate the issue.

mdefeche avatar Feb 11 '25 19:02 mdefeche

Hey @mdefeche what SDK configuration are you using to report your cron jobs?

You have a large miss window (5 minutes) and your checks all look to be coming in on time. I have a few ideas of what could be happening here but it's going to take a bit more investigation on our part.

We record the 'next expected check-in time' on the monitor itself. That time only moves forward when either a check-in happens or a miss is recorded, and we write that timestamp onto the check-in as we process it. So what appears to be happening is:

  1. You produce a check-in at 2:00, but it gets stuck in an early stage of the processing pipeline where we've received it but not yet processed it.
  2. That check-in sits in that stage for 5 minutes, so we instead detect your monitor as being 5 minutes past its expected check-in time. We create a miss with the expected check-in time, and at that point the monitor's next expected check-in time is updated to tomorrow.
  3. The check-in you produced at 2:00 finally makes it all the way through the processing pipeline, but since the miss was already detected, we attribute this check-in to the next day (also see https://github.com/getsentry/sentry/issues/44393).

That's my working theory. We have a number of systems in place that are supposed to synchronize various timings (due to the distributed nature of how this system works). But it's very possible one of those systems may not be working as expected.
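Roughly, the race I have in mind looks like this (a toy sketch to illustrate the theory, not our actual processing code):

```python
from datetime import datetime, timedelta

# Toy model of the suspected race: the monitor keeps a single
# "next expected check-in" timestamp that only moves forward.
CHECKIN_MARGIN = timedelta(minutes=5)
SCHEDULE_INTERVAL = timedelta(days=1)

class Monitor:
    def __init__(self, next_expected: datetime):
        self.next_expected = next_expected

    def detect_miss(self, now: datetime):
        # Runs periodically. If the margin elapsed with no processed
        # check-in, record a miss and advance to the next period.
        if now > self.next_expected + CHECKIN_MARGIN:
            print(f"missed check-in recorded for {self.next_expected}")
            self.next_expected += SCHEDULE_INTERVAL

    def process_checkin(self, received_at: datetime):
        # A check-in produced on time but delayed in the pipeline arrives
        # here AFTER detect_miss already advanced the monitor, so it is
        # attributed to the following day's slot.
        print(f"check-in received {received_at}, attributed to {self.next_expected}")
        self.next_expected += SCHEDULE_INTERVAL

monitor = Monitor(next_expected=datetime(2025, 1, 31, 2, 0))
monitor.detect_miss(now=datetime(2025, 1, 31, 2, 6))              # miss for Jan 31
monitor.process_checkin(received_at=datetime(2025, 1, 31, 2, 7))  # ok lands on Feb 1
```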

Will keep you updated as we investigate more.

evanpurkhiser avatar Feb 11 '25 20:02 evanpurkhiser

Python SDK, version 2.17.0. Thank you for the explanation

Here's the process we have on our side and the reason why raising the grace period may have helped

  1. We use django-celery-beat to register tasks in redis
  2. Tasks are processed by Celery. However, Celery is not always running: it's meant to start every hour, check whether there are tasks in redis, process them if there are, and shut down if there are none. Typically the process of starting Celery takes a bit more than a minute, which explains why there were some missed check-ins:
 2025-02-11T01:01:39.190Z | [2025-02-11 02:01:39,190: INFO/MainProcess] Task ************************ received
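For context, the schedules themselves are registered with django-celery-beat roughly like this (a simplified sketch; the task name and dotted path are placeholders):

```python
# Simplified sketch of how a daily 03:00 schedule is registered with
# django-celery-beat (run inside the Django project; names are placeholders).
from django_celery_beat.models import CrontabSchedule, PeriodicTask

schedule, _ = CrontabSchedule.objects.get_or_create(
    minute="0",
    hour="3",
    day_of_week="*",
    day_of_month="*",
    month_of_year="*",
)

PeriodicTask.objects.get_or_create(
    name="nightly-production-task",   # placeholder
    task="myapp.tasks.nightly_job",   # placeholder path to the celery task
    crontab=schedule,
)
```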

mdefeche avatar Feb 11 '25 21:02 mdefeche

Raising the grace period helped but we still experience weird behaviors

It is April 1st, but we still got a check-in for April 2nd.
Image

For another task, the production run took less than 10 minutes but was still marked as timed out (the timeout is configured at 30 min).
Image

These false alarms make it hard to rely on this monitoring.

mdefeche avatar Apr 01 '25 07:04 mdefeche

Hey thanks for following up @mdefeche

I took a look again; we added a fair bit more telemetry, and now I'm pretty certain there's a long delay (~12 min) in processing some of your check-ins in our ingestion service (relay). This is why you're ending up with misses followed by an OK for the future period.

I need to figure out what's going on upstream before I can give any definitive answer. Will let you know!

evanpurkhiser avatar Apr 01 '25 22:04 evanpurkhiser

Hey @mdefeche, I've looked closer at this and I think my earlier assessment about a delay in processing your check-ins is incorrect. What I think is actually happening is that your in_progress check-in may not be getting sent, so the sequence is:

04:00 CEST → `in_progress` should be sent to sentry, but it is not
04:05 CEST → check-in expected at this time is marked as missed
04:12 CEST → `ok` check-in is received; because it has a duration of `12 minutes`, it's backdated to a 04:00 CEST start time.

Unfortunately we don't do a great job of showing check-in history for individual jobs. I've brought this up with the team as something we should fix, so that it's clearer why scenarios like this are happening.
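For reference, the flow we expect to see looks roughly like the pair of calls below, a sketch using the SDK's manual check-in helpers (the Celery integration is supposed to do the equivalent automatically; the monitor slug is a placeholder):

```python
# Sketch of the in_progress -> ok check-in pair using the SDK's manual
# check-in helpers (assumes sentry_sdk.init(...) has already run;
# "nightly-production-task" is a placeholder monitor slug).
from sentry_sdk.crons import capture_checkin
from sentry_sdk.crons.consts import MonitorStatus

check_in_id = capture_checkin(
    monitor_slug="nightly-production-task",
    status=MonitorStatus.IN_PROGRESS,  # sent at job start; prevents a "missed"
)

# ... the actual job body runs here ...

capture_checkin(
    monitor_slug="nightly-production-task",
    check_in_id=check_in_id,
    status=MonitorStatus.OK,           # closing check-in; prevents a "timed out"
)
```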

Could you expand more on how you're using the sentry SDK for reporting your jobs? Are you using the Celery auto-discovery?

evanpurkhiser avatar Apr 02 '25 16:04 evanpurkhiser

Yes, we're using the following integrations: `[DjangoIntegration(), CeleryIntegration(monitor_beat_tasks=True)]`. We use django-celery-beat to record schedules.
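Concretely, the init looks essentially like this (simplified, DSN redacted):

```python
import sentry_sdk
from sentry_sdk.integrations.django import DjangoIntegration
from sentry_sdk.integrations.celery import CeleryIntegration

sentry_sdk.init(
    dsn="https://<key>@o87286.ingest.sentry.io/<project>",  # redacted
    integrations=[
        DjangoIntegration(),
        CeleryIntegration(monitor_beat_tasks=True),  # auto check-ins for beat tasks
    ],
)
```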

Thank you for taking the time to investigate this.

mdefeche avatar Apr 02 '25 21:04 mdefeche

Cool, I'm planning to add a bit more historic data on check-ins to track when the initial in_progress check-in was received, so we'll be able to better understand whether those check-ins are being lost for some reason.

@antonpirker is there a way with the python SDK to enable some kind of logging that reports when the in-progress and OK check-ins are sent?
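In the meantime @mdefeche, one low-effort thing you could try (with the caveat that I'd need to confirm with the SDK folks that check-in envelopes show up in this output): turn on the SDK's debug mode and watch the beat/worker logs around the scheduled time.

```python
import sentry_sdk

sentry_sdk.init(
    # ... keep your existing dsn= and integrations= here ...
    debug=True,  # the SDK logs the envelopes it prepares and sends to stderr
)
```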

evanpurkhiser avatar Apr 03 '25 16:04 evanpurkhiser

It looks like this happens systematically for two crons in the same environment (i.e. production). The pattern is often the same: the check-in happens a bit before it's expected, the task is registered for the D+1 day and marked as "Okay", and the task on day D is marked as "Missed". All environments are configured the same way through IaC and the crons are configured with the same fixtures, so I'm not sure why the check-in always happens too early for that environment.

Image

mdefeche avatar Apr 15 '25 13:04 mdefeche

Hey @mdefeche thanks for the additional details. We’re still investigating and hope to come back with more information on this soon.

For now you can raise your miss window to be longer than the runtime of the job, which should avoid the problem (I know this isn't ideal!).
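If you'd rather keep those thresholds in code next to your IaC, the SDK also lets you upsert monitor settings; a rough sketch is below (slug, schedule, and numbers are placeholders, and I'm not certain offhand how this interacts with the beat auto-instrumentation, so treat it as an option to explore rather than the fix):

```python
# Sketch of pinning monitor thresholds from code via the SDK's monitor
# decorator (placeholder values; checkin_margin / max_runtime are in minutes).
from sentry_sdk.crons import monitor

monitor_config = {
    "schedule": {"type": "crontab", "value": "0 3 * * *"},
    "timezone": "Europe/Paris",
    "checkin_margin": 25,  # grace period before the check-in counts as missed
    "max_runtime": 60,     # how long an in_progress check-in may run before timing out
}

@monitor(monitor_slug="nightly-production-task", monitor_config=monitor_config)
def nightly_job():
    ...
```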

evanpurkhiser avatar Apr 16 '25 22:04 evanpurkhiser

Sorry didn't mean to close this 😅

evanpurkhiser avatar Apr 16 '25 22:04 evanpurkhiser

Hey @evanpurkhiser, I raised the miss window as suggested but it hasn't helped. Any news on this?

mdefeche avatar May 13 '25 09:05 mdefeche

Can you elaborate on what the behavior is when the miss window is larger than the runtime of the entire job?

evanpurkhiser avatar May 13 '25 15:05 evanpurkhiser

The issue arises only for 2 of the 19 crons we have, and only in production (same architecture: AWS ECS Fargate).
Image

Setup:
Image

Duration smaller than grace period:
Image

Is there specific logging that can be turned on to help you troubleshoot this issue further?

Logs look fine

[2025-05-13 03:01:24,820: INFO/MainProcess] Task ***************************** [2234dcb0-b326-448e-8928-25ff36487e29] received
...
[2025-05-13 03:16:01,369: INFO/ForkPoolWorker-1] Task ***************************** [2234dcb0-b326-448e-8928-25ff36487e29] succeeded in 876.54832333

mdefeche avatar May 13 '25 20:05 mdefeche

damn, didn't mean to close this, sorry! 🤦

evanpurkhiser avatar May 15 '25 18:05 evanpurkhiser

I'm adding a bunch more timing metrics right now which should help to debug these. I'll leave a comment when they are available :)

evanpurkhiser avatar May 15 '25 18:05 evanpurkhiser

> Issue arises only for 2 of the 19 crons we have, and only in production (same architecture: AWS ECS Fargate)

It looks like they're being marked as Timed out now right?

evanpurkhiser avatar May 15 '25 18:05 evanpurkhiser

Originally we had 2 tasks scheduled for 3am, with a single celery worker (the grace period and timeout were set to account for those 2 tasks). Tasks were timed out most of the time since we raised the grace period as per your recommendation.
Image

I made some changes a couple of days ago to have only a single task at 3am, and now check-ins are missed because the check-in happened one minute too early (red), even though the task was started at 3:01am in the logs, or for some unknown reason (blue).
Image

mdefeche avatar May 15 '25 20:05 mdefeche

(btw, you can actually filter the timeline by environment like this)

Image

evanpurkhiser avatar May 15 '25 20:05 evanpurkhiser

And what did you say your miss window is configured to?

evanpurkhiser avatar May 15 '25 20:05 evanpurkhiser

Here's a filtered view:
Image
Image

I changed the grace period from 55 to 10 min:
Image

mdefeche avatar May 15 '25 20:05 mdefeche

My suspicion is that some of your check-ins are not making it through. Here are the two scenarios I suspect:

  • When you're getting an OK followed by a miss (or the other way around), that's happening because no in_progress check-in is being sent at the start of your job; an ok is only sent once the job has completed. But the job is completing AFTER we've already marked the check-in as missed: your job takes 12-20 minutes, but your miss window is 10 minutes. What's confusing here is that the Started time is being back-dated based on the duration of your job when we only receive an OK check-in. So we didn't know the job was running when we marked it as missed at 03:10 CEST; then at 03:12 CEST we get your OK and we back-date it as 03:12 minus 12 minutes (and probably some seconds, which is why it ends up at 02:59:59; see the small sketch after this list).

  • When you're getting a TIMEOUT, it's because you sent an in_progress (which is why there is no miss: the in_progress was sent on time when the job started), but we never got the closing OK check-in. If you WERE sending a terminal OK / ERROR after it was marked timed out, we would actually show a duration; since there's no duration, we know no closing check-in was sent.
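The back-dating in the first scenario is just simple arithmetic (times below are illustrative, matching the example above):

```python
from datetime import datetime, timedelta

# When only an OK check-in arrives, the Started time shown is the receive
# time minus the reported duration (illustrative values).
ok_received = datetime(2025, 5, 15, 3, 12, 11)
reported_duration = timedelta(minutes=12, seconds=12)

started_at = ok_received - reported_duration
print(started_at)  # 2025-05-15 02:59:59 -- just before the 03:00 schedule
```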

I am updating the crons check-in table now with a number of additional diagnostics so we can validate that this is what's happening.

I'll have to rope in some of the SDK people to understand why your check-ins may not be getting sent.

evanpurkhiser avatar May 15 '25 20:05 evanpurkhiser

Hey again @mdefeche -- We've finally gotten around to updating this checks table with much more detail (https://github.com/getsentry/sentry/pull/91981)

(note that we started writing timestamps slightly differently starting around the 16th, check-ins before that may be represented in the checks table incorrectly)

It should be much more obvious what's going on with your monitor now.

Image

It definitely does look like we have some kind of bug with the python celery integration (we're actually seeing something similar with some of our celery jobs). Will put together a bug report for our SDK and track the fix there 👍

evanpurkhiser avatar May 20 '25 23:05 evanpurkhiser

Nice, thank you for the follow-up!

mdefeche avatar May 21 '25 07:05 mdefeche

Here's a new screenshot after the update. Do I need to increase the grace period to be longer than the execution time?

Image

mdefeche avatar Jun 11 '25 09:06 mdefeche

That would definitely solve the problem, yes, but the real issue here seems to be that your celery integration isn't sending an in_progress check-in; fixing that would be the more correct solution.

> We use django-celery-beat to register tasks in redis. However Celery is not always running, it's meant to start every hour, check if there are tasks in redis

This is sounding like it could be part of the problem. Could you expand upon how this system works? @antonpirker maybe you could also share a bit of detail about how the celery integration typically handles sending in-progress and complete check-ins.

evanpurkhiser avatar Jun 11 '25 15:06 evanpurkhiser

We have 16 different crons, most of them running once a day, across 3 environments that all have the same setup on AWS (IaC). We consistently have the early check-in issue on a single cron and in a single environment. The other crons work fine!

  1. Every hour an event is sent from an event bus to a lambda
  2. A lambda checks whether Celery is running (an ECS service in a cluster) and starts it if it's not (a conceptual sketch of this lambda is at the end of this comment)
  3. Another service in the cluster runs django-celery-beat, which sends a message to redis whenever a task is scheduled
  4. Celery picks up the task from redis if there is one; otherwise it's shut down by another lambda (also triggered by an event)

We use this architecture to limit cost (Celery runs on a big instance). All crons are handled the same way.

In our logs, everything looks OK. The task is usually picked up one or two minutes after the scheduled time (which is fine for us).
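For illustration, the starter lambda from step 2 is conceptually along these lines (cluster and service names are placeholders, not our actual code):

```python
# Conceptual sketch of the "start Celery if it isn't running" lambda
# (cluster/service names are placeholders).
import boto3

ecs = boto3.client("ecs")

def handler(event, context):
    service = ecs.describe_services(
        cluster="workers-cluster",
        services=["celery-worker"],
    )["services"][0]

    # Scale the Fargate service up if nothing is running; it will drain
    # any scheduled tasks from redis, and a second lambda scales it back
    # down once the queue is empty.
    if service["runningCount"] == 0 and service["desiredCount"] == 0:
        ecs.update_service(
            cluster="workers-cluster",
            service="celery-worker",
            desiredCount=1,
        )
```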

mdefeche avatar Jun 11 '25 21:06 mdefeche