Crons check-in at D+1
Environment
SaaS (https://sentry.io/)
Steps to Reproduce
We are experiencing random missed check-in notifications due to check-ins being linked to D+1.
Here is the situation on January 31st. Two crons are logged, one for today (missed) and one for tomorrow (ok).
Expected Result
We expected the check-in to be linked to the correct date.
Actual Result
The check-in is linked to the D+1 job.
Product Area
Crons
Link
No response
DSN
https://[email protected]/4503942203310080
Version
No response
Assigning to @getsentry/support for routing ⏲️
Routing to @getsentry/product-owners-crons for triage ⏲️
Hey @mdefeche, we appreciate you reporting this issue to us. We can see something is off here -- our team is currently investigating what went wrong, and we'll get back to you on next steps.
It looks like raising the grace period from the default to 5 minutes helped mitigate the issue.
Hey @mdefeche what SDK configuration are you using to report your cron jobs?
You have a large miss window (5 minutes) and your check-ins all look to be coming in on time. I have a few ideas of what could be happening here, but it's going to take a bit more investigation on our part.
We record the 'next expected check-in time' on the monitor itself. That time only moves forward when either a check-in happens or a miss is recorded, and we then record that timestamp onto the check-in as we process it. So what appears to be happening is:
- You produce a check-in at 2:00, however it gets stuck in an early stage of the processing pipeline where we've received it but not yet processed it
- Somehow that check-in is stuck for 5 minutes in that stage, so we instead detect your monitor as being 5 minutes past its expected check-in time. We create a miss with the expected check-in time; at this point the monitor's next expected check-in time is updated to tomorrow.
- The check-in you produced at 2:00 makes it completely through the processing pipeline, but since the miss was already detected, we detect this check-in as being for the next day (also see https://github.com/getsentry/sentry/issues/44393)
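Sketched as illustrative pseudologic (this is not our actual implementation, just the shape of the race described above):

```python
from datetime import datetime, timedelta

# Illustrative only -- not Sentry's actual code. Models the race where a
# check-in was produced on time but is still unprocessed when the miss
# detector fires.
GRACE = timedelta(minutes=5)

def detect_misses(monitor: dict, now: datetime) -> None:
    # The monitor stores a single "next expected check-in time".
    if now > monitor["next_expected"] + GRACE:
        monitor["misses"].append(monitor["next_expected"])
        # The expectation then moves forward (to tomorrow for a daily cron)...
        monitor["next_expected"] += timedelta(days=1)

def process_checkin(monitor: dict, checkin: dict) -> None:
    # ...so a check-in processed after the miss gets stamped with the
    # already-advanced expectation, i.e. the D+1 period.
    checkin["linked_period"] = monitor["next_expected"]

monitor = {"next_expected": datetime(2025, 1, 31, 2, 0), "misses": []}
checkin = {"produced_at": datetime(2025, 1, 31, 2, 0)}      # sent on time
detect_misses(monitor, now=datetime(2025, 1, 31, 2, 5, 1))  # miss recorded first
process_checkin(monitor, checkin)                           # then linked to Feb 1
```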
That's my working theory. We have a number of systems in place that are supposed to synchronize various timings (due to the distributed nature of how this system works). But it's very possible one of those systems may not be working as expected.
Will keep you updated as we investigate more.
Python SDK, version 2.17.0. Thank you for the explanation
Here's the process we have on our side, and the reason why raising the grace period may have helped:
- We use django-celery-beat to register tasks in Redis
- Tasks are processed by Celery. However, Celery is not always running: it's meant to start every hour, check if there are tasks in Redis, process them if there are, and shut down if there are none. Typically the process of starting Celery takes a bit more than a minute, which explains why there were some missed check-ins:
2025-02-11T01:01:39.190Z | [2025-02-11 02:01:39,190: INFO/MainProcess] Task ************************ received
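For reference, the schedules are registered through django-celery-beat roughly like this (a minimal sketch with placeholder names, not our exact fixtures):

```python
# Minimal sketch of registering a daily task with django-celery-beat;
# the task path, name, and schedule below are placeholders.
from django_celery_beat.models import CrontabSchedule, PeriodicTask

schedule, _ = CrontabSchedule.objects.get_or_create(
    minute="0", hour="3", day_of_week="*", day_of_month="*", month_of_year="*",
)
PeriodicTask.objects.get_or_create(
    name="nightly-report",                       # placeholder name
    task="myapp.tasks.generate_nightly_report",  # placeholder task path
    crontab=schedule,
)
```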
Raising the grace period helped, but we still experience weird behaviors:
It's April 1st, but we still got a check-in dated April 2nd.
For another task, the production run took less than 10 minutes but was still considered timed out (the timeout is configured at 30 minutes).
Those false negatives make it hard to rely on this monitoring.
Hey thanks for following up @mdefeche
I took another look; we added a fair bit more telemetry, and now I am pretty certain there's a long delay (~12 minutes) in processing some of your check-ins in our ingestion service (Relay). This is why you're ending up with misses followed by an OK for the future period.
I need to figure out what's going on upstream before I can give any definitive answer. Will let you know!
Hey @mdefeche, I've looked closer at this and I think my earlier assessment about a delay in processing your check-ins is incorrect. What I think is actually happening here is that your in_progress check-in may not be getting sent, so the sequence is:
- 04:00 CEST → `in_progress` should be sent to Sentry, but it is not
- 04:05 CEST → the check-in expected at this time is marked as missed
- 04:12 CEST → the `ok` check-in is received; because it has a duration of `12 minutes`, its start time is backdated to 04:00 CEST
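For context, the sequence Sentry expects can also be reported explicitly from the SDK, along these lines (a sketch using manual check-ins; the slug and job function are placeholders):

```python
import sentry_sdk
from sentry_sdk.crons import capture_checkin
from sentry_sdk.crons.consts import MonitorStatus

sentry_sdk.init(dsn="...")  # placeholder DSN

def run_the_job():
    ...  # placeholder for the actual work

# 1. Open the check-in when the job starts, so Sentry knows it's running...
check_in_id = capture_checkin(
    monitor_slug="nightly-job",  # placeholder slug
    status=MonitorStatus.IN_PROGRESS,
)

run_the_job()

# 2. ...and close it with the same id once the job finishes.
capture_checkin(
    monitor_slug="nightly-job",
    check_in_id=check_in_id,
    status=MonitorStatus.OK,
)
```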
Unfortunately we don't do a great job of showing check-in history for individual jobs. I've brought this up with the team as something we should fix so that it's clearer why scenarios like this are happening.
Could you expand more on how you're using the sentry SDK for reporting your jobs? Are you using the Celery auto-discovery?
Yes, we're using the following integrations: `[DjangoIntegration(), CeleryIntegration(monitor_beat_tasks=True)]`. We use django-celery-beat to record schedules.
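Concretely, the init looks roughly like this (DSN and unrelated options omitted):

```python
import sentry_sdk
from sentry_sdk.integrations.celery import CeleryIntegration
from sentry_sdk.integrations.django import DjangoIntegration

sentry_sdk.init(
    dsn="...",  # omitted
    integrations=[
        DjangoIntegration(),
        # monitor_beat_tasks=True auto-instruments celery beat tasks as crons
        CeleryIntegration(monitor_beat_tasks=True),
    ],
)
```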
Thank you for taking the time to investigate this.
Cool, I'm planning to add a bit more historic data on check-ins to track when the initial in_progress check-in was received, so we'll be able to better understand if those check-ins are being lost for some reason.
@antonpirker is there a way with the python SDK to enable some kind of logging that reports when the in-progress and OK check-ins are sent?
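In the meantime, one thing that might surface this: the Python SDK's `debug` option logs the SDK's internal activity (including envelopes being sent), which should show whether the check-ins leave the process at all. A minimal sketch:

```python
import sentry_sdk

# debug=True makes the SDK log its internal activity (including envelope
# submission) to stderr, which should show whether check-ins leave the process.
sentry_sdk.init(
    dsn="...",  # placeholder
    debug=True,
)
```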
It looks like this happens systematically for two crons in the same environment (i.e. production). The pattern is often the same: the check-in happens a bit before expected, the task is registered for the D+1 day and marked as "Okay", and the task on day D is marked as "Missed". All environments are configured the same way through IaC and the crons are configured with the same fixtures, so I'm not sure why the check-in always happens too early for that environment.
Hey @mdefeche thanks for the additional details. We’re still investigating and hope to come back with more information on this soon.
For now you can raise your missed window to longer than the runtime of the job and this should avoid the problem (I know this isn't ideal!)
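If you manage the monitor from code rather than in the UI, those margins live in the monitor config, something along these lines (a sketch; the slug, schedule, timezone, and values are placeholders):

```python
from sentry_sdk.crons import monitor

# Sketch only: slug, schedule, timezone, and values are placeholders.
monitor_config = {
    "schedule": {"type": "crontab", "value": "0 3 * * *"},
    "timezone": "Europe/Paris",
    "checkin_margin": 10,  # the "miss window", in minutes
    "max_runtime": 60,     # raise this above the longest expected runtime
}

@monitor(monitor_slug="nightly-job", monitor_config=monitor_config)
def nightly_job():
    ...
```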
Sorry didn't mean to close this 😅
Hey @evanpurkhiser, I raised the missed window as suggested but this hasn't helped. Any news on this?
Can you elaborate more on what the behavior is when the missed window is larger than the runtime of the entire job?
The issue arises only for 2 of the 19 crons we have, and only in production (same architecture: AWS ECS Fargate).
Setup
Duration smaller than grace period
Is there specific logging that can be turned on to help you troubleshoot this issue further?
Logs look fine
[2025-05-13 03:01:24,820: INFO/MainProcess] Task ***************************** [2234dcb0-b326-448e-8928-25ff36487e29] received
...
[2025-05-13 03:16:01,369: INFO/ForkPoolWorker-1] Task ***************************** [2234dcb0-b326-448e-8928-25ff36487e29] succeeded in 876.54832333
damn, didn't mean to close this, sorry! 🤦
I'm adding a bunch more timing metrics right now which should help to debug these. I'll leave a comment when they are available :)
> The issue arises only for 2 of the 19 crons we have, and only in production (same architecture: AWS ECS Fargate).
It looks like they're being marked as Timed Out now, right?
Originally we had 2 tasks scheduled for 3am, with a single Celery worker (the grace period and timeout accounted for those 2 tasks). The tasks have timed out most of the time since we raised the grace period as per your recommendation.
I made some changes a couple of days ago to only have a single task at 3am, and now check-ins are missed because the check-in happened one minute too early (red), even though the task was started at 3:01am in the logs, or for some unknown reason (blue).
(btw you can actually filter the timeline by environment like this)
And what did you say your miss window is configured to?
Here's a filtered view:
I changed grace period from 55 to 10 min
My suspicion is that some of your check-ins are not making it through. Here are the two scenarios I suspect:
- When you're getting an OK followed by a miss (or the other way around), that's happening because no `in_progress` check-in is being sent at the start of your job. An `ok` is then sent once the job has completed, but the job is completing AFTER we've already marked your monitor as missed. Your job takes 12-20 minutes, but your miss window is 10 minutes. What's confusing here is that the `Started` time is actually back-dated based on the duration of your job when we only receive an OK check-in. So we didn't know the job was running when we marked it as missed at 03:10 CEST; then at 03:12 CEST we got your OK and back-dated it to 03:12 minus 12 minutes (and probably some seconds, which is why it ends up at 02:59:59).
- When you're getting a TIMEOUT, it's because you sent an `in_progress` (which is why there is no miss; the in-progress was sent on time when the job started), but we never got the closing OK check-in. If you WERE sending a terminal OK / ERROR after it was marked timed out we would actually indicate a duration, but since there's no duration we know no closing check-in was sent.
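To make the back-dating in the first scenario concrete (a simplified sketch of the arithmetic, with example timestamps):

```python
from datetime import datetime, timedelta

# Simplified arithmetic for the first scenario, with example timestamps.
ok_received_at = datetime(2025, 5, 13, 3, 12, 0)       # 03:12 CEST
reported_duration = timedelta(minutes=12, seconds=1)   # duration sent with the OK
miss_marked_at = datetime(2025, 5, 13, 3, 10, 0)       # 03:10 CEST, end of miss window

backdated_start = ok_received_at - reported_duration   # 02:59:59 CEST
print(backdated_start, backdated_start < miss_marked_at)  # the "Started" time predates the miss
```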
I am updating the crons check-in table now with a number of additional diagnostics so we can validate that this is what's happening.
I'll have to rope in some of the SDK people to understand why your check-ins may not be getting sent.
Hey again @mdefeche -- We've finally gotten around to updating this checks table with much more detail (https://github.com/getsentry/sentry/pull/91981)
(note that we started writing timestamps slightly differently starting around the 16th, check-ins before that may be represented in the checks table incorrectly)
It should be much more obvious what's going on with your monitor now.
It definitely does look like we have some kind of bug with the Python Celery integration (we're actually seeing something similar with some of our own Celery jobs). I'll put together a bug report for our SDK and track the fix there 👍
Nice, thank you for the follow-up!
Here's a new screenshot after the update. Do I need to increase the grace period so that it's longer than the execution time?
That would definitely solve the problem, yes, but the real issue here seems to be that your Celery integration isn't sending an in_progress check-in, and getting that fixed would be the more correct solution.
> We use django-celery-beat to register tasks in Redis. However Celery is not always running, it's meant to start every hour, check if there are tasks in Redis
This is sounding like it could be part of the problem. Could you expand upon how this system works? @antonpirker maybe you could also share a bit of detail about how the celery integration typically handles sending in-progress and complete check-ins.
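My rough mental model of the integration, sketched below purely for illustration (this is not the SDK's actual code, and the slug handling is a guess), is that it opens a check-in when the worker starts a beat task and closes it when the task finishes or fails:

```python
from celery import signals
from sentry_sdk.crons import capture_checkin
from sentry_sdk.crons.consts import MonitorStatus

# Illustrative only -- not the SDK's actual implementation.
_open_checkins = {}

@signals.task_prerun.connect
def _open_checkin(task_id=None, task=None, **kwargs):
    # Worker picked the task up: open an in_progress check-in.
    _open_checkins[task_id] = capture_checkin(
        monitor_slug=task.name,  # slug derivation here is a guess
        status=MonitorStatus.IN_PROGRESS,
    )

@signals.task_postrun.connect
def _close_checkin(task_id=None, task=None, state=None, **kwargs):
    # Task finished: close the check-in with a terminal status.
    status = MonitorStatus.OK if state == "SUCCESS" else MonitorStatus.ERROR
    capture_checkin(
        monitor_slug=task.name,
        check_in_id=_open_checkins.pop(task_id, None),
        status=status,
    )
```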
We have 16 different crons, most of them running once a day, in 3 different environments that all have the same setup on AWS (IaC). We consistently have the early check-in issue for a single cron and in a single environment. The other crons work fine!
- Every hour an event is sent from an event bus to a Lambda
- The Lambda checks if Celery is running (an ECS service in a cluster) and starts it if it's not
- Another service in the cluster runs django-celery-beat, which sends a message to Redis whenever a task is scheduled
- Celery picks up the task from Redis if there is one; otherwise it's shut down by another Lambda (also triggered by an event)
We use the aforementioned architecture to limit cost (Celery runs on a big instance). All crons are handled the same way.
In our logs, everything looks OK. The task is usually picked up one or two minutes after the scheduled time (which is OK for us).
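Roughly, the "start Celery if needed" Lambda does something like this (a sketch; the cluster and service names are placeholders):

```python
import boto3

ecs = boto3.client("ecs")

def handler(event, context):
    # Sketch of the hourly Lambda; cluster and service names are placeholders.
    service = ecs.describe_services(
        cluster="worker-cluster", services=["celery-worker"]
    )["services"][0]
    if service["runningCount"] == 0 and service["desiredCount"] == 0:
        # Start the Celery worker service; a companion Lambda scales it back
        # down once the Redis queue is empty.
        ecs.update_service(
            cluster="worker-cluster", service="celery-worker", desiredCount=1
        )
```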