Sentry is missing some cron check-ins when used with sentry-python
How do you use Sentry?
Sentry Saas (sentry.io)
Version
1.39.1
Steps to Reproduce
I'm using this task to send check-ins from airflow:
import logging
from typing import Optional

import sentry_sdk
from sentry_sdk import crons
from sentry_sdk.integrations.logging import LoggingIntegration


def sentry_checkin(status: str, init_task_id: Optional[str] = None, **kwargs) -> str:
    sentry_logging = LoggingIntegration(
        level=logging.WARNING,  # Capture warning and above as breadcrumbs
        event_level=logging.ERROR,  # Send errors as events
    )
    sentry_sdk.init(
        environment="production", dsn=MY_PROJECT, integrations=[sentry_logging]
    )
    check_in_id = None
    if init_task_id:
        ti = kwargs["ti"]
        check_in_id = ti.xcom_pull(task_ids=init_task_id)
    return crons.capture_checkin(
        monitor_slug="my-monitor", check_in_id=check_in_id, status=status
    )
I use this task in my Airflow DAG; it is set up to run on every execution, both before the main logic and after.
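For context, a minimal sketch of how a task like this could be wired into a DAG (the DAG id, task ids, schedule, and the placeholder main task are illustrative, not taken from the real DAG):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path

with DAG(
    dag_id="my_dag",                    # hypothetical name
    schedule_interval="*/10 * * * *",   # the job runs every 10 minutes
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    checkin_start = PythonOperator(
        task_id="sentry_checkin_start",
        python_callable=sentry_checkin,
        op_kwargs={"status": "in_progress"},
    )
    main_logic = PythonOperator(
        task_id="main_logic",
        python_callable=lambda: None,   # placeholder for the real work
    )
    checkin_done = PythonOperator(
        task_id="sentry_checkin_done",
        python_callable=sentry_checkin,
        # Pull the check_in_id that the start task pushed to XCom.
        op_kwargs={"status": "ok", "init_task_id": "sentry_checkin_start"},
    )
    checkin_start >> main_logic >> checkin_done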
I used to have celery and django integrations included because I needed them in the past. Removing them improved the results.
Expected Result
All check-ins are visible in the Crons tab of the Sentry UI, with no missed check-ins while the tasks are working as expected.
The job runs every 10 minutes, so I expect evenly spaced successful check-ins.
Actual Result
Both initial check-ins and completion check-ins may be missing from time to time.
With celery and django integrations the history of check-ins looks like this:
Without celery and django integrations:
Thanks for creating the issue, @IevgeniiB!
We have plans to add proper Airflow support to our Crons product soon. The goal is that you will not need to create check-ins by hand; the Airflow integration will handle this.
@antonpirker This is great news! I noticed issues like this outside of Airflow too. Do you think it may be related to Airflow?
I am not sure if it is related to Airflow. The tasks finish without errors and it is just the check-ins that are not sent? Could you turn on debug=True in the Sentry SDK's init() and then have a look at the logs to see whether there are any hints there?
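For example (a minimal sketch, reusing the init call from the reproduction above):

sentry_sdk.init(
    environment="production",
    dsn=MY_PROJECT,
    integrations=[sentry_logging],
    debug=True,  # print SDK-internal debug output to the logs
)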
Anton, thank you for the suggestion. I've tried it and saw the following:
- There is nothing suspicious in the debug logs.
- I noticed that Sentry tries to load every integration it can find a package for. I disabled this behavior (see the sketch below) and nothing changed.
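For reference, "disabling this behavior" was presumably done along these lines (a sketch, assuming the automatically enabled integrations were turned off via the corresponding init flag):

sentry_sdk.init(
    environment="production",
    dsn=MY_PROJECT,
    integrations=[sentry_logging],
    debug=True,
    # Do not enable Celery/Django/Flask/etc. just because they are installed.
    auto_enabling_integrations=False,
)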
The logs before disabling the integrations, init checkin:
[2024-01-04, 14:40:11 EST] {__init__.py:113} DEBUG - Setting up integrations (with default = True)
[2024-01-04, 14:40:11 EST] {__init__.py:48} DEBUG - Did not import default integration sentry_sdk.integrations.falcon.FalconIntegration: Falcon not installed
[2024-01-04, 14:40:11 EST] {__init__.py:48} DEBUG - Did not import default integration sentry_sdk.integrations.fastapi.FastApiIntegration: Starlette is not installed
[2024-01-04, 14:40:11 EST] {__init__.py:48} DEBUG - Did not import default integration sentry_sdk.integrations.pyramid.PyramidIntegration: Pyramid not installed
[2024-01-04, 14:40:11 EST] {__init__.py:48} DEBUG - Did not import default integration sentry_sdk.integrations.rq.RqIntegration: RQ not installed
[2024-01-04, 14:40:11 EST] {__init__.py:48} DEBUG - Did not import default integration sentry_sdk.integrations.sanic.SanicIntegration: Sanic not installed
[2024-01-04, 14:40:11 EST] {__init__.py:48} DEBUG - Did not import default integration sentry_sdk.integrations.starlette.StarletteIntegration: Starlette is not installed
[2024-01-04, 14:40:11 EST] {__init__.py:164} DEBUG - Enabling integration logging
[2024-01-04, 14:40:11 EST] {__init__.py:164} DEBUG - Enabling integration argv
[2024-01-04, 14:40:11 EST] {__init__.py:164} DEBUG - Enabling integration atexit
[2024-01-04, 14:40:11 EST] {__init__.py:164} DEBUG - Enabling integration dedupe
[2024-01-04, 14:40:11 EST] {__init__.py:164} DEBUG - Enabling integration excepthook
[2024-01-04, 14:40:11 EST] {__init__.py:164} DEBUG - Enabling integration modules
[2024-01-04, 14:40:11 EST] {__init__.py:164} DEBUG - Enabling integration stdlib
[2024-01-04, 14:40:11 EST] {__init__.py:164} DEBUG - Enabling integration threading
[2024-01-04, 14:40:11 EST] {__init__.py:164} DEBUG - Enabling integration aiohttp
[2024-01-04, 14:40:11 EST] {__init__.py:164} DEBUG - Enabling integration boto3
[2024-01-04, 14:40:11 EST] {__init__.py:164} DEBUG - Enabling integration bottle
[2024-01-04, 14:40:11 EST] {__init__.py:164} DEBUG - Enabling integration celery
[2024-01-04, 14:40:11 EST] {__init__.py:164} DEBUG - Enabling integration django
[2024-01-04, 14:40:11 EST] {__init__.py:164} DEBUG - Enabling integration flask
[2024-01-04, 14:40:11 EST] {__init__.py:164} DEBUG - Enabling integration httpx
[2024-01-04, 14:40:11 EST] {__init__.py:164} DEBUG - Enabling integration redis
[2024-01-04, 14:40:11 EST] {__init__.py:164} DEBUG - Enabling integration sqlalchemy
[2024-01-04, 14:40:11 EST] {__init__.py:164} DEBUG - Enabling integration tornado
[2024-01-04, 14:40:11 EST] {client.py:282} DEBUG - Setting SDK name to 'sentry.python.django'
[2024-01-04, 14:40:11 EST] {transport.py:421} DEBUG - Sending envelope [envelope with 1 items (check_in)] project:<project id> host:<sentry ingest host>
The logs before disabling the integrations, success checkin:
[2024-01-04, 14:42:00 EST] {__init__.py:113} DEBUG - Setting up integrations (with default = True)
[2024-01-04, 14:42:00 EST] {__init__.py:48} DEBUG - Did not import default integration sentry_sdk.integrations.falcon.FalconIntegration: Falcon not installed
[2024-01-04, 14:42:00 EST] {__init__.py:48} DEBUG - Did not import default integration sentry_sdk.integrations.fastapi.FastApiIntegration: Starlette is not installed
[2024-01-04, 14:42:00 EST] {__init__.py:48} DEBUG - Did not import default integration sentry_sdk.integrations.pyramid.PyramidIntegration: Pyramid not installed
[2024-01-04, 14:42:00 EST] {__init__.py:48} DEBUG - Did not import default integration sentry_sdk.integrations.rq.RqIntegration: RQ not installed
[2024-01-04, 14:42:00 EST] {__init__.py:48} DEBUG - Did not import default integration sentry_sdk.integrations.sanic.SanicIntegration: Sanic not installed
[2024-01-04, 14:42:00 EST] {__init__.py:48} DEBUG - Did not import default integration sentry_sdk.integrations.starlette.StarletteIntegration: Starlette is not installed
[2024-01-04, 14:42:00 EST] {__init__.py:164} DEBUG - Enabling integration logging
[2024-01-04, 14:42:00 EST] {__init__.py:164} DEBUG - Enabling integration argv
[2024-01-04, 14:42:00 EST] {__init__.py:164} DEBUG - Enabling integration atexit
[2024-01-04, 14:42:00 EST] {__init__.py:164} DEBUG - Enabling integration dedupe
[2024-01-04, 14:42:00 EST] {__init__.py:164} DEBUG - Enabling integration excepthook
[2024-01-04, 14:42:00 EST] {__init__.py:164} DEBUG - Enabling integration modules
[2024-01-04, 14:42:00 EST] {__init__.py:164} DEBUG - Enabling integration stdlib
[2024-01-04, 14:42:00 EST] {__init__.py:164} DEBUG - Enabling integration threading
[2024-01-04, 14:42:00 EST] {__init__.py:164} DEBUG - Enabling integration aiohttp
[2024-01-04, 14:42:00 EST] {__init__.py:164} DEBUG - Enabling integration boto3
[2024-01-04, 14:42:00 EST] {__init__.py:164} DEBUG - Enabling integration bottle
[2024-01-04, 14:42:00 EST] {__init__.py:164} DEBUG - Enabling integration celery
[2024-01-04, 14:42:00 EST] {__init__.py:164} DEBUG - Enabling integration django
[2024-01-04, 14:42:00 EST] {__init__.py:164} DEBUG - Enabling integration flask
[2024-01-04, 14:42:00 EST] {__init__.py:164} DEBUG - Enabling integration httpx
[2024-01-04, 14:42:00 EST] {__init__.py:164} DEBUG - Enabling integration redis
[2024-01-04, 14:42:00 EST] {__init__.py:164} DEBUG - Enabling integration sqlalchemy
[2024-01-04, 14:42:00 EST] {__init__.py:164} DEBUG - Enabling integration tornado
[2024-01-04, 14:42:00 EST] {client.py:282} DEBUG - Setting SDK name to 'sentry.python.django'
[2024-01-04, 14:42:00 EST] {transport.py:421} DEBUG - Sending envelope [envelope with 1 items (check_in)] project:<project id> host:<sentry ingest host>
The logs after disabling the integrations, init checkin:
[2024-01-04, 16:50:11 EST] {__init__.py:113} DEBUG - Setting up integrations (with default = True)
[2024-01-04, 16:50:11 EST] {__init__.py:164} DEBUG - Enabling integration logging
[2024-01-04, 16:50:11 EST] {__init__.py:164} DEBUG - Enabling integration argv
[2024-01-04, 16:50:11 EST] {__init__.py:164} DEBUG - Enabling integration atexit
[2024-01-04, 16:50:11 EST] {__init__.py:164} DEBUG - Enabling integration dedupe
[2024-01-04, 16:50:11 EST] {__init__.py:164} DEBUG - Enabling integration excepthook
[2024-01-04, 16:50:11 EST] {__init__.py:164} DEBUG - Enabling integration modules
[2024-01-04, 16:50:11 EST] {__init__.py:164} DEBUG - Enabling integration stdlib
[2024-01-04, 16:50:11 EST] {__init__.py:164} DEBUG - Enabling integration threading
[2024-01-04, 16:50:11 EST] {client.py:282} DEBUG - Setting SDK name to 'sentry.python'
[2024-01-04, 16:50:11 EST] {transport.py:421} DEBUG - Sending envelope [envelope with 1 items (check_in)] project:<project id> host:<sentry ingest host>
The logs after disabling the integrations, success checkin:
[2024-01-04, 16:52:31 EST] {__init__.py:113} DEBUG - Setting up integrations (with default = True)
[2024-01-04, 16:52:31 EST] {__init__.py:164} DEBUG - Enabling integration logging
[2024-01-04, 16:52:31 EST] {__init__.py:164} DEBUG - Enabling integration argv
[2024-01-04, 16:52:31 EST] {__init__.py:164} DEBUG - Enabling integration atexit
[2024-01-04, 16:52:31 EST] {__init__.py:164} DEBUG - Enabling integration dedupe
[2024-01-04, 16:52:31 EST] {__init__.py:164} DEBUG - Enabling integration excepthook
[2024-01-04, 16:52:31 EST] {__init__.py:164} DEBUG - Enabling integration modules
[2024-01-04, 16:52:31 EST] {__init__.py:164} DEBUG - Enabling integration stdlib
[2024-01-04, 16:52:31 EST] {__init__.py:164} DEBUG - Enabling integration threading
[2024-01-04, 16:52:31 EST] {client.py:282} DEBUG - Setting SDK name to 'sentry.python'
[2024-01-04, 16:52:31 EST] {transport.py:421} DEBUG - Sending envelope [envelope with 1 items (check_in)] project:<project id> host:<sentry ingest host>
This all looks good. So the init check-in (setting status "in_progress") and the success check-in (setting status "ok") are sent to Sentry every time, at the right time, is that correct? But some of the check-ins are still marked as failed. Also, the monitor_slug and check_in_id need to be the same for the initial check-in and the ok check-in, but I guess you have checked that already. Are there any errors during execution of the tasks?
Everything is correct:
- The jobs are running every 10 minutes without failures.
- Monitor slug and project id are correct and hardcoded.
- Based on the logs, check-ins are sent to Sentry every time, but both init and ok may be missing sometimes.
- I just double-checked, and the logs are identical between the runs where Sentry shows that everything is good and the runs where something went wrong. The only difference is check_in_id: it is different for different runs, but it is the same within one run, even for the failed runs.
- There are no errors in the logs for the check-in task.
Is there something else I should try? Maybe use the HTTP API instead of sentry-python?
Hey @IevgeniiB thanks for all the info, I asked our server side guys if they have an idea what can cause this.
Hi @antonpirker, is it possible that this is related to the fix in #2598? I don't have enough knowledge about Sentry to say so myself. But I noticed that apparently there hasn't been a release that contains this fix yet.
Hey @SoerenWeber, I don't think this is related to https://github.com/getsentry/sentry-python/pull/2598 since if the type was wrong, I'd expect no check-ins at all. But the issue here appears to be intermittent.
Thanks, @sentrivana, that's a good point. I made this connection because I see a similar behavior on my end. In my case, the behavior seems to be correlated with the amount of time the job takes to finish. If it's less than the grace period it's counted as successful in Sentry, otherwise a missed check-in is reported. With varying workload, this also looks like intermittent missing check-ins. But this might be a completely separate issue, of course.
Hey @IevgeniiB and @SoerenWeber, thanks for reaching out. Could you please share the affected monitor URLs with us so we can further investigate on our end? You can email it directly to us at [email protected]. Thanks.
@gaprl thank you for looking into it, I've sent the URL now. Please let me know if this is not the URL you're looking for or if you need something else.
@IevgeniiB maybe this is related to this: https://github.com/getsentry/sentry-python/issues/2613
tl;dr: After a certain time the TCP connection is closed. The proposed solution was to create a custom transport that increases the TCP keepalive, as mentioned here: https://github.com/getsentry/sentry-python/issues/1198#issuecomment-1802638463
Maybe this can also help you?
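The workaround in that comment boils down to a custom transport that enables TCP keepalive on the SDK's connection pool; roughly (a sketch based on the linked comment, with illustrative keepalive values that assume Linux):

import socket

import sentry_sdk
from sentry_sdk.transport import HttpTransport

KEEP_ALIVE_SOCKET_OPTIONS = [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
    (socket.SOL_TCP, socket.TCP_KEEPIDLE, 45),
    (socket.SOL_TCP, socket.TCP_KEEPINTVL, 10),
    (socket.SOL_TCP, socket.TCP_KEEPCNT, 6),
]


class KeepAliveHttpTransport(HttpTransport):
    def _get_pool_options(self, *args, **kwargs):
        # Pass keepalive socket options through to the underlying urllib3 pool.
        options = super()._get_pool_options(*args, **kwargs)
        options["socket_options"] = KEEP_ALIVE_SOCKET_OPTIONS
        return options


sentry_sdk.init(dsn=MY_PROJECT, transport=KeepAliveHttpTransport)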
@antonpirker thank you for the suggestion! I tested it, and it doesn't change anything for me. I don't think it's related to the https://github.com/getsentry/sentry-python/issues/2613 issue; I do the check-in right after initializing Sentry (for every check-in). This task also runs in a new pod every time.
Thanks for following up @IevgeniiB -- I don't have any additional suggestions from the SDK side of things, but our crons folks are taking a look.
Thank you, @sentrivana! Please let me know if I can provide more information to help debug this issue.
Just a quick note: On my side, the behavior stopped after upgrading to version 1.39.2. (Bear in mind that we might be experiencing different issues here.)
Thanks for following up @SoerenWeber, glad that upgrading has fixed the issue for you! We might be dealing with two different issues but in any case worth a try -- @IevgeniiB could you see if upgrading to 1.39.2 changes anything?
@sentrivana hi! Unfortunately, it didn't improve the stability of my monitor... Thank you for the suggestion!
We've also been experiencing this since we started using cron monitors. All our monitors are auto-instrumented via Celery/Beat/Django. There is no other indication that the tasks are actually failing; the error appears to be only in the monitoring itself.
I have mostly been ignoring this and muted all alerts, but since the trial period is over I figured I should report some details and see if that helps you fix this. As it stands, the Crons service isn't providing much value for us since it cries wolf all the time, and I might be tempted to disable it rather than pay for that, but before doing so I should try to help you debug if I can. Let me know if I can provide more details.
System
- Python 3.10.4
- Django==4.2.10
- celery==5.3.6
- django-celery-beat==2.5.0
- sentry-sdk==1.40.5
The app lives in a k8s cluster with the webapp and workers as separate pods that can scale and redeploy independently.
I did upgrade sentry-sdk from 1.36 to 1.40 after I read this thread some weeks ago. In the 1 month screenshot below you can see that this changed the behavior slightly, as we stopped getting the yellow "missed" statuses after that. We still get some red "timed out" statuses and some conspicuous holes where there ought to be a status.
Screenshots
24h:
1 month:
@mathiasose Thanks for helping us debug, this is much appreciated!
The failure rates look bad indeed -- how many of those are expected failures and how many are the buggy timeouts? I don't need specific numbers, just a rough idea to understand the scope of the issue.
What schedules and cron durations are we looking at in the two screenshots? Do you see any difference in the failure rates based on the schedule/duration or does it look roughly the same?
Additionally, could you enable debug=True in the SDK init and see whether the check-ins that are later marked as timed out in Sentry are sent from the SDK? The log should say something like this:
[2024-01-04, 16:50:11 EST] {transport.py:421} DEBUG - Sending envelope [envelope with 1 items (check_in)] project:<project id> host:<sentry ingest host>
If this is a network issue (which seems like the most likely explanation to me atm), adjusting the socket options might make a difference. Could you try upgrading to at least 1.41.0 and try out the second snippet from the release notes?
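That snippet, roughly, passes keepalive socket options to init; a sketch with illustrative values (the exact numbers are examples, and the TCP_KEEP* constants assume Linux):

import socket

from urllib3.connection import HTTPConnection

import sentry_sdk

sentry_sdk.init(
    dsn=MY_PROJECT,
    # socket_options is available starting with sentry-sdk 1.41.0.
    socket_options=HTTPConnection.default_socket_options + [
        (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
        (socket.SOL_TCP, socket.TCP_KEEPIDLE, 45),
        (socket.SOL_TCP, socket.TCP_KEEPINTVL, 10),
        (socket.SOL_TCP, socket.TCP_KEEPCNT, 6),
    ],
)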
In our case I would expect more or less 100% green everywhere - I could see a few tasks being missed in cases where they are scheduled to run exactly when we deploy a new revision of the app or something, but mostly all the tasks seem to be running as they should, and cron monitoring is just mistaken in reporting errors.
I created a debug task today and deployed it to our dev environment only. It runs once per minute, sleeps for 10 seconds during execution, then returns.
I deployed it with debug=True on sentry-sdk 1.40 first, and we had three misses that you can see in the screenshot. Then I deployed the bump to 1.41 with the socket options, and that seems to have improved things somewhat, but we did get another miss later.
Each miss seems to be accompanied by an exception during the send_envelope request. I notice there is some NewRelic instrumentation affecting the HTTP requests; I will have to dig a little to see whether that could have anything to do with this.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.10/site-packages/newrelic/hooks/external_urllib3.py", line 32, in _nr_wrapper_make_request_
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 467, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 462, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.10/site-packages/sentry_sdk/integrations/stdlib.py", line 126, in getresponse
    return real_getresponse(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/newrelic/hooks/external_httplib.py", line 77, in httplib_getresponse_wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/http/client.py", line 1374, in getresponse
    response.begin()
  File "/usr/local/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.10/http/client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/sentry_sdk/transport.py", line 535, in send_envelope_wrapper
    self._send_envelope(envelope)
  File "/usr/local/lib/python3.10/site-packages/sentry_sdk/transport.py", line 434, in _send_envelope
    self._send_request(
  File "/usr/local/lib/python3.10/site-packages/sentry_sdk/transport.py", line 245, in _send_request
    response = self._pool.request(
  File "/usr/local/lib/python3.10/site-packages/urllib3/request.py", line 81, in request
    return self.request_encode_body(
  File "/usr/local/lib/python3.10/site-packages/urllib3/request.py", line 173, in request_encode_body
    return self.urlopen(method, url, **extra_kw)
  File "/usr/local/lib/python3.10/site-packages/urllib3/poolmanager.py", line 376, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 799, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.10/site-packages/newrelic/hooks/external_urllib3.py", line 32, in _nr_wrapper_make_request_
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 467, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 462, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.10/site-packages/sentry_sdk/integrations/stdlib.py", line 126, in getresponse
    return real_getresponse(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/newrelic/hooks/external_httplib.py", line 77, in httplib_getresponse_wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/http/client.py", line 1374, in getresponse
    response.begin()
  File "/usr/local/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.10/http/client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Update: It does seem much more stable on the dev environment after the 1.41 upgrade with the socket options, so I will be making that upgrade to our staging and prod envs as well.
24h view:
The only task that had errors in that environment post-bump was the debug task that I added, so if not for that I would have read this as the problem being 100% solved. I guess the error happens much less often with these connection options, and the high frequency of the debug task is what reveals it, but it would eventually happen to other tasks as well.
Thanks @mathiasose. I see two follow-ups for us here: making it easier to enable the alternative connection options (just a single option, something like enable_keepalive, instead of having to tweak the socket options directly), and investigating further with our server folks whether there's something we can do to improve the out-of-the-box experience.
Now that we've had the new settings applied for a few days I wanted to confirm here that things seem to work much better now :pray:
7 day view:
There have been a couple of hiccups that I still don't completely understand (especially the longish period of red in dev and production about 6 days ago), but as long as this stays pretty rare, Crons is now a much more helpful tool for us :+1:
These errors might be legitimate application errors that we need to investigate, and now they're not drowning in false alerts. The last row in the screenshot, for example, was red because of a database configuration issue, and Crons reported the failure accurately, which pointed us towards the issue; we made a fix and got the monitor back to green.
Awesome @mathiasose, thanks for following up.
Starting with 1.43.0 you can swap the socket_options for a simple keep_alive=True. It applies the same socket settings in the background.
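A minimal sketch of that form:

import sentry_sdk

sentry_sdk.init(
    dsn="...",
    keep_alive=True,  # sentry-sdk >= 1.43.0; applies the keepalive socket options for you
)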
If you find out more about the remaining hiccups please let us know if they also look like SDK/network issues.