sentry-ruby
sentry-ruby copied to clipboard
sentry-sidekiq reports exceptions for jobs with `dead: false` when `report_after_job_retries` is true
Issue Description
We have set up sentry-sidekiq with report_after_job_retries = true.
We have a job which we want to retry, but where we don't care if it ultimately fails, so we have configured that job with sidekiq_options dead: false to prevent it from going to the sidekiq dead set.
Exceptions from this job do not hit sentry while it is retrying (good), but we do still get a sentry exception for a dead job when it exhausts its retries (bad), even though the job never actually hits the dead set.
Arguably report_after_job_retries is behaving as intended, since the job has exhausted its retries in this case; I can see the case that this is really a feature request for a new flag (only_report_dead_jobs or something). But IMO the fact that the exception comes in to Sentry with SidekiqJob.dead = True is a bug, because the job is very definitely not hitting the dead set.
Reproduction Steps
- configure sentry-sidekiq with
report_after_job_retries = true - create a sidekiq job with
dead: falsethat raises an exception - execute that job, let it exhaust its retries
- see the sentry exception about a dead job
Expected Behavior
As discussed in the issue description, there are a couple of possible behaviours here that could make sense, but I don't think the current one does.
Actual Behavior
sentry-sidekiq sends dead-job exceptions for jobs that skip the dead set
Ruby Version
3.2.5
SDK Version
5.18.1
Integration and Its Version
Sidekiq 7.3.0
Sentry Config
No response
@eapache-opslevel will take a look at this when I have time. Do you have a suggested fix since you've already given it some thought?
I have a couple of thoughts here.
The simplest change would be to add another condition to the code around https://github.com/getsentry/sentry-ruby/blob/8c1c259f7d288052e90dea15e3aa8b48c5c21c20/sentry-sidekiq/lib/sentry/sidekiq/error_handler.rb#L22-L30 such that no exception is reported when report_after_job_retries is true and the job would skip the dead set, even if it has exhausted its retries. That's the behaviour I had intuitively expected, but it is also a potentially dangerous behavioural change for anybody relying on the current behaviour.
Safer would be to add another config option called something like report_jobs_that_skip_dead_set (defaults to true to maintain current behaviour), where setting it to false would suppress errors that come in from jobs configured with dead: false. Though how this interacts with report_after_job_retries is a potentially interesting question.
Not sure which of these seems more correct to you, or whether there's a third option that gives the best of both worlds.
To keep the semantics simple, I've added a new boolean report_only_dead_jobs that will ignore jobs with dead: false completely.