
feat(aci): Add error DetectorGroup chunked backfill task and method

Open · kcons opened this pull request 1 month ago · 4 comments

The existing DetectorGroup backfill job is impractically slow. This adds a function (intended to be triggered by a job) that produces roughly equal-sized ranges of IDs from the Projects table; each range is then used to trigger a new task that backfills the projects in that range.

This distributes all of the slow work into chunks whose size we can control, and the processing pool used to execute them can be dialed up gradually as we gain confidence in correctness and capacity cost. The expectation is that the backfill can then finish completely in a day or so without blocking other jobs or requiring hand-holding.
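The range-splitting idea above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation; the function name and parameters are hypothetical:

```python
def id_ranges(min_id: int, max_id: int, num_chunks: int) -> list[tuple[int, int]]:
    """Split the inclusive ID span [min_id, max_id] into roughly equal
    (start, end) ranges, one per chunk, covering the whole span."""
    span = max_id - min_id + 1
    step = -(-span // num_chunks)  # ceiling division so no IDs are dropped
    ranges = []
    start = min_id
    while start <= max_id:
        end = min(start + step - 1, max_id)
        ranges.append((start, end))
        start = end + 1
    return ranges

# Example: split IDs 1..100 into 4 equal ranges
# id_ranges(1, 100, 4) -> [(1, 25), (26, 50), (51, 75), (76, 100)]
```

Each resulting range would then be handed to one backfill task, so chunk size (and hence per-task work) is controlled by the number of chunks requested.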

kcons · Dec 04 '25 05:12

Codecov Report

✅ All modified and coverable lines are covered by tests. ✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #104377   +/-   ##
========================================
  Coverage   80.57%    80.57%           
========================================
  Files        9345      9345           
  Lines      399518    399518           
  Branches    25600     25600           
========================================
  Hits       321894    321894           
  Misses      77171     77171           
  Partials      453       453           

codecov[bot] · Dec 04 '25 05:12

> I don't know too much about this task, but is there any reason we can't use RangeQuerySetWrapper to iterate all projects and fire a task per project, or chunk of projects?
>
> Similar to
>
> https://github.com/getsentry/sentry/blob/f8a9b059236b18dd8892820ce8429592998ada34/src/sentry/tasks/weekly_escalating_forecast.py#L62-L74

Unless I'm misunderstanding the question, that is roughly what we're trying to set up here. get_project_id_ranges_for_backfill is intended to be run from a job to pick project ID ranges to trigger backfill_error_detector_groups with, and that task processes the detectors for that chunk of projects. I was initially doing a task per project, but chunking isn't much harder, and it should let us schedule and process an order of magnitude fewer tasks.
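The fan-out being described might look roughly like the sketch below. The real PR uses Sentry's task machinery to enqueue work; here a plain callable stands in for the queue, and the range helper is simplified to fixed-size steps, so everything is illustrative rather than the actual code:

```python
def get_project_id_ranges_for_backfill(min_id: int, max_id: int, chunk_size: int):
    """Yield inclusive (start, end) project-ID ranges covering the span.
    Simplified stand-in for the helper discussed in this PR."""
    start = min_id
    while start <= max_id:
        yield (start, min(start + chunk_size - 1, max_id))
        start += chunk_size

def schedule_backfill(min_id: int, max_id: int, chunk_size: int, enqueue) -> None:
    """Enqueue one chunked task per range; each task would run
    backfill_error_detector_groups for its slice of projects."""
    for start, end in get_project_id_ranges_for_backfill(min_id, max_id, chunk_size):
        enqueue(("backfill_error_detector_groups", start, end))

# Simulate dispatch with a list standing in for the task queue
queue = []
schedule_backfill(1, 2500, 1000, queue.append)
# queue now holds three chunked task payloads:
# (1, 1000), (1001, 2000), (2001, 2500)
```

The point of chunking rather than one task per project: 2,500 projects become 3 tasks instead of 2,500, while queue concurrency still controls how fast the chunks are processed.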

kcons · Dec 05 '25 22:12

> > I don't know too much about this task, but is there any reason we can't use RangeQuerySetWrapper to iterate all projects and fire a task per project, or chunk of projects? Similar to https://github.com/getsentry/sentry/blob/f8a9b059236b18dd8892820ce8429592998ada34/src/sentry/tasks/weekly_escalating_forecast.py#L62-L74
>
> Unless I'm misunderstanding the question, that is roughly what we're trying to set up here. get_project_id_ranges_for_backfill is intended to be run from a Job to pick project ranges to trigger backfill_error_detector_groups with, and that task processes the detectors for this chunk of projects. I was initially doing a task per project, but it's not too much harder to chunk, and chunking should let us schedule and process an order of magnitude fewer tasks.

Right, I was mostly wondering whether we need the custom SQL we have there, or whether we can follow the existing patterns used elsewhere in the codebase. Generally, when I see raw SQL I want to avoid it if possible.

I don't mind too much whether we chunk or do individual tasks. We should be able to control the concurrency of the queue so it shouldn't be too much of a problem either way

wedamija · Dec 05 '25 22:12

> > > I don't know too much about this task, but is there any reason we can't use RangeQuerySetWrapper to iterate all projects and fire a task per project, or chunk of projects? Similar to https://github.com/getsentry/sentry/blob/f8a9b059236b18dd8892820ce8429592998ada34/src/sentry/tasks/weekly_escalating_forecast.py#L62-L74
> >
> > Unless I'm misunderstanding the question, that is roughly what we're trying to set up here. get_project_id_ranges_for_backfill is intended to be run from a Job to pick project ranges to trigger backfill_error_detector_groups with, and that task processes the detectors for this chunk of projects. I was initially doing a task per project, but it's not too much harder to chunk, and chunking should let us schedule and process an order of magnitude fewer tasks.
>
> Right, I was mostly wondering if we needed the custom sql that we have there, or can we follow the existing patterns we use elsewhere in the codebase? Just generally when I see raw sql I want to avoid it if possible.
>
> I don't mind too much whether we chunk or do individual tasks. We should be able to control the concurrency of the queue so it shouldn't be too much of a problem either way

Ah, I gotcha. Yeah, it's not really necessary; it just seemed like an efficient and easy way to chunk the ID space. I can drop the function and plan on having the job chunk in Python; I don't expect the performance difference to be meaningful.
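Chunking in Python instead of SQL could look something like this: consume an ordered stream of project IDs (e.g. from a RangeQuerySetWrapper-style iterator over the Projects table) and emit a (first, last) range per batch. The helper name is hypothetical, and a plain list stands in for the queryset iterator:

```python
from itertools import islice
from typing import Iterable, Iterator

def chunk_id_stream(ids: Iterable[int], chunk_size: int) -> Iterator[tuple[int, int]]:
    """Batch an ordered iterable of project IDs into inclusive
    (first, last) ranges of at most chunk_size IDs each."""
    it = iter(ids)
    while True:
        batch = list(islice(it, chunk_size))
        if not batch:
            return
        # IDs arrive in ascending order, so the batch endpoints
        # bound exactly the IDs in this chunk.
        yield (batch[0], batch[-1])

# Sparse ID spaces are handled naturally: each chunk covers a fixed
# number of actual rows, not a fixed slice of the numeric ID span.
# chunk_id_stream([1, 2, 5, 9, 12], 2) -> (1, 2), (5, 9), (12, 12)
```

A side benefit over dividing the numeric ID span: every chunk contains the same number of real projects, so per-task work stays even when project IDs are sparse.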

kcons · Dec 05 '25 23:12