feat(aci): Add error DetectorGroup chunked backfill task and method
The existing DetectorGroup backfill job is impractically slow. This adds a function (intended to be triggered by a job) that produces roughly equal ranges of IDs in the Projects table; each range will then be used to trigger a new task that backfills the projects in that range.
This distributes all of the slow bits into chunks we can control the size of, and the processing pool used to execute them can be gradually dialed up as we gain confidence in correctness and capacity cost. The expectation is that this should allow backfill to finish completely in a day or so without blocking any jobs or hand-holding.
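The range-producing step described above can be sketched as follows. This is an illustrative, self-contained version, not the PR's actual implementation: the function name, chunk size, and the fact that it works from a known min/max ID are all assumptions.

```python
# Hypothetical sketch of splitting a project ID space into roughly equal,
# inclusive (start, end) ranges, each of which would be handed to one
# backfill task. Names and parameters are illustrative only.
def get_id_ranges(min_id: int, max_id: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return inclusive (start, end) ranges covering min_id..max_id."""
    ranges: list[tuple[int, int]] = []
    start = min_id
    while start <= max_id:
        end = min(start + chunk_size - 1, max_id)
        ranges.append((start, end))
        start = end + 1
    return ranges

print(get_id_ranges(1, 10, 4))  # → [(1, 4), (5, 8), (9, 10)]
```

Because each range is independent, the number of in-flight tasks (and therefore load) can be dialed up simply by letting the queue run more of them concurrently.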
Codecov Report
✅ All modified and coverable lines are covered by tests. ✅ All tests successful. No failed tests found.
Additional details and impacted files
@@           Coverage Diff           @@
##           master  #104377   +/-  ##
========================================
  Coverage    80.57%   80.57%
========================================
  Files         9345     9345
  Lines       399518   399518
  Branches     25600    25600
========================================
  Hits        321894   321894
  Misses       77171    77171
  Partials       453      453
I don't know too much about this task, but is there any reason we can't use RangeQuerySetWrapper to iterate all projects and fire a task per project, or chunk of projects? Similar to
https://github.com/getsentry/sentry/blob/f8a9b059236b18dd8892820ce8429592998ada34/src/sentry/tasks/weekly_escalating_forecast.py#L62-L74
Unless I'm misunderstanding the question, that is roughly what we're trying to set up here.
get_project_id_ranges_for_backfill is intended to be run from a Job to pick project ranges to trigger backfill_error_detector_groups with, and that task processes the detectors for this chunk of projects.
I was initially doing a task per project, but it's not too much harder to chunk, and chunking should let us schedule and process an order of magnitude fewer tasks.
Right, I was mostly wondering if we needed the custom SQL that we have there, or whether we can follow the existing patterns we use elsewhere in the codebase? Just generally, when I see raw SQL I want to avoid it if possible.
I don't mind too much whether we chunk or do individual tasks. We should be able to control the concurrency of the queue, so it shouldn't be too much of a problem either way.
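The pattern the linked weekly_escalating_forecast code follows is roughly: iterate all rows in primary-key order, batch the IDs, and fire one task per batch. A framework-free sketch of that idea (the iterator, batch size, and dispatch list here are stand-ins for RangeQuerySetWrapper and a real task's delay call):

```python
# Stand-in for the iterate-and-batch pattern: walk project IDs in order,
# group them into fixed-size batches, and "dispatch" one task per batch.
from itertools import islice
from typing import Iterable, Iterator

def batched(ids: Iterable[int], size: int) -> Iterator[list[int]]:
    it = iter(ids)
    while batch := list(islice(it, size)):
        yield batch

dispatched = []
for batch in batched(range(1, 8), 3):  # stand-in for iterating all project IDs
    dispatched.append(batch)           # stand-in for backfill_task.delay(batch)

print(dispatched)  # → [[1, 2, 3], [4, 5, 6], [7]]
```

This avoids any raw SQL: the chunk boundaries fall out of iteration order rather than a quantile query against the table.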
Ah, I gotcha. Yeah, it's not really necessary. It just seemed like an efficient and easy way to chunk the ID space. I can just drop the function and plan on having the job chunk in Python; I don't expect the perf difference to be meaningful.
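Chunking in Python instead of SQL is straightforward, and chunking fetched IDs by count (rather than by ID range) even handles a sparse ID space evenly. A minimal sketch, with a made-up helper name and sample IDs:

```python
# Hypothetical count-based chunking over a fetched ID list. Unlike
# range-based chunks, every chunk here holds the same number of real rows
# even when large ID gaps exist (e.g. after deletions).
def chunk_ids(ids: list[int], per_chunk: int) -> list[list[int]]:
    return [ids[i:i + per_chunk] for i in range(0, len(ids), per_chunk)]

sparse = [1, 2, 50, 51, 900]  # sparse ID space with gaps
print(chunk_ids(sparse, 2))   # → [[1, 2], [50, 51], [900]]
```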