[tasks/github] Batch processing for PR review comments collection
Description
- Adds batched processing to `collect_pull_request_review_comments` to reduce memory usage
- Processes comments in batches of 1000 instead of accumulating all in memory before insertion
- Combines contributor and comment extraction into a single pass (was two separate loops)
- Extracts a shared `_flush_contributors` helper function for code reuse between PR reviews and PR review comments
- Adds a defensive batch trigger that checks both the `pr_review_comment_dicts` and `contributors` list sizes to prevent unbounded memory growth in edge cases
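The batched loop described above can be sketched roughly as follows. This is a minimal illustration, not the actual augur code: `BATCH_SIZE`, the insert callbacks, and the dict fields are assumptions; only the names `_flush_contributors`, `pr_review_comment_dicts`, and `contributors` come from the PR description.

```python
# Illustrative sketch of the batched collection loop; the insert
# callbacks stand in for the real database-insertion logic.
BATCH_SIZE = 1000

def _flush_contributors(contributors, insert_contributors):
    """Insert accumulated contributor dicts and clear the list."""
    if contributors:
        insert_contributors(contributors)
        contributors.clear()

def collect_review_comments_batched(comments, insert_comments, insert_contributors):
    pr_review_comment_dicts = []
    contributors = []
    for comment in comments:  # streams from the generator, never materializes all pages
        # single pass: extract both the contributor and the comment dict
        contributors.append({"login": comment["user"]["login"]})
        pr_review_comment_dicts.append({"id": comment["id"], "body": comment["body"]})
        # defensive trigger: flush if either list grows past the batch size
        if len(pr_review_comment_dicts) >= BATCH_SIZE or len(contributors) >= BATCH_SIZE:
            _flush_contributors(contributors, insert_contributors)
            insert_comments(pr_review_comment_dicts)
            pr_review_comment_dicts.clear()
    # final partial batch
    _flush_contributors(contributors, insert_contributors)
    if pr_review_comment_dicts:
        insert_comments(pr_review_comment_dicts)
```

Peak memory is then bounded by roughly one batch of dicts rather than the full comment history of the repo.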
Dependencies
- This PR should be merged after #3439 as it builds on that branch
Notes for Reviewers
- Memory impact: for a repo with many PR review comments, the old code loaded all comments into memory via `list(github_data_access.paginate_resource(...))`. The new code streams from the generator and caps batches at ~1000.
- Follows the same batching pattern used in `collect_pull_request_reviews` from #3439
- The `_flush_contributors` helper is now shared between the PR reviews and PR review comments flush functions for consistency
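The memory difference between the two access patterns can be illustrated with a stand-in generator. `paginate_resource` below is a placeholder for the real `github_data_access.paginate_resource`, which this sketch does not reproduce.

```python
# paginate_resource here is a stand-in generator, not the real
# github_data_access API; it just yields one dict per comment.
def paginate_resource(n):
    for i in range(n):
        yield {"id": i}

# Old pattern: materializes every comment before any insertion (O(n) memory)
def collect_all_at_once(n):
    all_comments = list(paginate_resource(n))
    return len(all_comments)

# New pattern: streams from the generator, holding at most one batch (~1000)
def collect_streaming(n, batch_size=1000):
    batch, flushed = [], 0
    for comment in paginate_resource(n):
        batch.append(comment)
        if len(batch) >= batch_size:
            flushed += len(batch)  # stand-in for the DB insert
            batch.clear()
    return flushed + len(batch)
```

Both process the same number of comments; only the high-water mark of in-memory items differs.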
Testing
- Tested this code against a few larger repos (>50K PRs/issues); it works fine.
Signed commits
- [x] Yes, I signed my commits.
AI Disclosure: I used Claude Code to write this PR draft and generate docstrings.
Converting to draft because this PR should be merged after https://github.com/chaoss/augur/pull/3439, as it builds on that branch.
@MoralCode : I really appreciate the safety move of switching this to draft so somebody (me) doesn't merge these in the wrong order. :)
I ran this against tensorflow/tensorflow (75K+ PRs) and a few other large repos. A few things I validated:
- No errors during collection (workers completed successfully)
- Database values matched expected counts (compared PR count vs reviews inserted)
- Memory usage stayed stable (no gradual climb like before)
The dict comprehension part is actually pretty straightforward - it's just the two-pass loop collapsed into one. Instead of looping through reviews twice (once for contributors, once for review data), we do both extractions in the same pass.
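A minimal before/after of that collapse, with illustrative field names (the real review dicts carry more fields than shown here):

```python
# Sample data standing in for the paginated review objects
reviews = [
    {"id": 1, "body": "LGTM", "user": {"login": "alice"}},
    {"id": 2, "body": "nit",  "user": {"login": "bob"}},
]

# Before: two separate passes over the same data
contributors_old = [r["user"]["login"] for r in reviews]
review_dicts_old = [{"id": r["id"], "body": r["body"]} for r in reviews]

# After: both extractions in a single pass
contributors, review_dicts = [], []
for r in reviews:
    contributors.append(r["user"]["login"])
    review_dicts.append({"id": r["id"], "body": r["body"]})

assert contributors == contributors_old
assert review_dicts == review_dicts_old
```

The output is identical either way; the single pass just avoids iterating the stream twice, which matters once the source is a generator that can only be consumed once.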