
[tasks/github] Batch processing for PR review comments collection

Open · shlokgilda opened this pull request 2 weeks ago • 2 comments

Description

  • Adds batched processing to collect_pull_request_review_comments to reduce memory usage
  • Processes comments in batches of 1000 instead of accumulating all in memory before insertion
  • Combines contributor and comment extraction into a single pass (was two separate loops)
  • Extracts shared _flush_contributors helper function for code reuse between PR reviews and PR review comments
  • Adds defensive batch trigger that checks both pr_review_comment_dicts and contributors list sizes to prevent unbounded memory growth in edge cases
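The batching pattern described above can be sketched roughly as follows. This is an illustrative sketch only: names like `paginate_resource` and `flush_batch` are stand-ins, not Augur's actual API, and the extracted fields are hypothetical.

```python
# Hypothetical sketch of the batched, single-pass collection pattern.
# paginate_resource and flush_batch are illustrative stand-ins.
BATCH_SIZE = 1000

def collect_comments_batched(paginate_resource, flush_batch):
    comment_dicts = []
    contributors = []
    for comment in paginate_resource():
        # Single pass: extract both the contributor and the comment record.
        contributors.append(comment["user"])
        comment_dicts.append({"id": comment["id"], "body": comment["body"]})
        # Defensive trigger: flush when EITHER list reaches the batch cap,
        # so neither can grow without bound in edge cases.
        if len(comment_dicts) >= BATCH_SIZE or len(contributors) >= BATCH_SIZE:
            flush_batch(comment_dicts, contributors)
            comment_dicts.clear()
            contributors.clear()
    # Flush any final partial batch.
    if comment_dicts or contributors:
        flush_batch(comment_dicts, contributors)
```

Because the generator is consumed lazily and both accumulators are cleared after each flush, peak memory is bounded by the batch size rather than by the repository's total comment count.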

Dependencies

  • This PR should be merged after #3439 as it builds on that branch

Notes for Reviewers

  • Memory impact: For a repo with many PR review comments, old code loaded all comments into memory via list(github_data_access.paginate_resource(...)). New code streams from the generator and caps batches at ~1000.
  • Follows the same batching pattern used in collect_pull_request_reviews from #3439
  • The _flush_contributors helper is now shared between both PR reviews and PR review comments flush functions for consistency
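To make the memory contrast concrete, here is a minimal before/after sketch. The function names are assumed for illustration and are not Augur's exact code.

```python
# Old approach (illustrative): materialize every page into one list
# before any insertion, so memory scales with the whole repo.
#   all_comments = list(github_data_access.paginate_resource(url))

# New approach: stream from the generator and cap in-memory batches.
def stream_in_batches(pages, batch_size=1000):
    batch = []
    for item in pages:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch
```

The caller then inserts each yielded batch and discards it, so at most one batch (~1000 items) is resident at a time.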

Testing

  • Tested this code with a few larger repos (>50K PRs/issues). Works fine.

Signed commits

  • [x] Yes, I signed my commits.

AI Disclosure: I used Claude Code to write this PR draft and generate docstrings.

shlokgilda avatar Dec 09 '25 19:12 shlokgilda

converting to draft because of

This PR should be merged after https://github.com/chaoss/augur/pull/3439 as it builds on that branch

MoralCode avatar Dec 09 '25 21:12 MoralCode

@MoralCode : I really appreciate the safety move of switching this to draft so somebody (me) doesn't merge these in the wrong order. :)

sgoggins avatar Dec 09 '25 22:12 sgoggins

I ran this against tensorflow/tensorflow (75K+ PRs) and a few other large repos. A few things I validated:

  1. No errors during collection (workers completed successfully)
  2. Database values matched expected counts (compared PR count vs reviews inserted)
  3. Memory usage stayed stable (no gradual climb like before)

The dict comprehension part is actually pretty straightforward: it's just the two-pass loop collapsed into one. Instead of looping through the reviews twice (once for contributors, once for review data), we do both extractions in the same pass.
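The collapse of the two-pass loop can be sketched like this (field names here are hypothetical, chosen just to show the shape of the change):

```python
# Toy data standing in for paginated review records.
reviews = [{"user": "alice", "state": "APPROVED"},
           {"user": "bob", "state": "COMMENTED"}]

# Before: two separate passes over the same data.
contributors = [r["user"] for r in reviews]
review_rows = [{"state": r["state"]} for r in reviews]

# After: both extractions in a single pass.
contributors2, review_rows2 = [], []
for r in reviews:
    contributors2.append(r["user"])
    review_rows2.append({"state": r["state"]})

# Same results, but the single pass pairs naturally with batched
# flushing, since both lists fill and empty together.
assert contributors == contributors2
assert review_rows == review_rows2
```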

shlokgilda avatar Dec 20 '25 16:12 shlokgilda