[tasks] Fix memory issues from large in-memory lists in collection tasks
Description
This PR addresses memory issues when collecting data from repositories with large datasets (10,000+ issues/PRs/contributors). Fixes #3404.
Key Changes:
- Generator pattern for issues: Prevents loading all issues into memory at once
- Batch processing: Insert data in 1000-item batches across all collection tasks
- `.clear()` over reassignment: Reuses list objects instead of creating new ones, reducing GC pressure (see the sketch after this list)
- Move inserts outside loops: In PR reviews, contributors and reviews are already in memory, so batching the final insert is safe and more efficient
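A rough sketch of the generator + batching pattern, for reviewers. The names here (`fetch_issue_pages`, `bulk_insert`, `iter_issues`) are placeholders for illustration, not the actual task helpers in this PR:

```python
# Placeholder helpers: fetch_issue_pages() yields pages of issues from the
# GitHub API; bulk_insert() stands in for the database insert the tasks use.

BATCH_SIZE = 1000  # balances memory usage against database round trips


def iter_issues(fetch_issue_pages):
    """Yield issues one at a time instead of building one giant list."""
    for page in fetch_issue_pages():
        for issue in page:
            yield issue


def collect_issues(fetch_issue_pages, bulk_insert):
    batch = []
    for issue in iter_issues(fetch_issue_pages):
        batch.append(issue)
        if len(batch) >= BATCH_SIZE:
            bulk_insert(batch)
            batch.clear()  # reuse the same list object instead of reassigning
    if batch:  # flush the final partial batch
        bulk_insert(batch)
```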
Notes for Reviewers
- All changes preserve the existing logic; they only optimize for memory efficiency
- Batch size of 1000 balances memory usage vs. database round trips
- PR reviews refactor moves inserts outside the loop: reduces N database operations to 1 bulk insert (safe since `all_pr_reviews` is already in memory); sketched below
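For reference, a minimal sketch of the shape of the PR reviews refactor, again with placeholder names (`pull_requests`, `get_reviews`, `bulk_insert`) standing in for the real task code:

```python
def collect_pr_reviews(pull_requests, get_reviews, bulk_insert):
    all_pr_reviews = []
    for pr in pull_requests:
        # Before: an insert per PR inside this loop -> N database round trips
        all_pr_reviews.extend(get_reviews(pr))
    if all_pr_reviews:
        # After: one bulk insert once everything is in memory -> 1 round trip
        bulk_insert(all_pr_reviews)
```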
Signed commits
- [x] Yes, I signed my commits.
GenAI Disclosure: Claude Code was used to generate this PR draft and review diff changes for logical correctness and potential performance issues.
It's a slightly longer PR, but I think it's important and within scope, since all of these issues were possible causes of OOM exceptions.
@shlokgilda has been thoroughly testing this and has confirmed that the facade workers are flowing and secondary is not memory bottlenecked anymore.
I trust this as far as testing goes and am going to mark this as ready.
I think I was the one who removed that import because pylint suggested it was unused. I haven't personally tested that change, so it's probably worth reverting just in case.
Rebased, fixed the merge conflict with the string fields fix (#3434), and corrected my pylint bug.