Fix: Eliminate redundant full table scans in messages and events collection
Description
Moved the mapping queries outside the batch loops and passed pre-built mappings as parameters to the processing functions, following the pattern established in #3439. Solves #3440.
Changes Made
augur/tasks/github/messages.py
- Built `issue_url_to_id_map` and `pr_issue_url_to_id_map` once in `collect_github_messages()` before any batch processing (see the sketch after this list)
- Updated `process_messages()` to accept the mappings as parameters instead of rebuilding them
- Updated `process_large_issue_and_pr_message_collection()` to accept and pass the mappings
- Increased the batch size from 20 to 1000 (reduces batch overhead)
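For context, a minimal sketch of the hoisted-mapping pattern, assuming simplified in-memory rows. The helper `build_url_to_id_map`, the row parameters, and the column names are illustrative assumptions, not the actual Augur implementation:

```python
# Hypothetical, simplified sketch -- function, row, and column names are
# illustrative and do not match the real Augur code exactly.
from typing import Dict, List

MESSAGE_BATCH_SIZE = 1000  # raised from 20 in this PR


def build_url_to_id_map(rows: List[dict], url_key: str, id_key: str) -> Dict[str, int]:
    # One full pass over the table's rows, done exactly once per collection run.
    return {row[url_key]: row[id_key] for row in rows}


def process_messages(batch: List[dict],
                     issue_url_to_id_map: Dict[str, int],
                     pr_issue_url_to_id_map: Dict[str, int]) -> None:
    # The maps arrive as parameters; nothing is re-queried inside the batch.
    for msg in batch:
        issue_id = issue_url_to_id_map.get(msg["issue_url"])
        pr_id = pr_issue_url_to_id_map.get(msg["issue_url"])
        # ... resolve the message to its issue or PR and stage it for insertion


def collect_github_messages(messages: List[dict],
                            issue_rows: List[dict],
                            pr_rows: List[dict]) -> None:
    # Build both lookups once, before any batch processing.
    issue_url_to_id_map = build_url_to_id_map(issue_rows, "issue_url", "issue_id")
    pr_issue_url_to_id_map = build_url_to_id_map(pr_rows, "pr_issue_url", "pull_request_id")

    for start in range(0, len(messages), MESSAGE_BATCH_SIZE):
        batch = messages[start:start + MESSAGE_BATCH_SIZE]
        process_messages(batch, issue_url_to_id_map, pr_issue_url_to_id_map)
```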
augur/tasks/github/events.py
- Built `issue_url_to_id_map` and `pr_url_to_id_map` once in `BulkGithubEventCollection.collect()` before the batch loop (see the sketch after this list)
- Updated `_process_events()`, `_process_issue_events()`, and `_process_pr_events()` to accept the mappings as parameters
- Removed the redundant `_get_map_from_*()` calls from the batch processing methods
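A similarly hedged sketch of the same pattern on the events side; the class and method names follow the list above, but the bodies, column names, and batch size are illustrative assumptions rather than the actual Augur code:

```python
# Hypothetical, simplified sketch of the events-side pattern.
from typing import Dict, List


class BulkGithubEventCollection:
    BATCH_SIZE = 500  # illustrative value

    def collect(self, events: List[dict], issue_rows: List[dict], pr_rows: List[dict]) -> None:
        # Build both lookups once, before the batch loop.
        issue_url_to_id_map = {r["issue_url"]: r["issue_id"] for r in issue_rows}
        pr_url_to_id_map = {r["pr_url"]: r["pull_request_id"] for r in pr_rows}

        for start in range(0, len(events), self.BATCH_SIZE):
            batch = events[start:start + self.BATCH_SIZE]
            self._process_events(batch, issue_url_to_id_map, pr_url_to_id_map)

    def _process_events(self, batch: List[dict],
                        issue_url_to_id_map: Dict[str, int],
                        pr_url_to_id_map: Dict[str, int]) -> None:
        # Split the batch and pass the pre-built maps straight through.
        issue_events = [e for e in batch if "pull_request" not in e]
        pr_events = [e for e in batch if "pull_request" in e]
        self._process_issue_events(issue_events, issue_url_to_id_map)
        self._process_pr_events(pr_events, pr_url_to_id_map)

    def _process_issue_events(self, events: List[dict],
                              issue_url_to_id_map: Dict[str, int]) -> None:
        for event in events:
            issue_id = issue_url_to_id_map.get(event["issue_url"])
            # ... stage the event row with the resolved issue id

    def _process_pr_events(self, events: List[dict],
                           pr_url_to_id_map: Dict[str, int]) -> None:
        for event in events:
            pr_id = pr_url_to_id_map.get(event["pr_url"])
            # ... stage the event row with the resolved pull request id
```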
Performance Improvement
- Before: 1,000 messages -> 50 full scans of the issues AND PRs tables (at the old batch size of 20, 1,000 messages meant 50 batches, each rebuilding both maps)
- After: 1,000 messages -> 1 full scan of each table (50x reduction)
- Before: 10,000 events -> 40 full scans total
- After: 10,000 events -> 1 full scan of each table (40x reduction)
This PR fixes #3440
Notes for Reviewers
Signed commits
- [x] Yes, I signed my commits.
I just have a non-blocking code quality suggestion: use a named constant like `MESSAGE_BATCH_SIZE = 500` instead of a magic number.
Yep, agreed
Which size should we use, 500 or 200? @MoralCode suggested 200.
200
@MoralCode / @PredictiveManish: I'm rerunning the failed end-to-end test. Sometimes GitHub gets overwhelmed and they just time out.
May we assume you ran this locally and collected data?
Let's not assume - I'd rather accidentally over-test than not test at all.