Full table scans on every batch in messages and events collection
Description:
Both messages.py and events.py rebuild issue/PR URL mappings on every batch instead of once at the start. This causes redundant full table scans that scale with data volume.
messages.py:156-164 - queries all issues and PRs per 20-message batch:
issues = augur_db.session.query(Issue).filter(Issue.repo_id == repo_id).all()
prs = augur_db.session.query(PullRequest).filter(PullRequest.repo_id == repo_id).all()
events.py:237-253 - rebuilds mappings per 500-event batch via _get_map_from_* methods.
Impact:
- 1000 messages -> 50 batches -> 50 full scans each of the issues AND PRs tables
- 10000 events -> 20 batches -> 40 full scans total (issues and PRs)
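The arithmetic behind these numbers, as a rough sketch. The helper name full_scans is made up for illustration, and it assumes every batch re-queries both tables (issues and PRs):

```python
import math

def full_scans(n_items: int, batch_size: int, tables_per_batch: int = 2) -> int:
    """Count full table scans when every batch re-queries each table."""
    batches = math.ceil(n_items / batch_size)
    return batches * tables_per_batch

# 1000 messages in batches of 20: 50 batches, each scanning issues and PRs
messages_scans = full_scans(1000, 20)
# 10000 events in batches of 500: 20 batches, each scanning issues and PRs
events_scans = full_scans(10000, 500)
```

With the mappings built once up front, both figures would drop to 2 (one scan per table), independent of the number of batches.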
Expected behavior: Build mappings once before the batch loop, pass as parameters. See #3439 for the expected pattern.
Suggested fix:
- Move mapping queries outside batch loops
- Build issue_url_to_id_map and pr_url_to_id_map once
- Pass mappings to processing functions
- Possibly also increase messages batch size from 20 to 1000 (unless this was an intentional design choice that wasn't documented in the code)
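A minimal sketch of the suggested shape, not the actual Augur code. It uses plain Python stand-ins instead of the SQLAlchemy session, and every name other than issue_url_to_id_map and pr_url_to_id_map (chunked, build_url_to_id_map, process_batch, collect_messages, fetch_issues, fetch_prs, the "parent_url" key) is hypothetical:

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple

Row = Tuple[str, int]  # (url, primary key); stand-in for an Issue/PullRequest row

def chunked(items: List[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def build_url_to_id_map(rows: Iterable[Row]) -> Dict[str, int]:
    """Turn (url, id) rows into a lookup dict."""
    return {url: row_id for url, row_id in rows}

def process_batch(
    batch: List[dict],
    issue_url_to_id_map: Dict[str, int],
    pr_url_to_id_map: Dict[str, int],
) -> List[Tuple[dict, Optional[int]]]:
    """Resolve each message's parent using the prebuilt maps; no queries here."""
    resolved = []
    for msg in batch:
        url = msg["parent_url"]
        parent_id = issue_url_to_id_map.get(url)
        if parent_id is None:
            parent_id = pr_url_to_id_map.get(url)
        resolved.append((msg, parent_id))
    return resolved

def collect_messages(
    messages: List[dict],
    fetch_issues: Callable[[], Iterable[Row]],
    fetch_prs: Callable[[], Iterable[Row]],
    batch_size: int = 1000,
) -> List[Tuple[dict, Optional[int]]]:
    # One scan per table, before the batch loop
    issue_url_to_id_map = build_url_to_id_map(fetch_issues())
    pr_url_to_id_map = build_url_to_id_map(fetch_prs())

    results = []
    for batch in chunked(messages, batch_size):
        # The maps are passed in rather than rebuilt per batch
        results.extend(process_batch(batch, issue_url_to_id_map, pr_url_to_id_map))
    return results
```

In the real tasks, fetch_issues and fetch_prs would correspond to the existing per-repo queries, just hoisted above the loop. The trade-off is that both maps stay in memory for the whole run, which is presumably fine since they are already fully loaded on every batch today.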
Files:
- augur/tasks/github/messages.py
- augur/tasks/github/events.py
I haven't noticed this directly, but it sounds like a perfect explanation for something I've observed: certain parts of collection start out really fast and slow down over time.