Full table scans on every batch in messages and events collection

Open • shlokgilda opened this issue 2 weeks ago • 1 comment

Description: Both messages.py and events.py rebuild issue/PR URL mappings on every batch instead of once at the start. This causes redundant full table scans that scale with data volume.

messages.py:156-164 - queries all issues and PRs per 20-message batch:

issues = augur_db.session.query(Issue).filter(Issue.repo_id == repo_id).all()
prs = augur_db.session.query(PullRequest).filter(PullRequest.repo_id == repo_id).all()

events.py:237-253 - rebuilds mappings per 500-event batch via _get_map_from_* methods.

Impact:

1000 messages -> 50 batches of 20 -> 50 full scans each of the issues AND PRs tables
10000 events -> 20 batches of 500 -> 40 full scans total
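
For reference, the numbers above follow from the batch sizes quoted in this issue. A quick sketch of the arithmetic, assuming each batch triggers one scan of the issues table and one of the PRs table (`full_scans` is a hypothetical helper, not something in Augur):

```python
import math


def full_scans(total_rows: int, batch_size: int, tables_per_batch: int = 2) -> int:
    """Estimate how many table scans happen when mappings are rebuilt per batch.

    Assumes every batch re-queries `tables_per_batch` tables (issues and PRs).
    """
    batches = math.ceil(total_rows / batch_size)
    return batches * tables_per_batch


messages_scans = full_scans(1000, 20)    # 50 batches, issues + PRs each time
events_scans = full_scans(10000, 500)    # 20 batches, issues + PRs each time
```

With the mappings built once before the loop, both cases would drop to 2 scans regardless of how many rows are collected.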

Expected behavior: Build mappings once before the batch loop, pass as parameters. See #3439 for the expected pattern.

Suggested fix:

Move mapping queries outside batch loops
Build issue_url_to_id_map and pr_url_to_id_map once
Pass mappings to processing functions
Possibly also increase messages batch size from 20 to 1000 (unless this was an intentional design choice that wasn't documented in the code)
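
A rough sketch of what the suggested fix could look like. This is not the actual Augur code: `batched`, `build_url_to_id_map`, `process_batch`, `process_messages`, the loader callables, and the `issue_url` / `id` message fields are all assumptions made for illustration, and the database queries are replaced by plain callables so the example runs on its own.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple

Row = Tuple[str, int]  # (url, primary key); assumed shape of a mapping row


def batched(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def build_url_to_id_map(rows: Iterable[Row]) -> Dict[str, int]:
    """Turn (url, id) rows into a lookup dict."""
    return {url: row_id for url, row_id in rows}


def process_batch(
    batch: List[dict],
    issue_url_to_id_map: Dict[str, int],
    pr_url_to_id_map: Dict[str, int],
) -> List[Tuple[int, Optional[int], Optional[int]]]:
    """Resolve each message to an issue id and/or PR id using prebuilt maps."""
    resolved = []
    for msg in batch:
        url = msg["issue_url"]  # assumed field name
        resolved.append(
            (msg["id"], issue_url_to_id_map.get(url), pr_url_to_id_map.get(url))
        )
    return resolved


def process_messages(
    messages: Iterable[dict],
    load_issue_rows: Callable[[], Iterable[Row]],
    load_pr_rows: Callable[[], Iterable[Row]],
    batch_size: int = 20,
) -> List[Tuple[int, Optional[int], Optional[int]]]:
    # Query each table once, before the batch loop, instead of once per batch.
    issue_url_to_id_map = build_url_to_id_map(load_issue_rows())
    pr_url_to_id_map = build_url_to_id_map(load_pr_rows())

    results = []
    for batch in batched(messages, batch_size):
        # The maps are passed in as parameters; no queries happen in here.
        results.extend(process_batch(batch, issue_url_to_id_map, pr_url_to_id_map))
    return results
```

In the real tasks, the two loader callables would presumably be the existing per-repo issue and PR queries, and the same shape would apply to the event batches in events.py.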

Files:

augur/tasks/github/messages.py
augur/tasks/github/events.py

Dec 04 '25 16:12 shlokgilda

I haven't noticed this, but it sounds like a perfect explanation for some of my observations: certain parts of collection start out really fast and slow down over time.

Dec 04 '25 20:12 MoralCode