Fix: Eliminate redundant full table scans in messages and events collection

Open PredictiveManish opened this issue 2 weeks ago • 6 comments

Description

Moved the mapping queries outside the batch loops and passed the pre-built mappings as parameters to the processing functions, following the pattern established in #3439. Solves #3440.

Changes Made

augur/tasks/github/messages.py

  • Built issue_url_to_id_map and pr_issue_url_to_id_map once in collect_github_messages() before any batch processing
  • Updated process_messages() to accept mappings as parameters instead of rebuilding them
  • Updated process_large_issue_and_pr_message_collection() to accept and pass mappings
  • Increased batch size from 20 to 1000 (reduces batch overhead)
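
A minimal sketch of the pattern described above, not the actual augur code: the function bodies, the `(url, id)` row shape, the `build_url_to_id_map` and `batched` helpers, and the `batch_size` parameter are assumptions made for illustration. Only the function names `collect_github_messages` and `process_messages` and the two map names come from the change list.

```python
from itertools import islice


def build_url_to_id_map(rows):
    """Hypothetical stand-in for the single full-table query; rows are (url, id) pairs."""
    return {url: row_id for url, row_id in rows}


def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def process_messages(messages, issue_url_to_id_map, pr_issue_url_to_id_map):
    """Resolve each message to an issue or PR id using the maps passed in."""
    resolved = []
    for message in messages:
        url = message["issue_url"]
        if url in issue_url_to_id_map:
            resolved.append(("issue", issue_url_to_id_map[url]))
        elif url in pr_issue_url_to_id_map:
            resolved.append(("pr", pr_issue_url_to_id_map[url]))
    return resolved


def collect_github_messages(messages, issue_rows, pr_rows, batch_size=1000):
    # Build both maps once, before any batch is processed.
    issue_url_to_id_map = build_url_to_id_map(issue_rows)
    pr_issue_url_to_id_map = build_url_to_id_map(pr_rows)

    resolved = []
    for batch in batched(messages, batch_size):
        # The maps are passed in, so no batch triggers another table scan.
        resolved.extend(
            process_messages(batch, issue_url_to_id_map, pr_issue_url_to_id_map)
        )
    return resolved
```

Passing the maps as arguments keeps `process_messages` free of database access, which also makes it straightforward to test with plain dictionaries.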

augur/tasks/github/events.py

  • Built issue_url_to_id_map and pr_url_to_id_map once in BulkGithubEventCollection.collect() before the batch loop
  • Updated _process_events(), _process_issue_events(), and _process_pr_events() to accept mappings as parameters
  • Removed redundant _get_map_from_*() calls from batch processing methods
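
The same idea for the class-based events path can be sketched as follows. This is an illustration under stated assumptions, not the real `BulkGithubEventCollection`: `FakeTable`, `EventCollectorSketch`, the event dictionaries, and the default batch size of 500 are all hypothetical. The scan counter exists only to make the effect of hoisting the map construction visible.

```python
class FakeTable:
    """Hypothetical in-memory stand-in for a table that counts full scans."""

    def __init__(self, rows):
        self.rows = rows  # list of (url, id) pairs
        self.scan_count = 0

    def url_to_id_map(self):
        self.scan_count += 1  # each call models one full table scan
        return dict(self.rows)


class EventCollectorSketch:
    """Hypothetical collector; the method names mirror those in the change list."""

    def __init__(self, issues, pull_requests, batch_size=500):
        self.issues = issues
        self.pull_requests = pull_requests
        self.batch_size = batch_size  # assumed value, not taken from the PR

    def collect(self, events):
        # Build both maps once, before the batch loop.
        issue_url_to_id_map = self.issues.url_to_id_map()
        pr_url_to_id_map = self.pull_requests.url_to_id_map()

        processed = 0
        for start in range(0, len(events), self.batch_size):
            batch = events[start:start + self.batch_size]
            processed += self._process_events(
                batch, issue_url_to_id_map, pr_url_to_id_map
            )
        return processed

    def _process_events(self, batch, issue_url_to_id_map, pr_url_to_id_map):
        issue_events = [e for e in batch if e["url"] in issue_url_to_id_map]
        pr_events = [e for e in batch if e["url"] in pr_url_to_id_map]
        return (
            self._process_issue_events(issue_events, issue_url_to_id_map)
            + self._process_pr_events(pr_events, pr_url_to_id_map)
        )

    def _process_issue_events(self, events, issue_url_to_id_map):
        # Look up ids from the map that was passed in; no query happens here.
        return len([issue_url_to_id_map[e["url"]] for e in events])

    def _process_pr_events(self, events, pr_url_to_id_map):
        return len([pr_url_to_id_map[e["url"]] for e in events])
```

With this shape, each table's `scan_count` stays at 1 after `collect()` returns, regardless of how many batches were processed.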

Performance Improvement

  • Before: 1,000 messages -> 50 full scans of issues AND PRs tables
  • After: 1,000 messages -> 1 full scan of each table (50x reduction)
  • Before: 10,000 events -> 40 full scans total
  • After: 10,000 events -> 1 full scan of each table (40x reduction)
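
The messages figure follows from the old batch size: 1,000 messages in batches of 20 is 50 batches, and each batch rebuilt the maps. A small sketch of that arithmetic, where `scans_per_table` is a hypothetical helper and not part of augur:

```python
import math


def scans_per_table(item_count, batch_size, rebuild_per_batch):
    """Number of full scans of one table for a given collection run."""
    if rebuild_per_batch:
        # One scan per batch when the map is rebuilt inside the loop.
        return math.ceil(item_count / batch_size)
    # One scan in total when the map is built before the loop.
    return 1


before = scans_per_table(1000, 20, rebuild_per_batch=True)
after = scans_per_table(1000, 20, rebuild_per_batch=False)
print(before, after, before // after)  # prints "50 1 50"
```

The events case is not modeled here because the description does not state the batch size used for events.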

This PR fixes #3440

Notes for Reviewers

Signed commits

  • [x] Yes, I signed my commits.

PredictiveManish avatar Dec 06 '25 17:12 PredictiveManish

I just have a non-blocking code quality suggestion: use a named constant like MESSAGE_BATCH_SIZE = 500 instead of a magic number.

shlokgilda avatar Dec 07 '25 06:12 shlokgilda

Yep, agreed

MoralCode avatar Dec 07 '25 18:12 MoralCode

Which size should I use, 500 or 200? @MoralCode suggested 200.

PredictiveManish avatar Dec 08 '25 16:12 PredictiveManish

200

sgoggins avatar Dec 09 '25 22:12 sgoggins

@MoralCode / @PredictiveManish: I'm rerunning the failed end-to-end test. Sometimes GitHub gets overwhelmed and the tests just time out.

sgoggins avatar Dec 09 '25 22:12 sgoggins

May we assume you ran this locally and collected data?

Let's not assume. I would rather accidentally over-test than not test at all.

MoralCode avatar Dec 11 '25 14:12 MoralCode