Don't store potentially large in-memory lists of all items in `process_pull_request_commits` and `process_pull_request_files`
I'm observing a memory spike when starting up Augur that is large enough to get Celery OOM-killed (albeit by a userland script I have installed).
After analyzing the logs with the help of generative AI, I was pointed towards several tasks that show up frequently in the logs and use coding patterns that store what could be large volumes of data in memory, for example:
in process_pull_request_commits:
https://github.com/chaoss/augur/blob/main/augur/tasks/github/pull_requests/commits_model/core.py#L47
in process_pull_request_files:
https://github.com/chaoss/augur/blob/main/augur/tasks/github/pull_requests/files_model/core.py#L43
While I'm not certain that this is the cause of the memory issues, I think this code could use refactoring so that rows are batched and inserted into the DB in groups of a predefined size, rather than stored and inserted all at once.
I wonder if this might also help with some of the DB contention identified in #3343.
Another large in-memory list causing actual problems (there's also one a couple dozen lines below this in the same function): https://github.com/chaoss/augur/blob/d704670fc80db2397e6f20912ad89f6376aaf372/augur/tasks/github/facade_github/tasks.py#L255
Hi @MoralCode, I would like to work on this issue as my first contribution to Augur.
I have confirmed that both `process_pull_request_commits` and `process_pull_request_files` accumulate potentially very large in-memory lists before inserting into the database, which can lead to high RAM usage and OOM conditions when processing large repositories.
My proposed fix:
- Refactor these functions to insert rows in batches (e.g., 200-500 items at a time); see the sketch after this list
- Avoid storing the full dataset in memory at once
- Use the existing DB insertion utilities so we don’t break any schema or logging patterns
- Test against a repo with large PRs to confirm memory improvements
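
Roughly the pattern I have in mind (just a sketch; the names below are placeholders rather than Augur's actual helpers, and the real change would reuse whatever insertion utility the task already calls):

```python
# Illustration only: `insert_fn` stands in for whatever DB insertion utility
# the task already uses; none of these names are Augur's actual API.

BATCH_SIZE = 500  # starting point discussed in this issue

def insert_in_batches(records, insert_fn, batch_size=BATCH_SIZE):
    """Flush records to the DB in fixed-size batches instead of
    accumulating one huge list in memory."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            insert_fn(batch)  # write a full batch
            batch = []        # start a fresh list so memory can be reclaimed
    if batch:
        insert_fn(batch)      # write whatever is left over

# Hypothetical usage inside process_pull_request_commits, where
# iter_commit_rows() would yield row dicts one at a time:
#   insert_in_batches(iter_commit_rows(pr), lambda rows: bulk_insert(rows))
```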
Before I proceed, I have two quick questions:
- Is there an existing batch insert helper in Augur that I should reuse, or should I implement a small batching loop manually?
- Is there a preferred batch size for inserts, or should I start with ~200 and make it configurable?
If this approach sounds good, I will start working on a PR.
Thanks!
@Theminacious how much Python experience do you have? This is probably a somewhat more challenging first issue.
I'd recommend starting with just one of the tasks. Once that code passes review, you'll know what to expect if you want to do the other one.
As far as batch size goes, I was thinking something like 500 or maybe 1000, since I suspect this is mostly a problem with large repos.
I probably wouldn't worry about making it configurable for now.
@MoralCode Thanks for the guidance!
I'll start by refactoring just one of the tasks, `process_pull_request_commits`, and keep the batch size around 500 as suggested. Once that passes review and everything looks good, I can apply the same approach to `process_pull_request_files` in a follow-up PR.
I won’t make the batch size configurable for now to keep the initial change focused and easier to review.
I’ll begin working on this and will open a draft PR once I have the batching logic in place. 👍