Augur "forgets" about a running task
What happens / symptoms:
- a task begins running in augur
- something happens
- rabbitmq logs show the following error (could also be logged as a warning):
2025-10-14 15:37:05.560429+00:00 [error] <0.1187.0> Error on AMQP connection <0.1187.0> (172.18.0.5:48982 -> 172.18.0.3:5672, vhost: 'augur_vhost', user: 'augur', state: running), channel 0:
2025-10-14 15:37:05.560429+00:00 [error] <0.1187.0> operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"
- the task remains in a "started" state in Flower forever; counting Flower tasks shows processed = failed + succeeded - 1
- according to Augur's DB, the task is still running; i.e.
SELECT cs.* FROM augur_operations.collection_status AS cs WHERE (core_status <> 'Success') OR (secondary_status <> 'Success') OR (facade_status <> 'Success')
shows the task as "Collecting"
This seems to affect facade tasks.
So far, all instances of this have been on RabbitMQ 3.12.14, since the fix for #3255 hasn't fully shipped yet.
This discussion on RabbitMQ's repo seems to suggest that something in the application code may be preventing RabbitMQ heartbeats for 2 minutes and causing this. I'm not sure that applies here, because we see this anywhere from a few times to a dozen times a month, so it's probably somewhat intermittent. The logs also don't indicate missing heartbeats.
This Google Groups thread also starts drilling down into the "broker forced connection closure" part: https://groups.google.com/g/rabbitmq-users/c/a_zJuidbHic?pli=1
@sgoggins posted this SQL query in the 10/28 Augur meeting as a band-aid to fix (by marking as collected) facade tasks that may have been forgotten after a RabbitMQ connection issue, i.e. the tasks keep running and the commit counts match between the clone and what GitHub says they should be, but because RabbitMQ forgets about the task, it never gets marked as completed.
Use with caution
WITH counts AS (
SELECT c.repo_id, COUNT(DISTINCT c.cmt_commit_hash)::bigint AS commits_captured
FROM augur_data.commits c
GROUP BY c.repo_id
),
eligible AS (
SELECT a.repo_id
FROM augur_operations.collection_status a
LEFT JOIN counts cnt ON cnt.repo_id = a.repo_id
WHERE a.facade_status = 'Collecting'
AND COALESCE(cnt.commits_captured, 0) >= a.commit_sum
)
UPDATE augur_operations.collection_status cs
SET facade_status = 'Success',
facade_data_last_collected = NOW(),
facade_task_id = NULL
FROM eligible e
WHERE cs.repo_id = e.repo_id
RETURNING cs.repo_id;
I think I found the cause here, and it's also causing #3321.
The issue is in analyze_commits_in_parallel (facade_tasks.py:227-338). This task processes all missing commits in one blocking loop - for large repos, this can take 30+ minutes without yielding control back to Celery. The worker can't send heartbeats during this, so RabbitMQ times out and closes the connection with that "broker forced connection closure" error.
The problem is that when the connection closes, the task stays in "started" state in Flower and the database never gets updated from facade_status = 'Collecting'. The on_failure handler doesn't trigger for connection losses, only for actual exceptions.
What makes this worse: augur_collection_monitor counts repos with facade_status = 'Collecting' to check capacity (max 30). So these orphaned tasks make it think workers are busy, and it refuses to schedule new facade tasks. That's why #3321 happens: a restart runs clean_collection_status(), which clears the orphaned states.
Quick fixes I'd suggest:
- Add heartbeat config in celery_app.py: broker_heartbeat = 120, task_time_limit = 7200
- Chunk the commit processing loop to yield every X commits (maybe 500 or 1000? not sure what the right value is; see the sketch after this list)
- Add a periodic cleanup task (every 15 min) to reset tasks stuck in Collecting for >1 hour
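A minimal sketch of what the first two bullets could look like, purely as a starting point: the Celery() object here is a stand-in for the real app in celery_app.py, missing_commits / analyze_commit / logger are stand-ins for whatever analyze_commits_in_parallel actually iterates over and calls, and the chunk size is a guess, not a tuned value.

from celery import Celery

celery_app = Celery("augur_sketch")      # stand-in for the app in celery_app.py

celery_app.conf.broker_heartbeat = 120   # send AMQP heartbeats every 120 seconds
celery_app.conf.task_time_limit = 7200   # hard-kill any task after 2 hours

CHUNK_SIZE = 500                         # arbitrary guess, not a tested value

def analyze_commits_in_chunks(repo_id, missing_commits, analyze_commit, logger):
    """Process commits in bounded batches instead of one multi-hour loop."""
    commits = list(missing_commits)
    for start in range(0, len(commits), CHUNK_SIZE):
        for commit in commits[start:start + CHUNK_SIZE]:
            analyze_commit(repo_id, commit)
        # The "yield every X commits" idea: between batches the worker is back
        # at a point where it can service heartbeats and report progress.
        logger.info("processed %d/%d commits for repo %d",
                    min(start + CHUNK_SIZE, len(commits)), len(commits), repo_id)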
The startup cleanup at backend.py:393 already has the right logic; we just need it running periodically instead of only at restart (a rough sketch of that follows).
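A rough sketch of that periodic cleanup, reusing the band-aid UPDATE posted earlier in this thread. The Celery app, engine DSN, and schedule key are placeholders; a "stuck for more than an hour" variant would additionally need whichever timestamp Augur records when collection starts, which I'm not sure of.

from celery import Celery
import sqlalchemy as s

celery_app = Celery("augur_sketch")                              # stand-in app
engine = s.create_engine("postgresql://augur@localhost/augur")   # placeholder DSN

@celery_app.task
def reset_orphaned_facade_tasks():
    # Re-run the band-aid query from this thread on a schedule: mark repos as
    # 'Success' when the captured commit count already meets the expected sum.
    with engine.begin() as conn:
        conn.execute(s.text("""
            WITH counts AS (
                SELECT c.repo_id, COUNT(DISTINCT c.cmt_commit_hash)::bigint AS commits_captured
                FROM augur_data.commits c
                GROUP BY c.repo_id
            ),
            eligible AS (
                SELECT a.repo_id
                FROM augur_operations.collection_status a
                LEFT JOIN counts cnt ON cnt.repo_id = a.repo_id
                WHERE a.facade_status = 'Collecting'
                  AND COALESCE(cnt.commits_captured, 0) >= a.commit_sum
            )
            UPDATE augur_operations.collection_status cs
            SET facade_status = 'Success',
                facade_data_last_collected = NOW(),
                facade_task_id = NULL
            FROM eligible e
            WHERE cs.repo_id = e.repo_id
        """))

celery_app.conf.beat_schedule = {
    "reset-orphaned-facade": {
        "task": reset_orphaned_facade_tasks.name,
        "schedule": 15 * 60,   # every 15 minutes
    },
}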
Please correct me if I misunderstood the flow of operations. Happy to discuss further.
for large repos, this can take 30+ minutes without yielding control back to Celery.
This is a really interesting idea that I think could be worth testing. I haven't personally seen this RabbitMQ "forgetting" issue in a while, though.
After re-looking at the code more carefully - by "forgetting" I meant that when RabbitMQ closes the connection due to missing heartbeats, the task might stay in "started" state in Flower while the database stays at facade_status='Collecting'.
I think the issue might be in analyze_commits_in_parallel (facade_tasks.py:300-332). There's a for loop that processes all missing commits without breaking. There's no broker_heartbeat configured in celery_app.py (defaults to 2 seconds), so my theory is that during long git operations in analyze_commit, the worker can't send heartbeats and RabbitMQ closes the connection.
When the connection closes during execution, I think the task chain breaks. Even if the actual facade work completes (commits get collected), the success handler facade_task_success_util() at the end of the chain might not run, so the database doesn't get updated to facade_status='Success'. This could explain the SQL workaround you posted - repos with all commits collected but stuck in 'Collecting' status.
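To make the "chain breaks" idea concrete, here is a toy Celery chain in the same shape. These names only mimic the thread; they are not Augur's real task signatures, and the app object is a stand-in.

from celery import Celery, chain

celery_app = Celery("augur_sketch")   # stand-in for the app in celery_app.py

@celery_app.task
def facade_clone_update(repo_id):
    return repo_id

@celery_app.task
def facade_analyze_commits(repo_id):
    # the long-running commit analysis happens here
    return repo_id

@celery_app.task
def facade_success(repo_id):
    # in Augur this is roughly where facade_task_success_util() would set
    # facade_status = 'Success'; if the chain breaks first, the row stays
    # at 'Collecting' even though the commits were collected
    return repo_id

# Each link feeds its result to the next. If the broker connection is lost
# after facade_analyze_commits finishes but before its result is routed, the
# final link is never dispatched.
chain(facade_clone_update.s(1),
      facade_analyze_commits.s(),
      facade_success.s()).apply_async()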
If you haven't seen this recently though, maybe it's rare enough that it's not worth fixing? Could check if there are any repos currently stuck in facade_status='Collecting' where commits_captured >= commit_sum to see if it's still occurring.
Update: I can confirm this is happening on my system. I have 10 repos stuck in facade_status='Collecting' where actual_commits = expected_commits (work is complete) but facade_data_last_collected IS NULL (success handler never ran).
Example:
https://github.com/apache/commons-codec | Collecting | 3221 expected | 3221 collected | last_collected: NULL
All 10 have task IDs, so they were marked as started. The commits are in the database, so the facade work completed. But they never got marked as 'Success', which I think matches the theory that the success handler facade_task_success_util() at the end of the task chain never executed.
Here's the SQL query I ran. Could someone verify if my query makes sense?
WITH counts AS (
SELECT c.repo_id, COUNT(DISTINCT c.cmt_commit_hash)::bigint AS commits_captured
FROM augur_data.commits c
GROUP BY c.repo_id
)
SELECT
r.repo_git,
a.facade_status,
a.commit_sum as expected_commits,
COALESCE(cnt.commits_captured, 0) as actual_commits,
CASE
WHEN COALESCE(cnt.commits_captured, 0) >= a.commit_sum THEN 'STUCK (work complete)'
ELSE 'Still processing'
END as status,
a.facade_data_last_collected as last_collected,
a.facade_task_id
FROM augur_operations.collection_status a
LEFT JOIN counts cnt ON cnt.repo_id = a.repo_id
LEFT JOIN augur_data.repo r ON r.repo_id = a.repo_id
WHERE a.facade_status = 'Collecting'
ORDER BY COALESCE(cnt.commits_captured, 0) >= a.commit_sum DESC, r.repo_git;
I'm no expert here, but that SQL seems reasonable.
Update: I found several facade tasks that were genuinely hung (not just orphaned by RabbitMQ) that match the blocking issue I described earlier: five stuck tasks had been hung for 12+ hours on git checkout.
All were blocking on reading large pack files (up to 684MB) during git checkout operations in git_repo_updates_facade_task. The subprocess never returned, so these weren't RabbitMQ connection timeouts - they were actual infinite waits on git operations.
This is a different failure mode than the "orphaned tasks" I described above. These tasks never completed their work (stuck mid-operation), whereas the orphaned tasks complete but don't get marked as Success. I manually marked them as Error.
Should we maybe add a timeout parameter to the subprocess.run() calls in git operations? Large pack files seem to cause the checkout operation to hang indefinitely.
I think I found the root cause of the genuinely hung facade tasks (not the RabbitMQ orphaned tasks).
The 5 facade tasks I mentioned earlier that were stuck for 12+ hours on git checkout were experiencing a classic subprocess deadlock caused by improper pipe handling in the Python code.
Evidence:
- Process state analysis showed all 5 git processes with WCHAN=pipe_write (waiting to write to a pipe)
- 0% CPU usage (blocked on I/O, not actively working)
- File descriptor analysis: ls -la /proc/<PID>/fd shows stdout/stderr redirected to pipes with no reader
- Git commands on these repos produce 600+ KB of output when run manually (listing modified files on checkout)
- The Python code uses subprocess.Popen(..., shell=True).wait() without stdout/stderr handling
The deadlock mechanism (a minimal repro sketch follows this list):
- Git checkout produces large output (>64 KB) listing modified files
- The OS pipe buffer fills (typically 64 KB on Linux)
- The git process blocks on write(), waiting for someone to read the pipe
- The Python parent process blocks in .wait(), waiting for git to exit
- Neither can proceed → permanent deadlock
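A minimal, Augur-independent repro of that mechanism. In Augur's case the pipes come from how the worker's stdout/stderr are wired up rather than an explicit stdout=PIPE (per the fd evidence above), but the blocking behavior is the same.

import subprocess

# Minimal repro of the deadlock pattern (not Augur code). The child writes
# ~1 MB to a pipe nobody reads, so once the ~64 KB buffer is full the child
# blocks in write() while the parent blocks in wait(). Running this hangs.
cmd = "python3 -c \"print('x' * 1_000_000)\""

proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
proc.wait()   # never returns: the child is stuck waiting for a reader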
Affected code:
- augur/tasks/git/util/facade_worker/facade_worker/repofetch.py, lines 354-355 and 433-434
- 14 total instances of this pattern across facade git operations
Python documentation confirms this:
"Warning: This will deadlock when using stdout=PIPE and/or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use Popen.communicate() to avoid that."
This is the same issue encountered by RobotFramework: https://github.com/robotframework/robotframework/issues/4173.
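For the call sites that actually need git's output, the communicate()-based route from that warning would look roughly like this; run_git_with_output is a hypothetical helper and the 300-second timeout mirrors the value in the fix below, so both are placeholders.

import subprocess

# Sketch of the Popen.communicate() route for commands whose output we keep.
def run_git_with_output(cmd, timeout=300):
    proc = subprocess.Popen(cmd, shell=True,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            text=True)
    try:
        output, _ = proc.communicate(timeout=timeout)  # drains the pipe, so no deadlock
    except subprocess.TimeoutExpired:
        proc.kill()
        output, _ = proc.communicate()                 # reap the killed child
    return proc.returncode, output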
Proposed fix:
Replace subprocess.Popen().wait() with subprocess.run() using proper output handling:
import subprocess

# For operations that don't need output ("cmd" and "facade_helper" come from
# the surrounding facade worker code):
try:
    result = subprocess.run(
        cmd, shell=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=300, check=False
    )
    return_code = result.returncode
except subprocess.TimeoutExpired:
    facade_helper.log_activity('Error', f'Git operation timed out: {cmd}')
    return_code = -1
This prevents the deadlock by discarding output instead of buffering it, and adds timeout protection. I will experiment with this change on my deployment and report back.
NOTE: Used Claude Code to write this issue.
@shlokgilda I really like this theory - but I'm curious how it would only affect large repos. If we don't plan to actually handle the git output, we should just send it to devnull, so I think that's definitely worth a PR. It just seems strange that, with only a 64 KB buffer, this doesn't affect all repo clones.
edit: I guess 64K in terms of text is still a lot
Yeah! I think this fix works. After making those changes, I am actually seeing progress in my analyze_commits_in_parallel function in my facade tasks. If you think this makes sense, I can open a PR for this.
Edit: Re smaller repos: I think that with smaller repos, even with 500 files, it shouldn't hit the 64 KB buffer. But that's just my assumption since, as you said, 64K in terms of text is still a lot.
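Rough numbers behind that intuition, assuming git prints roughly one 50-character path per modified file (both numbers are guesses):

# Back-of-the-envelope: how many modified-file lines fit in the pipe buffer?
PIPE_BUFFER_BYTES = 64 * 1024   # typical default pipe capacity on Linux
AVG_LINE_BYTES = 50             # assumed average length of a printed path

print(PIPE_BUFFER_BYTES // AVG_LINE_BYTES)  # ~1310 lines before writes can block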