ragflow [Bug]: Chat freezes in v0.22.1 - Task Executor deadlock (Pending tasks stuck & DB warning)

Self Checks

[x] I have searched for existing issues search for existing issues, including closed ones.
[x] I confirm that I am using English to submit this report (Language Policy).
[x] Non-english title submitions will be closed directly ( 非英文标题的提交将会被直接关闭 ) (Language Policy).
[x] Please do not modify this template :) and fill in all the required fields.

RAGFlow workspace code commit ID

v0.22.1

RAGFlow image version

v0.22.1

Other environment information

- Host OS: Windows (Docker Desktop)
- Resources: Healthy (CPU < 5%, RAM < 10% confirmed via docker stats)
- LLM Provider: DeepSeek (Model: DeepSeek-V3.2)
- Browser: Chrome/Edge

Actual behavior

The chat interface hangs indefinitely with the "thinking" animation when sending a message. It never returns a response.

Checking the logs reveals that the Task Executor is logically deadlocked:

The "pending" task count in heartbeats remains frozen (e.g., stuck at 29) for over 5 minutes.
The "done" count does not increase.
A database connection warning appears in the logs.

Expected behavior

The chat should respond normally and clear the pending tasks, as it does in v0.22.0.

Steps to reproduce

1. Deploy RAGFlow using Docker image `infiniflow/ragflow:v0.22.1` on Windows.
2. Create a KB and start a chat assistant.
3. Send a query.
4. Observe that the UI hangs.
5. Check `docker logs` and see that `task_executor` reports a constant number of pending tasks without processing them.

Additional information

The chat process will freeze, and the entire browser interface will freeze, as shown in the image.

Nov 22 '25 12:11 chg387387

This is a known issue in RAGFlow: the Task Executor can deadlock or get stuck, especially on Windows with Docker, causing pending tasks to freeze and the UI to hang. This often happens due to concurrency limits being exhausted, resource contention, or backend services (like Redis, MinIO, or Elasticsearch) being unhealthy or misconfigured. There is no built-in UI or API to reset the Task Executor or clear the queue when this happens—recovery typically requires restarting the ragflow-server container or, in some cases, the entire Docker service. Lowering concurrency settings (e.g., MAX_CONCURRENT_TASKS, MAX_CONCURRENT_CHUNK_BUILDERS), increasing Docker resources, and ensuring all backend services are healthy can help reduce the frequency of deadlocks. On Windows, task execution can also get stuck after system sleep, which is only recoverable by restarting Docker. If the issue persists, consider trying the latest nightly Docker image, as it may contain relevant fixes not present in v0.22.1. For more details and troubleshooting steps, see related discussions and solutions and here.

To reply, just mention @dosu.

^{How did I do? Good | Irrelevant | Incorrect | Verbose | Hallucination | Report 🐛 | Other}

Nov 22 '25 12:11 dosubot[bot]

@dosu

Thanks for the suggestion, but I strongly believe this is a regression bug introduced in v0.22.1, not a generic environment issue.

Here is why your analysis might not apply to this case:

Regression Confirmed: I am running on the exact same hardware and configuration. v0.22.0 works perfectly stable, while v0.22.1 deadlocks within minutes. This indicates a code-level change caused the issue, not the environment.

Resources are Healthy: As mentioned in the issue body, I monitored docker stats. CPU usage is under 5% and RAM usage is under 10%. The system is far from hitting any resource limits.

Not a Sleep Issue: The freeze happens during active usage, not after waking the PC from sleep.

Additionally, I noticed a similar report in Issue #11456 ([Question]: my page frequently fail to respond 'Page Not Responding.'), which suggests other users are experiencing similar instability with recent versions.

The logs show the Task Executor is sending heartbeats but failing to consume the queue (Pending count stuck at 29), which points to a logic deadlock in the new version's task handling or DB connection management.

Could a human maintainer please review the task scheduler changes between v0.22.0 and v0.22.1?

Nov 22 '25 12:11 chg387387

Your regression analysis is solid—there’s no evidence of environment or resource issues, and similar deadlocks have been reported by others in recent versions. I did a deep dive into the codebase, PRs, and issues, but found no direct code changes to the task executor, scheduler, or DB connection management between v0.22.0 and v0.22.1 that would obviously cause this deadlock.

However, v0.22.1 introduced a major architectural shift: the task executor and HTTP layer now run fully async (Trio-based workers, Quart routes). This change can introduce subtle deadlocks or resource contention, especially on Windows/Docker, even if the core logic looks unchanged. The executor uses trio semaphores and capacity limiters for concurrency, and if a coroutine hangs or a semaphore isn’t released, pending tasks freeze and the UI becomes unresponsive. Heartbeats may still be sent, but the queue isn’t consumed, matching your symptoms.

I recommend maintainers audit async context handling, semaphore usage, and error propagation in the task executor and related services. Adding granular logging around task acquisition, semaphore acquire/release, and DB/Redis connection handling could help pinpoint where the deadlock occurs. For now, sticking with v0.22.0 or trying the nightly build is the safest workaround.

If you or others can share more detailed logs or patterns, it may help isolate the regression. Automated stress tests for the async executor on Windows/Docker would also be valuable. For more context on known deadlock patterns and recovery steps, see issue #8466 and issue #9466.

To reply, just mention @dosu.

^{How did I do? Good | Irrelevant | Incorrect | Verbose | Hallucination | Report 🐛 | Other}

Nov 22 '25 12:11 dosubot[bot]

bug log.txt @dosu

I have attached the cleaned execution logs (deadlock_log_evidence.txt) which confirm the Async Deadlock hypothesis.

Key observations from the log:

From 20:20:24 to 20:25:54 (over 5 minutes), the 'pending' task count is frozen at 29, and the 'done' count is frozen at 48.

A 'Database connection issue' warning appears at 20:22:03, right in the middle of the freeze.

Private user data has been redacted, but the process state remains intact.

IMPACT / SEVERITY: I must emphasize that this is a CRITICAL BLOCKER. Once this deadlock occurs, the entire RAGFlow instance becomes completely unusable. It is not just a temporary lag; the chat function ceases to work entirely for all users, and the system enters a "zombie state" (process alive but unresponsive). The only recovery method is a hard container restart. This severity renders v0.22.1 practically unusable for any consistent workflow on Windows/Docker. I hope this evidence helps prioritize a hotfix.

Nov 22 '25 13:11 chg387387

Your evidence and analysis are clear—this is a critical regression in v0.22.1, not an environment or resource issue. The logs confirm an async deadlock: the task executor’s pending queue and done count freeze, and a database connection warning appears mid-freeze. This matches a known failure mode where the new async (Trio-based) executor or a pooled DB connection hangs, leaving semaphores or Redis locks unreleased and the whole system in a zombie state. Heartbeats continue, but no tasks are processed, and only a full container restart recovers the system.

This regression likely stems from the major async refactor in v0.22.1, which introduced tighter coupling between async workers, Redis queue consumers, and DB connections. Any coroutine that hangs (e.g., waiting on a semaphore, Redis lock, or DB call) can deadlock the executor and block all progress. The code uses trio semaphores and capacity limiters for concurrency, Redis locks for executor cleanup, and pooled DB connections with retry logic—any of which could be the culprit if not released or retried correctly [task_executor.py].

For maintainers: a focused audit of async context handling, semaphore acquire/release, and DB/Redis connection management in the task executor and related services is needed. Adding granular logging around task acquisition, semaphore usage, and DB/Redis calls will help pinpoint the deadlock. Until a hotfix is available, the only reliable workaround is to revert to v0.22.0, use the nightly build, or periodically restart the container if stuck.

If you need targeted code path analysis or config flag checks, let me know. This issue is a critical blocker for Windows/Docker users and should be prioritized for a hotfix.

To reply, just mention @dosu.

^{How did I do? Good | Irrelevant | Incorrect | Verbose | Hallucination | Report 🐛 | Other}

Nov 22 '25 13:11 dosubot[bot]

@dosu

Thanks for the validation. I will proceed with downgrading to v0.22.0 to restore functionality for now.

However, please ensure this report is escalated to the human maintainers immediately. As confirmed, this is a Critical Blocker that renders the entire application unusable (Zombie State) on Windows environments. It is imperative that the development team understands the severity of this regression.

Please keep this issue open and high-priority until a hotfix addressing the Async/Trio deadlock is released.

Nov 22 '25 13:11 chg387387

Understood—your downgrade to v0.22.0 is a valid temporary workaround until a hotfix is available. This async deadlock in v0.22.1 is a critical regression that causes a full outage ("zombie state") on Windows/Docker, confirmed by logs and consistent reproduction. The evidence points to a concurrency or resource lockup introduced by the async/Trio refactor, with the task executor and DB/Redis connections freezing and only a container restart recovering the system.

This thread highlights the severity and urgency for human maintainers. The issue should remain open and prioritized until a fix is released. If maintainers need further technical analysis or code tracing, I can assist.

To reply, just mention @dosu.

^{How did I do? Good | Irrelevant | Incorrect | Verbose | Hallucination | Report 🐛 | Other}

Nov 22 '25 13:11 dosubot[bot]

@dosu

I have successfully downgraded to v0.22.0, and I can confirm that the issue is completely resolved. The system is stable, and the chat function is working perfectly again without any freezes.

This immediate recovery upon rollback serves as definitive proof that this is a severe regression bug specific to v0.22.1.

Please ensure this is flagged for the maintainers to investigate the async/Trio implementation in the new version.

Nov 22 '25 18:11 chg387387

Confirmed—your rollback to v0.22.0 and immediate recovery is definitive proof of a severe regression in v0.22.1. This async/Trio deadlock is a critical blocker on Windows/Docker, with logs and reproducibility showing the task executor and DB/Redis connections freeze until a full restart. Flagging this for maintainers: urgent investigation of the async/Trio implementation in v0.22.1 is needed, and this issue should remain open and prioritized until a hotfix is released. If maintainers need targeted code tracing or further technical analysis, I’m ready to assist.

To reply, just mention @dosu.

^{How did I do? Good | Irrelevant | Incorrect | Verbose | Hallucination | Report 🐛 | Other}

Nov 22 '25 18:11 dosubot[bot]

@chg387387 known issue, please revert to 0.22.0. Will be fixed in recent pr.

Nov 24 '25 02:11 Magicbook1108

hi, which web and feature you're using when it crash down or unresponsive?

Nov 25 '25 02:11 redredrrred

Hi,It happens on the Chat / Assistant interface.

The Scenario: ①I am in the "Chat" page, interacting with an Assistant that is linked to a Knowledge Base. ②I send a query (e.g., asking about a document). ③The UI shows the "thinking/loading" animation and hangs indefinitely. It never returns a response.

Backend Symptom: At that exact moment, the backend task_executor stops processing the queue (Pending count stuck), as shown in the logs I attached earlier.

你好，当它崩溃或无响应时，你用的是哪个网页和功能？

Nov 25 '25 04:11 chg387387

Hi,It happens on the Chat / Assistant interface.

The Scenario: ①I am in the "Chat" page, interacting with an Assistant that is linked to a Knowledge Base. ②I send a query (e.g., asking about a document). ③The UI shows the "thinking/loading" animation and hangs indefinitely. It never returns a response.

Backend Symptom: At that exact moment, the backend task_executor stops processing the queue (Pending count stuck), as shown in the logs I attached earlier.

你好，当它崩溃或无响应时，你用的是哪个网页和功能？

got it, thx a lot!!

Nov 27 '25 03:11 redredrrred

@chg387387 Did you add attachments during the chat?

Nov 27 '25 03:11 dcc123456

@chg387387 Did you add attachments during the chat?

No, I did not add any attachments during the chat. I was only using the Knowledge Base for Q&A.

Nov 27 '25 05:11 chg387387

@chg387387 Ok, thank you. We will fix this issue as soon as possible.

Nov 27 '25 06:11 dcc123456

My agent page too, it's always stuck, I now have to use langfuse to track the answer from AI, but I remember in my previous 0.22.0;0.22.1 seems also have this issue, the page is not responding @dcc123456 @redredrrred @Magicbook1108

Nov 27 '25 09:11 LingYi-Z01

@LingYi-Z01 Okay, we've noticed this issue and are fixing it.

Nov 27 '25 09:11 dcc123456