
`llm.invoke(..., stream=True)` returns empty assistant chunks when prompt history is long

Open · akriaueno opened this issue 7 months ago · 4 comments

Self Checks

  • [x] This is only for bug reports; if you would like to ask a question, please head to Discussions.
  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [x] [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
  • [x] Please do not modify this template :) and fill in all the required fields.

Dify version

1.4.0

Cloud or Self Hosted

Cloud

Steps to reproduce

(Background: We first hit this bug while writing an MCP client plugin.
To isolate it, the repo below ships a minimal mock plugin that returns an HTML snapshot of ≈ 30–60 k tokens, so you can reproduce the problem without the real MCP stack.)

  1. Clone the sample plugin

    git clone https://github.com/urth-inc/dify-huge-tool-call-sim.git
    cd dify-huge-tool-call-sim
    git checkout develop
    uv sync            # or: pip install -r requirements.txt
    
  2. Install the plugin on Dify Cloud

    1. In the Dify UI, go to Plugins → Debug (bug icon) → Copy Key.
    2. Copy .env.example to .env and replace REMOTE_INSTALL_KEY with the key you copied.
    3. Start the plugin:
      uv run python -m main   # or: python -m main
      
  3. Create a workflow that uses this agent-strategy plugin

    • Model: GPT-4o (Azure OpenAI) or Gemini 2.0 Flash 1.0
    • No other nodes are required; the plugin generates the long history itself.
  4. Run the workflow several times

    • The debug console frequently stalls after the first one or two characters, or shows no response at all, which is clear evidence that empty chunks are being emitted.
    • In my test runs, trials 1 and 3 returned no content, whereas trial 2 returned the expected text.
    • Execution example:
      [screenshot]

Token-limit hypothesis ruled out

Capturing the same request through Cloudflare AI Gateway shows the full answer arriving from GPT-4o / Gemini, so the truncation happens inside Dify rather than at the model provider or due to context-window limits.

✔️ Expected Behavior

self.session.model.llm.invoke(..., stream=True) should always yield at least one chunk whose
chunk.delta.message.content is non-empty.
With the minimal loop below, response must not remain an empty string:

chunks: Generator[LLMResultChunk, None, None] | LLMResult = (
    self.session.model.llm.invoke(
        prompt_messages=prompt_history,
        model_config=LLMModelConfig(**params.model.model_dump(mode="json")),
        tools=prompt_message_tools,
        stream=True,
    )
)

response = ""

logger.info(f"🐛<type of chunks: {type(chunks)}")
for chunk in chunks: 
    response += chunk.delta.message.content

logger.info(f"🐛<response: {response}") # should contain text

yield self.create_text_message(text=f"{response}")

Current behavior: under the conditions described in Steps to Reproduce, the loop finishes without adding any content, leaving response empty.

❌ Actual Behavior

Under the conditions in Steps to Reproduce, the generator sometimes emits only 0–2 chunks, and those chunks have empty content.
Consequently response stays empty and the workflow terminates without any assistant text.

response = ""

logger.info(f"🐛<type of chunks: {type(chunks)}")
for chunk in chunks: 
    response += chunk.delta.message.content

logger.info(f"🐛<response: {response}") # no content

akriaueno · May 20 '25 06:05

Hello @akriaueno! I'm Dosu, your friendly bot here to help you solve bugs, answer questions, and guide you in becoming a contributor. I'll be around while you wait for a human maintainer. Let's tackle this together! 🤖

I found a similar resolved issue that might be relevant to your problem. The issue titled "LLM response text will be truncated after a long request" was closed by a pull request that added read and write timeout configurations [1]. This might suggest that adjusting timeout settings could potentially address the issue of empty chunks being emitted during streaming operations.

Additionally, Dify's error handling mechanisms for streaming operations could lead to empty chunks being emitted. Exceptions during streaming are output as stream events with an event: error, and when an error event is received, the stream ends [2][3]. This structured error handling ensures that exceptions are communicated clearly, which might help in diagnosing the root cause of the issue you're experiencing.
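
One way to check whether an event: error frame (or a silent disconnect) is what ends the stream is to consume the SSE response directly. The sketch below is only illustrative; it assumes the standard /v1/chat-messages streaming endpoint with an app API key, and the payload fields and event names follow the public API docs as I understand them, so adjust them to your setup.

import json

import requests

API_BASE = "https://api.dify.ai/v1"  # assumption: Dify Cloud API base URL
API_KEY = "app-..."                  # your app's API key

resp = requests.post(
    f"{API_BASE}/chat-messages",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "inputs": {},
        "query": "trigger the long tool-call history here",
        "response_mode": "streaming",
        "user": "debug-user",
    },
    stream=True,
    timeout=(10, 600),  # generous read timeout to rule out client-side cutoffs
)

for line in resp.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data:"):
        continue  # skip blank keep-alive lines and non-data SSE fields
    try:
        event = json.loads(line[len("data:"):].strip())
    except json.JSONDecodeError:
        continue  # e.g. ping frames
    if event.get("event") == "error":
        print("\nstream ended with error:", event)
        break
    if event.get("event") == "message":
        print(event.get("answer", ""), end="", flush=True)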

To continue talking to Dosu, mention @dosu.



dosubot[bot] · May 20 '25 06:05

This issue exists in version 1.3.0 and persists after upgrading to version 1.4.0. It causes MCP to not function properly when the conversation history is lengthy, even though the LLM itself actually produces output.

sunli1223 · May 21 '25 05:05

Change every timeout-related value in the .env file used by Docker Compose; append zeros to those lines and you should be good.

To view all timeout-related lines:

cd docker
grep -E 'TIMEOUT|TIME_OUT' .env

To append ten zeros to every timeout-related value in .env:

cd docker

cp .env .env.bak

# For Linux (GNU sed):
sed -i -E 's/^([A-Za-z_]*(TIMEOUT|TIME_OUT)[A-Za-z_]*[[:space:]]*=[[:space:]]*)([0-9]+)/\1\30000000000/' .env

# For macOS (BSD sed):
sed -i '' -E 's/^([A-Za-z_]*(TIMEOUT|TIME_OUT)[A-Za-z_]*[[:space:]]*=[[:space:]]*)([0-9]+)/\1\30000000000/' .env

https://github.com/langgenius/dify/issues/14416#issuecomment-2684579780

My problem is that when I use a locally deployed DeepSeek R1 671B with Ktransformers, which typically takes 20 minutes to respond to a prompt like "Write me a 10000-word news article for the car industry" in a Dify chatbot app, the frontend stops responding even though the LLM backend is still running and producing output. From the backend I can retrieve the full response, so the LLM is working properly and not exceeding the context window. I have also checked with Chatbox, and the model is able to stream full outputs to the dialog window, so there are no network issues.

According to the comments in the .env file, I should only need to enlarge the value of GUNICORN_TIMEOUT, since in the web developer console the streaming responses from /chat-messages are of type eventsource, i.e. server-sent events (SSE); but it never hurts to enlarge every possible timeout preset if it is only for personal use.

Additionally, please check both the max context window and the max response length in Dify and in your LLM backend configs.

I would recommend that the maintainers document this issue in the local deployment tutorials and offer a simple script or web-based config editor for adjusting timeout-related values.
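
As a rough starting point for such a script, here is a minimal, hypothetical sketch that appends zeros to the numeric timeout values in docker/.env from Python instead of sed. It assumes the timeout variables are named with TIMEOUT or TIME_OUT and hold plain numeric values (optionally followed by a unit suffix such as s), and it writes a backup first.

import re
import shutil
from pathlib import Path

ENV_FILE = Path("docker/.env")   # assumption: run from the Dify repo root
ZEROS_TO_APPEND = 3              # how many zeros to tack onto each value

# Match lines like GUNICORN_TIMEOUT=360 or NGINX_PROXY_READ_TIMEOUT=3600s
PATTERN = re.compile(
    r"^(?P<key>[A-Za-z_]*(?:TIMEOUT|TIME_OUT)[A-Za-z_]*)\s*=\s*(?P<value>\d+)(?P<suffix>.*)$"
)

shutil.copy(ENV_FILE, ENV_FILE.with_name(".env.bak"))  # keep a backup next to .env

lines = []
for line in ENV_FILE.read_text().splitlines():
    match = PATTERN.match(line)
    if match:
        line = f"{match['key']}={match['value']}{'0' * ZEROS_TO_APPEND}{match['suffix']}"
        print("updated:", line)
    lines.append(line)

ENV_FILE.write_text("\n".join(lines) + "\n")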

James4Ever0 · Jun 01 '25 12:06

Use:

uv run gunicorn -w 8 -b 0.0.0.0:5001 app:app --worker-class gevent

With gevent (async) workers, Gunicorn's worker timeout acts more like a liveness check than a per-request limit, so long-running streaming responses are less likely to get the worker killed, which may be why this helps.

kalsolio · Jun 05 '25 02:06

I checked, and this issue was fixed in 1.4.3 (maybe 1.4.2). Tested multiple times: streaming now works correctly with long prompt histories. Closing as resolved.

Thank you very much.

akriaueno · Jun 17 '25 14:06