[Bug]: On variable chunk sizes (< max chunk size) the gleaning function causes a context window overflow (using vLLM + OpenAI connection)
Do you need to file an issue?
- [x] I have searched the existing issues and this bug is not already filed.
- [x] I believe this is a legitimate bug, not just a question or feature request.
Describe the bug
The gleaning function adds too much history, so the context window overflows and the vLLM server refuses to reply. Please make sure the gleaning function removes as much history as needed to fit inside the max token window.
Steps to reproduce
Use e.g. recursive chunking in operate.py, then use gleaning. The max token limit is no longer respected, which leads to a failed insertion.
Expected Behavior
The gleaning function cuts the history short so that it fits within the context window limitation.
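For illustration, a minimal sketch of that expected trimming behavior (not LightRAG's actual code; `count_tokens` and `trim_history` are hypothetical stand-ins for the pipeline's tokenizer and history handling):

```python
# Minimal sketch: drop the oldest gleaning turns until the conversation
# fits within the model's context window.

def count_tokens(text: str) -> int:
    # Rough placeholder; a real setup would use the model's tokenizer
    # (e.g. the HF tokenizer served by vLLM).
    return len(text.split())

def trim_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the system prompt and the newest turns, dropping the oldest
    non-system messages until the total stays under max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(count_tokens(m["content"]) for m in system + rest) > max_tokens:
        rest.pop(0)  # drop the oldest non-system message first
    return system + rest

if __name__ == "__main__":
    history = [
        {"role": "system", "content": "Extract entities and relations."},
        {"role": "user", "content": "chunk text ..."},
        {"role": "assistant", "content": "first extraction pass ..."},
        {"role": "user", "content": "Many entities were missed. Add them."},
    ]
    print(trim_history(history, max_tokens=50))
```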
LightRAG Config Used
Paste your config here
Logs and screenshots
RuntimeError: chunk-f8eee4beb4e00017948dd4c22067bc37: BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 36000 tokens. However, your request has 36389 input tokens. Please reduce the length of the input messages. None", 'type': 'BadRequestError', 'param': None, 'code': 400}} (Original exception could not be reconstructed: APIStatusError.__init__() missing 2 required keyword-only arguments: 'response' and 'body')
Additional Information
- LightRAG Version: v1.4.9.8/0251
- Operating System: Ubuntu 24.04
- Python Version: 3.12
- Related Issues:
Hi @Dorn8010, can you assign this issue to me? I would really like to work on this.
I also get this error, though I have not ruled out a configuration problem on my side. Would you be so kind as to tell me which parameters I need to set, and to which values, to accommodate my vLLM context window?
I run qwen3 with a context window of 16384.
When initializing LightRAG, I set `rag = LightRAG(..., llm_model_max_token_size=16384, ...)`, and before I query I set:
```python
query_param = QueryParam(
    mode='hybrid',
    top_k=5,
    max_token_for_text_unit=4000,
    max_token_for_global_context=4000,
    max_token_for_local_context=4000
)
response = rag.query(
    question["question"],
    param=query_param,
    system_prompt=SYSTEM_PROMPT
)
```
Do I have to choose top_k and max_token_for_text_unit so that top_k * max_token_for_text_unit < context_window? Is there another relation that I don't know of? My thinking was that max_token_for_text_unit + max_token_for_global_context + max_token_for_local_context < context_window, but that seems to be a wrong assumption.
@Powerkrieger The query parameter max_token_for_text_unit has been deprecated; please use max_total_tokens instead. The token count for text chunks is calculated as: max_total_tokens - max_token_for_global_context - max_token_for_local_context.
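For example, with the 16384-token window mentioned above, the budget arithmetic described in that comment works out like this (illustrative values only; check your LightRAG version for the exact QueryParam field names):

```python
# Illustrative budget check following the relation described above;
# parameter names may differ between LightRAG versions.
context_window = 16384               # vLLM --max-model-len for the served qwen3
max_total_tokens = 12000             # overall prompt budget (must stay below the window)
max_token_for_global_context = 4000
max_token_for_local_context = 4000

# Remaining budget for retrieved text chunks:
text_chunk_budget = (max_total_tokens
                     - max_token_for_global_context
                     - max_token_for_local_context)
print(text_chunk_budget)  # 4000

# Leave headroom for the system prompt, the query itself, and the model's output.
assert max_total_tokens < context_window
```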
@Dorn8010 We may need to introduce an environment variable to control whether the gleaning stage is included during entity and relation extraction. If the length of the LLM’s initial response exceeds a predefined threshold, the gleaning stage will be skipped.
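A rough sketch of what that switch could look like (the SKIP_GLEANING_OVER_TOKENS variable and the helper arguments are hypothetical, not an existing LightRAG setting):

```python
# Hypothetical sketch of the proposed opt-out; names are illustrative only.
import os

# 0 (the default here) means "never skip gleaning".
SKIP_GLEANING_OVER_TOKENS = int(os.getenv("SKIP_GLEANING_OVER_TOKENS", "0"))

def extract_with_optional_gleaning(initial_response: str, count_tokens, run_gleaning):
    """Run the gleaning pass only when the first extraction is short enough,
    so the accumulated history cannot push the prompt past the context window."""
    if SKIP_GLEANING_OVER_TOKENS and count_tokens(initial_response) > SKIP_GLEANING_OVER_TOKENS:
        return initial_response  # skip gleaning; keep only the first pass
    return run_gleaning(initial_response)
```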