[Bug]: On variable chunk sizes (< max chunk size) the gleaning function causes a context window overflow (using vLLM + OpenAI connection)
Do you need to file an issue?
- [x] I have searched the existing issues and this bug is not already filed.
- [x] I believe this is a legitimate bug, not just a question or feature request.
Describe the bug
The gleaning function adds too much history, so the context window overflows and the vLLM server refuses to reply. Please make sure the gleaning function removes as much history as needed to fit inside the max token window.
Steps to reproduce
Use e.g. recursive chunking in operate.py, then use gleaning. The max token limit is no longer respected, which leads to a failed insertion.
Expected Behavior
The gleaning function cuts the history short so that it fits within the context window limitation.
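For illustration, a minimal sketch of that expected trimming behavior (not LightRAG's actual code; `count_tokens` and `trim_history` are hypothetical stand-ins for the pipeline's tokenizer and history handling):

```python
# Minimal sketch: drop the oldest gleaning turns until the conversation
# fits within the model's context window.

def count_tokens(text: str) -> int:
    # Rough placeholder; a real setup would use the model's tokenizer
    # (e.g. the HF tokenizer served by vLLM).
    return len(text.split())

def trim_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the system prompt and the newest turns, dropping the oldest
    non-system messages until the total stays under max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(count_tokens(m["content"]) for m in system + rest) > max_tokens:
        rest.pop(0)  # drop the oldest non-system message first
    return system + rest

if __name__ == "__main__":
    history = [
        {"role": "system", "content": "Extract entities and relations."},
        {"role": "user", "content": "chunk text ..."},
        {"role": "assistant", "content": "first extraction pass ..."},
        {"role": "user", "content": "Many entities were missed. Add them."},
    ]
    print(trim_history(history, max_tokens=50))
```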
LightRAG Config Used
Paste your config here
Logs and screenshots
RuntimeError: chunk-f8eee4beb4e00017948dd4c22067bc37: BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 36000 tokens. However, your request has 36389 input tokens. Please reduce the length of the input messages. None", 'type': 'BadRequestError', 'param': None, 'code': 400}} (Original exception could not be reconstructed: APIStatusError.__init__() missing 2 required keyword-only arguments: 'response' and 'body')
Additional Information
- LightRAG Version: v1.4.9.8/0251
- Operating System: Ubuntu 24.04
- Python Version: 3.12
- Related Issues:
Hi @Dorn8010, can you assign this issue to me? I would really like to work on this.
I also get this error, though I have not ruled out a configuration problem on my side. Would you be so kind as to tell me which parameters I need to set, and to which values, to accommodate my vLLM context window?
I run qwen3 with a context window of 16384.
When initializing LightRAG, I set `rag = LightRAG(..., llm_model_max_token_size=16384, ...)`, and before I query I set:
```python
query_param = QueryParam(
    mode='hybrid',
    top_k=5,
    max_token_for_text_unit=4000,
    max_token_for_global_context=4000,
    max_token_for_local_context=4000
)
response = rag.query(
    question["question"],
    param=query_param,
    system_prompt=SYSTEM_PROMPT
)
```
Do I have to choose top_k and max_token_for_text_unit so that top_k * max_token_for_text_unit < context_window? Is there another relation that I don't know of? My thinking was that max_token_for_text_unit + max_token_for_global_context + max_token_for_local_context < context_window, but that seems to be a wrong assumption.
@Powerkrieger The query parameter max_token_for_text_unit has been deprecated; please use max_total_tokens instead. The token count for text chunks is calculated as: max_total_tokens - max_token_for_global_context - max_token_for_local_context.
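For example, with the 16384-token window mentioned above, the budget arithmetic described in that comment works out like this (illustrative values only; check your LightRAG version for the exact QueryParam field names):

```python
# Illustrative budget check following the relation described above;
# parameter names may differ between LightRAG versions.
context_window = 16384               # vLLM --max-model-len for the served qwen3
max_total_tokens = 12000             # overall prompt budget (must stay below the window)
max_token_for_global_context = 4000
max_token_for_local_context = 4000

# Remaining budget for retrieved text chunks:
text_chunk_budget = (max_total_tokens
                     - max_token_for_global_context
                     - max_token_for_local_context)
print(text_chunk_budget)  # 4000

# Leave headroom for the system prompt, the query itself, and the model's output.
assert max_total_tokens < context_window
```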
@Dorn8010 We may need to introduce an environment variable to control whether the gleaning stage is included during entity and relation extraction. If the length of the LLM’s initial response exceeds a predefined threshold, the gleaning stage will be skipped.
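A rough sketch of what that switch could look like (the SKIP_GLEANING_OVER_TOKENS variable and the helper arguments are hypothetical, not an existing LightRAG setting):

```python
# Hypothetical sketch of the proposed opt-out; names are illustrative only.
import os

# 0 (the default here) means "never skip gleaning".
SKIP_GLEANING_OVER_TOKENS = int(os.getenv("SKIP_GLEANING_OVER_TOKENS", "0"))

def extract_with_optional_gleaning(initial_response: str, count_tokens, run_gleaning):
    """Run the gleaning pass only when the first extraction is short enough,
    so the accumulated history cannot push the prompt past the context window."""
    if SKIP_GLEANING_OVER_TOKENS and count_tokens(initial_response) > SKIP_GLEANING_OVER_TOKENS:
        return initial_response  # skip gleaning; keep only the first pass
    return run_gleaning(initial_response)
```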