[Bug]: During local embedding, RAGFlow sends too much text at once, exceeding the model's maximum input token limit, so the model cannot process the full input.
Is there an existing issue for the same bug?
- [x] I have checked the existing issues.
RAGFlow workspace code commit ID
main
RAGFlow image version
v0.15.1, nightly
Other environment information
Actual behavior
When using embedding models with a lower maximum input token capacity, such as bge-large and conan-embedding-v1, which are limited to a maximum input of 512 tokens, RAGFlow sends more than 512 tokens at once and Ollama encounters an error. I've found the cause of the error here: https://github.com/ollama/ollama/issues/7288#issuecomment-2591709109. Although I can adjust the model's maximum input limit in Ollama, that causes RAGFlow's text to be truncated, resulting in incomplete embeddings. Additionally, I'm unable to locate a setting within RAGFlow to control the maximum input for the embedding model.
When adding a model, the max token setting controls the maximum output rather than the input, and a maximum output does not apply to embedding models.
The same issue of an ineffective max token option also exists when adding reranker models.
Expected behavior
Please add a setting to RAGFlow to control the maximum number of tokens sent to the embedding model per request, and also fix the bug where the max token limit is ineffective when adding reranker models.
Steps to reproduce
Using the bge-large:latest model in Ollama, if embedding is performed with a chunking method other than 'General' (I am using 'Book') and the token count exceeds 512, an error occurs and the embedding is terminated.
Additional information
No response
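For reference, a client-side safeguard along the lines of the requested setting could look like the sketch below: count the tokens of each chunk before it is sent and split anything that exceeds the model's advertised input limit. Token counting here uses tiktoken's cl100k_base purely as an approximation; the embedding model's own tokenizer (and RAGFlow's internal counter) may count differently, and the 512 limit is just the value from this report.

```python
import tiktoken

MAX_INPUT_TOKENS = 512  # would come from a per-model setting in RAGFlow (assumed)

enc = tiktoken.get_encoding("cl100k_base")

def fit_to_limit(text: str, max_tokens: int = MAX_INPUT_TOKENS) -> list[str]:
    """Split `text` into pieces that each stay within `max_tokens`."""
    ids = enc.encode(text)
    if len(ids) <= max_tokens:
        return [text]
    # Hard token-window split, used only when a chunk is already oversized.
    return [enc.decode(ids[i:i + max_tokens]) for i in range(0, len(ids), max_tokens)]

# Every piece now respects the embedding model's input limit.
pieces = fit_to_limit("some chunk produced by the parser " * 200)
```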
What about reducing the chunk token size in the chunking method settings?
But it will not slice the text apart in the middle, since that would ruin the semantics and make the embedding meaningless.
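For illustration, the mismatch can be confirmed by counting each produced chunk with the embedding model's own tokenizer. A minimal sketch, assuming the transformers package and BAAI/bge-large-zh-v1.5 as a stand-in for whichever bge-large variant is configured:

```python
from transformers import AutoTokenizer

# Load the tokenizer of the embedding model actually in use (assumed name here).
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-zh-v1.5")

def oversized_chunks(chunks: list[str], limit: int = 512) -> list[tuple[int, int]]:
    """Return (chunk_index, token_count) for every chunk over the model's limit."""
    report = []
    for i, chunk in enumerate(chunks):
        n = len(tokenizer.encode(chunk))
        if n > limit:
            report.append((i, n))
    return report

# Usage: feed in the chunks shown on the knowledge base's chunk page.
print(oversized_chunks(["short chunk", "a much longer chunk " * 200]))
```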
You know, for some parsing methods, like QA, Resume, Manual, Table, Paper, Laws, Book, Presentation, and One, the chunk token number cannot be set manually.
We do not depend on embedding very much, given its limitations in representing the semantics of long text.
Could you please tell me what the significance of the 'max output tokens' setting is for embedding and rerank models?
Sometimes, if the input is too long, the embedding serving reports an error directly without automatically truncating.
Is it possible to add OpenAI-API-Compatible configuration options, such as input length? I often encounter an error about exceeding 8196 when using the Paper mode for embedding.
I also get errors when integrating intfloat/multilingual-e5-large-instruct with General chunking and a max token of 512.
This embedding model seems to be well rated in MMTEB https://arxiv.org/abs/2502.13595
HF TEI (text-embeddings-inference) logs:
2025-02-28T11:28:23.700376Z ERROR openai_embed:embed_pooled{truncate=false truncation_direction=Right prompt_name=None normalize=true}: text_embeddings_core::infer: core/src/infer.rs:332: Input validation error: `inputs` must have less than 512 tokens. Given: 562
2025-02-28T11:28:23.703694Z ERROR openai_embed:embed_pooled{truncate=false truncation_direction=Right prompt_name=None normalize=true}: text_embeddings_core::infer: core/src/infer.rs:332: Input validation error: `inputs` must have less than 512 tokens. Given: 538
2025-02-28T11:28:23.703902Z ERROR openai_embed:embed_pooled{truncate=false truncation_direction=Right prompt_name=None normalize=true}: text_embeddings_core::infer: core/src/infer.rs:332: Input validation error: `inputs` must have less than 512 tokens. Given: 518
2025-02-28T11:29:01.629322Z ERROR openai_embed:embed_pooled{truncate=false truncation_direction=Right prompt_name=None normalize=true}: text_embeddings_core::infer: core/src/infer.rs:332: Input validation error: `inputs` must have less than 512 tokens. Given: 996
2025-02-28T11:29:01.631500Z ERROR openai_embed:embed_pooled{truncate=false truncation_direction=Right prompt_name=None normalize=true}: text_embeddings_core::infer: core/src/infer.rs:332: Input validation error: `inputs` must have less than 512 tokens. Given: 1192
Kind regards & thanks for your great work. David.
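The truncate=false in these logs is why TEI rejects the requests outright. As a stopgap, and with the caveat that truncation silently drops text, TEI's native /embed endpoint accepts a truncate field (and the server can be started with --auto-truncate). A minimal sketch, assuming TEI is reachable at http://localhost:8080:

```python
import requests

resp = requests.post(
    "http://localhost:8080/embed",
    json={
        "inputs": "a chunk that may be longer than 512 tokens ...",
        "truncate": True,                 # cut the input instead of returning an error
        "truncation_direction": "Right",  # same direction the logs above show
    },
    timeout=30,
)
resp.raise_for_status()
embedding = resp.json()[0]  # /embed returns one vector per input
```

This only hides the problem on the serving side; the text beyond 512 tokens is still never embedded.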
I changed the chunk token number and can see the total chunk count change from 20 (chunk token number 800) to 51 (chunk token number 128), but the error still occurs: the requested input length is larger than what the provider can support. And as per the error messages, it seems the input length hasn't changed at all.
Input length of input_ids is `4783` and exceed max_sequence_length: `4096`
There is no error if I use an Ollama local embedding model; maybe the input tokens are truncated in that case.
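That matches Ollama's documented behavior, if I read it correctly: the /api/embed endpoint has a truncate option that defaults to true, so oversized inputs are silently cut to the context length rather than rejected. A sketch assuming a local Ollama on the default port with bge-large pulled:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/embed",
    json={
        "model": "bge-large:latest",
        "input": "a chunk that may exceed the model's context length ...",
        "truncate": True,  # set to False to get an error instead of silent truncation
    },
    timeout=60,
)
resp.raise_for_status()
vectors = resp.json()["embeddings"]
```

So "no error" with Ollama does not mean the whole chunk was embedded.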
What about reducing the chunk token size in the chunking method settings?
But it will not slice the text apart in the middle, since that would ruin the semantics and make the embedding meaningless.
It is 0.17.0 now and the problems are still there. You cannot set the chunk token number for document types like Manual, QA, and Resume, and the over-long embedding input problem still occurs often.
The issue still persists: "Chunk token number" is ignored and RAGFlow sends oversized chunks to the embedding model.
Hi, I've noticed the same thing with:
- version 0.17.2
- vLLM serving the embedding model intfloat/multilingual-e5-large-instruct, limited to 512 max tokens in the model providers configuration.
- KB General chunking set to 250 tokens.
RAGFlow logs:
openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': "This model's maximum context length is 512 tokens. However, you requested 1084 tokens in the input for embedding generation. Please reduce the length of the input.", 'type': 'BadRequestError', 'param': None, 'code': 400}
vLLM logs:
ERROR 04-14 10:05:44 [serving_embedding.py:143] raise ValueError(
ERROR 04-14 10:05:44 [serving_embedding.py:143] ValueError: This model's maximum context length is 512 tokens. However, you requested 1084 tokens in the input for embedding generation. Please reduce the length of the input.
INFO: 172.18.0.6:45496 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 04-14 10:05:54 [metrics.py:488] Avg prompt throughput: 0.9 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
Kind regards,
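If the vLLM build in use exposes the truncate_prompt_tokens extra parameter for pooling/embedding requests (an assumption; check the docs for your version), the 400 above can be avoided by letting vLLM cut the input server-side, with the same caveat that the tail of the chunk is dropped. A sketch with the OpenAI client pointed at the vLLM endpoint:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.embeddings.create(
    model="intfloat/multilingual-e5-large-instruct",
    input="a chunk that may be longer than 512 tokens ...",
    extra_body={"truncate_prompt_tokens": 512},  # assumed vLLM-specific extension
)
vector = resp.data[0].embedding
```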
Sometimes, if the input is too long, the embedding serving reports an error directly without automatically truncating.
How else do you retrieve texts?
We do not depend on embedding very much, given its limitations in representing the semantics of long text.
How else do you retrieve texts?
Met this too. How can I address this problem?
The error still exists in version v0.19.1 slim.
ERROR: status_code: 400, body: {'object': 'error', 'message': "This model's maximum context length is 512 tokens. However, you requested 532 tokens in the input for embedding generation. Please reduce the length of the input.", 'type': 'BadRequestError', 'param': None, 'code': 400}
Sometimes, if the input is too long, the embedding serving reports an error directly without automatically truncating.
I think this is a serious bug. Since a chunking strategy is used, the size of each chunk should be aligned with the configured size, rather than relying only on separators to decide whether to split into a new chunk, and rather than truncating outright. Instead, when the text exceeds the chunk size, a new chunk should be started. Otherwise there will always be a maximum-length-exceeded error, leading to parsing failure. This way, the length limit is not exceeded during embedding, nor when using rerank models or large language models, and the damage to the semantics is not very serious.
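A minimal sketch of the splitting policy described above: pack whole sentences greedily and start a new chunk as soon as the token budget would be exceeded, so no single chunk overruns the embedding model's limit. Token counting again uses tiktoken's cl100k_base as an approximation of the real model tokenizer, and the sentence splitter is deliberately naive.

```python
import re
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def split_by_token_budget(text: str, budget: int = 512) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `budget` tokens."""
    sentences = re.split(r"(?<=[.!?。!?])\s+", text)
    chunks, current, used = [], [], 0
    for sent in sentences:
        n = len(enc.encode(sent))
        if current and used + n > budget:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sent)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

One gap worth noting: a single sentence longer than the budget still comes out oversized and would need a hard split as a fallback.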