
[Question]: Big File Parsing

apps-lycusinc opened this issue 2 months ago · 5 comments

Self Checks

  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report (Language Policy).
  • [x] Non-English title submissions will be closed directly ( 非英文标题的提交将会被直接关闭 ) (Language Policy).
  • [x] Please do not modify this template :) and fill in all the required fields.

Describe your problem

How can I parse the attached file with RAPTOR in RAGFlow? It is larger than 7 MB and contains mostly text.

Which embedding, chat, and vision models are recommended for this in RAGFlow?

Fbr_IncomeTaxOrdinance_2001-amended-upto30.06.2024.pdf

RAPTOR fails on this file with INVALID_REQUEST - Error code: 400 - {'error': {'message': 'max_tokens must be at least 1, got -4236.', 'type': 'BadRequestError', 'param': None, 'code': 400}}, even though I set maximum tokens to 512 in the RAPTOR settings.

apps-lycusinc · Nov 04 '25 05:11

The error max_tokens must be at least 1, got -4236 happens when a text chunk sent for embedding or generation exceeds the model's token limit: the completion budget is typically computed as the context window minus the prompt's token count, so a prompt that overflows the window drives max_tokens negative. This is common with large PDFs, especially if chunking isn't strict enough or the embedding model has a small context window.
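As a rough illustration, here is a minimal sketch of that arithmetic and the clamp recommended below; MODEL_CONTEXT_WINDOW and count_tokens are illustrative placeholders, not RAGFlow internals:

```python
# Illustrative sketch of how a negative max_tokens arises and how to
# guard against it. Not RAGFlow's actual code.

MODEL_CONTEXT_WINDOW = 8192  # hypothetical context limit of the chat model

def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer such as tiktoken."""
    return len(text.split())

def completion_budget(prompt: str) -> int:
    remaining = MODEL_CONTEXT_WINDOW - count_tokens(prompt)
    # If the prompt alone overflows the window, `remaining` goes
    # negative -- exactly the "got -4236" case in the error above.
    # Clamp to at least 1 before sending the request.
    return max(1, remaining)
```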

To fix this:

  • Use a tokenizer-based chunking strategy: set chunk_token_num to 400–512 (well below your model's context limit) and use frequent delimiters so chunks cannot grow oversized. If a chunk still exceeds the limit, split it further or discard it; a chunking sketch follows this list.
  • Make sure you use the correct config field name: chunk_token_num (not chunk_token_count).
  • If you still see a negative max_tokens, check your RAPTOR config for calculation errors and clamp max_tokens to at least 1 before sending requests.
  • For plain-text PDFs, the Naive parser is fastest and skips OCR/layout analysis. For scanned, image-based, or complex-layout PDFs, use DeepDoc for OCR and structure extraction, but note that it is slower and more resource-intensive.
  • If chunk-size settings are ignored, or certain document types bypass the chunker logic, try splitting the PDF into smaller files or converting it to plain text before uploading.
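
Here is a minimal token-based chunking sketch (an illustration, not RAGFlow's actual splitter). It assumes the tiktoken tokenizer; substitute whatever tokenizer matches your model:

```python
# Minimal token-window chunker. Assumes `pip install tiktoken`; the
# encoding name and the 512/32 defaults are illustrative choices.

import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 512, overlap: int = 32) -> list[str]:
    """Split `text` into chunks of at most `max_tokens` tokens each."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = max(1, max_tokens - overlap)  # slide forward with a small overlap
    return [
        enc.decode(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), step)
    ]
```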

For embedding large documents, use models with large context windows. For example, BAAI/bge-m3 supports up to 8,192 tokens, while some models only support 512 tokens; using a model with a short context window will cause errors with large files. You can find a list of supported models and their capabilities for embedding, chat, and vision tasks in the official documentation. For vision/image-to-text, models like OpenAI GPT-4 Vision, Gemini, and Qwen are supported.
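
A quick pre-flight check (illustrative, not part of RAGFlow) that each chunk fits the embedding window, using the model's own tokenizer so the counts are exact:

```python
# Assumes `pip install transformers` and network access to fetch the
# tokenizer. The 8,192-token limit matches BAAI/bge-m3; a 512-token
# model would reject the same chunk.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

def fits_window(chunk: str, limit: int = 8192) -> bool:
    return len(tokenizer.encode(chunk)) <= limit
```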

If you use Ollama models, be aware that the "Max Tokens" field is required by the UI but should not leak into other parts of RAGFlow such as RAPTOR. If you encounter config leakage, use only Ollama models for both embedding and chat in the knowledge base and RAPTOR tasks, or use vLLM for local embedding.

For very large files, ensure your system has enough RAM and set vm.max_map_count to at least 262144 (on Linux: sysctl -w vm.max_map_count=262144) to avoid task failures. If you hit analysis failures, check whether the analysis size limit needs to be increased.

Let me know if you need help finding the chunk size setting in your workflow or want specific model recommendations for your language and deployment setup.


dosubot[bot] · Nov 04 '25 05:11

Where do I set chunk_token_num in the UI?

apps-lycusinc · Nov 04 '25 05:11

You can set the chunk_token_num parameter in the RAGFlow UI when configuring document parsing or knowledge base settings. Look for the "Chunk Method" configuration modal: when you select a parser type such as "Naive" or "KnowledgeGraph," you'll see a field labeled "Max Token Number." This field directly controls chunk_token_num, letting you specify the maximum number of tokens per chunk for document splitting. Adjust this value (e.g., set it to 400–512) to avoid token-limit errors with large files. The modal also includes related settings like delimiter and auto keywords, all under the parser configuration section. If you prefer to script it, see the HTTP API sketch below.
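
Here is a sketch of the same change via RAGFlow's HTTP API; the endpoint and parser_config shape follow the published API docs, but verify the field names against the API reference for your version. BASE_URL, API_KEY, and DATASET_ID are placeholders.

```python
# Sketch: updating a knowledge base's chunk size via RAGFlow's HTTP API.
# Double-check the endpoint and payload against your RAGFlow version.

import requests

BASE_URL = "http://localhost:9380"   # your RAGFlow server
API_KEY = "ragflow-..."              # from the API settings page
DATASET_ID = "your-dataset-id"

resp = requests.put(
    f"{BASE_URL}/api/v1/datasets/{DATASET_ID}",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "parser_config": {
            "chunk_token_num": 512,  # same knob as "Max Token Number" in the UI
        }
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```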


dosubot[bot] · Nov 04 '25 06:11

What kind of embedding inference service are you using?

KevinHuSh · Nov 04 '25 11:11

Since there has been no further activity for over three weeks, we will proceed to close this issue. If the problem persists or you have additional questions, please feel free to reopen the issue or create a new one. We’re happy to assist anytime.

Magicbook1108 · Nov 27 '25 02:11