[Question]:
Self Checks
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (Language Policy).
- [x] Non-English title submissions will be closed directly (Language Policy).
- [x] Please do not modify this template :) and fill in all the required fields.
Describe your problem
When I use the txt_parser for text chunking, I set the maximum chunk_token_num=512. Why are some of the generated chunks still larger than 512 tokens, and do I need to change anything elsewhere?
Hi @rookieCun! I'm Dosu and I’m helping the ragflow team.
The chunk_token_num in txt_parser acts as a soft upper bound: if a segment between delimiters is longer than 512 tokens, the resulting chunk can exceed your limit. This happens because the parser only starts a new chunk after the previous one is already over the threshold, so long paragraphs or sections without delimiters can cause overflow. To get more consistent chunk sizes, try adjusting your delimiter to split the text into smaller pieces (code reference).
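To illustrate that accumulation behavior, here is a minimal sketch (not RAGFlow's actual code; the function name, whitespace token counter, and pre-split sections are placeholders) of how one long delimited section can push a chunk past the limit:

```python
def merge_sections(sections, chunk_token_num=512, count_tokens=lambda s: len(s.split())):
    """Illustrative sketch only, not the real RAGFlow implementation.

    A new chunk is opened only after the current chunk has already crossed
    chunk_token_num, and a section is never split internally, so one long
    section between delimiters can overflow the limit.
    """
    chunks = [""]
    for sec in sections:                       # sections = text already split on the delimiter
        if count_tokens(chunks[-1]) > chunk_token_num:
            chunks.append(sec)                 # previous chunk already overflowed, start a new one
        else:
            chunks[-1] += " " + sec            # still under the threshold, keep accumulating
    return [c.strip() for c in chunks if c.strip()]

# A single 1000-token section becomes a single >512-token chunk:
long_section = "word " * 1000
print(len(merge_sections([long_section])[0].split()))  # -> 1000
```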
Also, if you’re setting chunk_token_num via API, double-check that you’re using the correct field name (chunk_token_num, not chunk_token_count). Using the wrong field can cause the default value to be used instead of your setting (related issue). If you still see the wrong value in the dashboard, try updating the config there to confirm the change takes effect.
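For reference, a hedged sketch of what such a request might look like; only the parser_config field names (chunk_token_num, delimiter) are taken from this thread, while the endpoint path, dataset id, and API key below are placeholders, so check your RAGFlow version's API reference before using it:

```python
import requests

# Placeholder host, dataset id and API key; field names follow this thread.
payload = {
    "parser_config": {
        "chunk_token_num": 512,        # soft upper bound, not a hard cut-off
        "delimiter": "\n。;!?",        # finer-grained delimiters give shorter sections
    }
}
resp = requests.put(
    "http://<ragflow-host>/api/v1/datasets/<dataset_id>",
    headers={"Authorization": "Bearer <API_KEY>"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())
```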
There aren’t any other settings that strictly enforce a hard maximum chunk size—tuning the delimiter is your best option for controlling chunk overflow.
To reply, just mention @dosu.
@dosu The default value is 512, so why are some of the generated chunks larger than 1000 tokens, and what rule does RAGFlow use when applying chunk_token_num?
The chunk_token_num in RAGFlow's txt_parser is a soft upper bound, not a strict limit. If a single section of text between delimiters is longer than 512 tokens, that entire section becomes a chunk—even if it exceeds 1000 tokens. This is why you see oversized chunks: the parser accumulates text until it crosses the threshold, but doesn't split inside a section. To get smaller chunks, use more frequent or finer-grained delimiters so sections are shorter (source).
Also, make sure you’re using the correct config field name: chunk_token_num (not chunk_token_count). Using the wrong field can cause the default to be used instead of your setting (source). If you’re on an older version and configs aren’t applying, this may be a known bug—upgrading should fix it.
There’s no hard cutoff or secondary setting that enforces a strict max chunk size; delimiter choice is the main way to control chunk overflow (source).
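If you need a strict cap, one workaround (an assumption on my part, not a built-in RAGFlow setting) is to enforce it yourself during preprocessing: insert extra delimiters into any section that is already too long before uploading the document. A rough sketch, using whitespace word counts as a stand-in for the real tokenizer:

```python
def hard_split(text, delimiter="\n", max_tokens=512,
               count_tokens=lambda s: len(s.split())):
    """Preprocessing workaround sketch (not a RAGFlow feature): break any
    section longer than max_tokens into smaller pieces so that no single
    delimited section can overflow the chunker's soft limit."""
    pieces = []
    for sec in text.split(delimiter):
        if count_tokens(sec) <= max_tokens:
            pieces.append(sec)
        else:
            words = sec.split()
            for i in range(0, len(words), max_tokens):   # slice the oversized section
                pieces.append(" ".join(words[i:i + max_tokens]))
    return delimiter.join(pieces)
```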
To reply, just mention @dosu.
The text is sliced based on both the chunk token size and the delimiters. RAGFlow does not split text between delimiters just because of chunk size, since that would badly break semantic integrity.
I know this. In the text parser I changed the separator to the special separator added during document preprocessing, and I also changed the desired chunk size. But the final rendered result only honors the separator requirement and never reaches the requested fixed chunk size. @dosu