[Bug]: The built-in chunking method "laws" in RAGFlow has an error.
Self Checks
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (Language Policy).
- [x] Non-English title submissions will be closed directly (非英文标题的提交将会被直接关闭) (Language Policy).
- [x] Please do not modify this template :) and fill in all the required fields.
RAGFlow workspace code commit ID
45678
RAGFlow image version
v0.22.1
Other environment information
The documents being parsed are plain .txt files.
Actual behavior
In RAGFlow, the built-in document chunking method "laws" works well for some .txt files but poorly for others. For instance, when a document is divided into 10 chunks, the first 9 chunks may each contain only one sentence, while the last chunk can hold tens of thousands of characters.
Expected behavior
When chunking with the "laws" method, any chunk longer than 6000 characters should be split again so that no chunk exceeds 6000 characters, and adjacent chunks should overlap by 10%. Chunks of 6000 characters or fewer are considered normal and left untouched.
Steps to reproduce
Create a knowledge base that uses the "laws" chunking method, upload a .txt file, parse it, and inspect the sizes of the resulting chunks. Some files produce 9 tiny single-sentence chunks followed by one chunk of tens of thousands of characters.
Additional information
I would like to add this logic myself: please point me to the place in the source code where the modification should be made and provide the corresponding code.
To fix the "laws" chunking so that no chunk exceeds 6000 characters and each chunk overlaps the previous one by 10%, add a helper function after the initial chunking step (after tree_merge, or after the Docx output for .docx files) and before tokenize_chunks is called. This helper should split any chunk longer than 6000 characters into smaller chunks with a 10% (600-character) overlap.
Here’s an example helper function and how to integrate it:
# In rag/app/laws.py
def rechunk_with_overlap(chunks, max_len=6000, overlap_ratio=0.1):
    """Split any chunk longer than max_len characters into pieces of at most
    max_len characters, with adjacent pieces overlapping by overlap_ratio."""
    new_chunks = []
    overlap = int(max_len * overlap_ratio)  # 600 characters with the defaults
    for chunk in chunks:
        if not isinstance(chunk, str):
            # skip entries that are not plain strings
            continue
        if len(chunk) <= max_len:
            new_chunks.append(chunk)
        else:
            # sliding window: each piece starts `overlap` characters
            # before the previous piece ended
            start = 0
            end = max_len
            while start < len(chunk):
                new_chunks.append(chunk[start:end])
                if end >= len(chunk):
                    break
                start = end - overlap
                end = start + max_len
    return new_chunks
Integrate it in the chunk function, right after tree_merge (or after the Docx call for docx files) and before tokenize_chunks:
# For docx files
chunks = Docx()(filename, binary)
chunks = rechunk_with_overlap(chunks, max_len=6000, overlap_ratio=0.1)
return tokenize_chunks(chunks, doc, eng, None)
# For other file types, after tree_merge:
res = tree_merge(bull, sections, 2)
res = rechunk_with_overlap(res, max_len=6000, overlap_ratio=0.1)
return tokenize_chunks(res, doc, eng, pdf_parser)
This ensures all output chunks are ≤6000 characters and have a 10% overlap, as requested. The overlap logic is adapted from the sliding window approach used in naive_merge, but applied to character counts instead of tokens.
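As a quick sanity check before touching RAGFlow itself, you can run the helper on a synthetic oversized string. This is a standalone sketch that only assumes the rechunk_with_overlap function shown above:

    # Standalone check of the sliding-window behaviour of rechunk_with_overlap.
    text = "x" * 20000                      # one oversized "chunk"
    pieces = rechunk_with_overlap([text], max_len=6000, overlap_ratio=0.1)

    assert all(len(p) <= 6000 for p in pieces)      # no piece exceeds the cap
    for prev, nxt in zip(pieces, pieces[1:]):
        assert prev[-600:] == nxt[:600]             # 600-character overlap
    print([len(p) for p in pieces])                 # [6000, 6000, 6000, 3800]

Note that only the upper bound and the overlap are guaranteed; the final piece can be shorter than max_len.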
References: the naive_merge overlap logic and the integration point in laws.py.
I'm not quite clear about the answers to the above questions. The file format I uploaded is .txt. Could you please explain in more detail the specific steps to take and which file to modify?
To modify the "laws" chunking method for .txt files so that no chunk exceeds 6000 characters and each chunk overlaps the previous by 10%, you need to edit the chunking logic in the file rag/app/laws.py. Specifically, make your changes inside the chunk function, which handles .txt file processing.
Here’s what to do:
- Open rag/app/laws.py and find the chunk function.
- After the line res = tree_merge(bull, sections, 2) is called (this is right before return tokenize_chunks(...)), insert a helper function that splits any chunk longer than 6000 characters into smaller chunks with a 10% (600-character) overlap.
- Example helper function to add in the same file:

      def rechunk_with_overlap(chunks, max_len=6000, overlap_ratio=0.1):
          new_chunks = []
          overlap = int(max_len * overlap_ratio)
          for chunk in chunks:
              if not isinstance(chunk, str):
                  continue
              if len(chunk) <= max_len:
                  new_chunks.append(chunk)
              else:
                  start = 0
                  end = max_len
                  while start < len(chunk):
                      new_chunks.append(chunk[start:end])
                      if end >= len(chunk):
                          break
                      start = end - overlap
                      end = start + max_len
          return new_chunks

- Use this helper right after tree_merge and before tokenize_chunks:

      res = tree_merge(bull, sections, 2)
      res = rechunk_with_overlap(res, max_len=6000, overlap_ratio=0.1)
      return tokenize_chunks(res, doc, eng, pdf_parser)

- Save your changes.
- Fully restart your backend services to make sure the new code is loaded; there is no hot-reload for parser modules, so a full restart is required for changes to take effect. If you use containers, rebuild and redeploy them. If changes don't take effect, clear any __pycache__ folders or .pyc files in the parser directory to remove Python bytecode caches (a small cleanup sketch follows this list). More details on this process are in this issue.
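If you prefer to script that cache cleanup, here is a minimal sketch; it assumes you run it from the root of your RAGFlow checkout and that the parser code lives under rag/ (adjust the path if your layout differs):

    # Remove stale __pycache__ directories under rag/ so the edited parser
    # module is re-imported on the next service start.
    import shutil
    from pathlib import Path

    for cache_dir in Path("rag").rglob("__pycache__"):
        print(f"removing {cache_dir}")
        shutil.rmtree(cache_dir, ignore_errors=True)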
This will ensure all output chunks are ≤6000 characters and have a 10% overlap, as you requested.
Where exactly should this code be added?

    res = tree_merge(bull, sections, 2)
    res = rechunk_with_overlap(res, max_len=6000, overlap_ratio=0.1)
    return tokenize_chunks(res, doc, eng, pdf_parser)
You should add the code immediately after the line res = tree_merge(bull, sections, 2) in the chunk function inside rag/app/laws.py, specifically in the branch that handles .txt files. After that line there is only a simple empty check and then the return statement, so your code should look like this:
res = tree_merge(bull, sections, 2)
res = rechunk_with_overlap(res, max_len=6000, overlap_ratio=0.1)
if not res:
    callback(0.99, "No chunk parsed out.")
return tokenize_chunks(res, doc, eng, pdf_parser)
Make sure the rechunk_with_overlap helper function is defined in the same file. This will ensure all .txt chunks are processed as you want, with no chunk exceeding 6000 characters and a 10% overlap.
You can see the relevant code structure here: rag/app/laws.py#L189-L237.
After making this change, restart your backend services and clear any __pycache__ folders to ensure the update takes effect.
Could you share your file with us? [email protected]