[Bug]: The built-in chunking method "laws" in RAGFlow has an error.
Self Checks
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (Language Policy).
- [x] Non-English title submissions will be closed directly (非英文标题的提交将会被直接关闭) (Language Policy).
- [x] Please do not modify this template :) and fill in all the required fields.
RAGFlow workspace code commit ID
45678
RAGFlow image version
v0.22.1
Other environment information
The documents being parsed are plain .txt files.
Actual behavior
In RAGFlow, the built-in document chunking method "laws" works well for some .txt files but poorly for others. For instance, when a document is divided into 10 chunks, the first 9 chunks may each contain only one sentence, while the last chunk can hold tens of thousands of characters.
Expected behavior
When chunking with the "laws" method, any chunk longer than 6000 characters should be split again so that no chunk exceeds 6000 characters, and adjacent chunks should overlap by 10%. Chunks of 6000 characters or fewer are considered normal and left untouched.
Steps to reproduce
Create a knowledge base that uses the "laws" chunking method, upload a .txt file, parse it, and inspect the sizes of the resulting chunks. Some files produce 9 tiny single-sentence chunks followed by one chunk of tens of thousands of characters.
Additional information
I would like to add this logic myself: please point me to the place in the source code where the modification should be made and provide the corresponding code.
To fix the "laws" chunking so that no chunk exceeds 6000 characters and each chunk overlaps the previous one by 10%, add a helper function after the initial chunking step (after tree_merge, or after the Docx output for .docx files) and before tokenize_chunks is called. This helper should split any chunk longer than 6000 characters into smaller chunks with a 10% (600-character) overlap.
Here’s an example helper function and how to integrate it:
# In rag/app/laws.py
def rechunk_with_overlap(chunks, max_len=6000, overlap_ratio=0.1):
    """Split any chunk longer than max_len characters into pieces of at most
    max_len characters, with adjacent pieces overlapping by overlap_ratio."""
    new_chunks = []
    overlap = int(max_len * overlap_ratio)  # 600 characters with the defaults
    for chunk in chunks:
        if not isinstance(chunk, str):
            # skip entries that are not plain strings
            continue
        if len(chunk) <= max_len:
            new_chunks.append(chunk)
        else:
            # sliding window: each piece starts `overlap` characters
            # before the previous piece ended
            start = 0
            end = max_len
            while start < len(chunk):
                new_chunks.append(chunk[start:end])
                if end >= len(chunk):
                    break
                start = end - overlap
                end = start + max_len
    return new_chunks
Integrate it in the chunk function, right after tree_merge (or after the Docx call for docx files) and before tokenize_chunks:
# For docx files
chunks = Docx()(filename, binary)
chunks = rechunk_with_overlap(chunks, max_len=6000, overlap_ratio=0.1)
return tokenize_chunks(chunks, doc, eng, None)
# For other file types, after tree_merge:
res = tree_merge(bull, sections, 2)
res = rechunk_with_overlap(res, max_len=6000, overlap_ratio=0.1)
return tokenize_chunks(res, doc, eng, pdf_parser)
This ensures all output chunks are ≤6000 characters and have a 10% overlap, as requested. The overlap logic is adapted from the sliding window approach used in naive_merge, but applied to character counts instead of tokens.
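As a quick sanity check before touching RAGFlow itself, you can run the helper on a synthetic oversized string. This is a standalone sketch that only assumes the rechunk_with_overlap function shown above:

    # Standalone check of the sliding-window behaviour of rechunk_with_overlap.
    text = "x" * 20000                      # one oversized "chunk"
    pieces = rechunk_with_overlap([text], max_len=6000, overlap_ratio=0.1)

    assert all(len(p) <= 6000 for p in pieces)      # no piece exceeds the cap
    for prev, nxt in zip(pieces, pieces[1:]):
        assert prev[-600:] == nxt[:600]             # 600-character overlap
    print([len(p) for p in pieces])                 # [6000, 6000, 6000, 3800]

Note that only the upper bound and the overlap are guaranteed; the final piece can be shorter than max_len.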
References: the naive_merge overlap logic and the integration point in laws.py.
I'm not quite clear about the answers to the above questions. The file format I uploaded is .txt. Could you please explain in more detail the specific steps to take and which file to modify?
To modify the "laws" chunking method for .txt files so that no chunk exceeds 6000 characters and each chunk overlaps the previous by 10%, you need to edit the chunking logic in the file rag/app/laws.py. Specifically, make your changes inside the chunk function, which handles .txt file processing.
Here’s what to do:
- Open rag/app/laws.py and find the chunk function.
- After the line res = tree_merge(bull, sections, 2) is called (this is right before return tokenize_chunks(...)), insert a helper function that splits any chunk longer than 6000 characters into smaller chunks with a 10% (600-character) overlap.
- Example helper function to add in the same file:

      def rechunk_with_overlap(chunks, max_len=6000, overlap_ratio=0.1):
          new_chunks = []
          overlap = int(max_len * overlap_ratio)
          for chunk in chunks:
              if not isinstance(chunk, str):
                  continue
              if len(chunk) <= max_len:
                  new_chunks.append(chunk)
              else:
                  start = 0
                  end = max_len
                  while start < len(chunk):
                      new_chunks.append(chunk[start:end])
                      if end >= len(chunk):
                          break
                      start = end - overlap
                      end = start + max_len
          return new_chunks

- Use this helper right after tree_merge and before tokenize_chunks:

      res = tree_merge(bull, sections, 2)
      res = rechunk_with_overlap(res, max_len=6000, overlap_ratio=0.1)
      return tokenize_chunks(res, doc, eng, pdf_parser)

- Save your changes.
- Fully restart your backend services to make sure the new code is loaded; there is no hot-reload for parser modules, so a full restart is required for changes to take effect. If you use containers, rebuild and redeploy them. If changes don't take effect, clear any __pycache__ folders or .pyc files in the parser directory to remove Python bytecode caches (a small cleanup sketch follows this list). More details on this process are in this issue.
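If you prefer to script that cache cleanup, here is a minimal sketch; it assumes you run it from the root of your RAGFlow checkout and that the parser code lives under rag/ (adjust the path if your layout differs):

    # Remove stale __pycache__ directories under rag/ so the edited parser
    # module is re-imported on the next service start.
    import shutil
    from pathlib import Path

    for cache_dir in Path("rag").rglob("__pycache__"):
        print(f"removing {cache_dir}")
        shutil.rmtree(cache_dir, ignore_errors=True)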
This will ensure all output chunks are ≤6000 characters and have a 10% overlap, as you requested.
Where exactly should this code be added?

    res = tree_merge(bull, sections, 2)
    res = rechunk_with_overlap(res, max_len=6000, overlap_ratio=0.1)
    return tokenize_chunks(res, doc, eng, pdf_parser)
You should add the code immediately after the line res = tree_merge(bull, sections, 2) in the chunk function inside rag/app/laws.py, specifically in the branch that handles .txt files. After that line there is only a simple empty check and then the return statement, so your code should look like this:
res = tree_merge(bull, sections, 2)
res = rechunk_with_overlap(res, max_len=6000, overlap_ratio=0.1)
if not res:
    callback(0.99, "No chunk parsed out.")
return tokenize_chunks(res, doc, eng, pdf_parser)
Make sure the rechunk_with_overlap helper function is defined in the same file. This will ensure all .txt chunks are processed as you want, with no chunk exceeding 6000 characters and a 10% overlap.
You can see the relevant code structure here: rag/app/laws.py#L189-L237.
After making this change, restart your backend services and clear any __pycache__ folders to ensure the update takes effect.
Could you share your file with us? [email protected]