[Bug] Chinese characters duplicated after chunking since version 0.97.250211.1 (Docker)
Context / Scenario
When processing a large Chinese document (in Markdown format, around 100 KB), the Chunker introduced in version 0.97.250211.1 duplicates certain Chinese characters right after a chunk boundary. With the same Markdown file, the issue does not occur when I switch the Docker image back to version 0.96.250120.1.
I extracted a sample file demonstrating this problem. In the _files{documentId}\ folder, there is a file named *.partition.1.txt. Starting from the second chunk, it shows the following (actual content):
的開發發者大會會,發發布了一個個新的服務務
(remaining content omitted)
However, the expected content should be:
的開發者大會,發布了一個新的服務
(remaining content omitted)
As you can see, several Chinese characters are duplicated after the text is split by the chunker.
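To make the difference between the two samples above explicit, here is a minimal Python sketch that walks the expected and actual strings together and reports each character in the actual text that is an extra copy of the character before it. The function name `find_doubled_chars` is my own helper for illustration and is not part of Kernel Memory; the sketch assumes the only differences are such adjacent repeats and raises an error otherwise.

```python
def find_doubled_chars(expected: str, actual: str) -> list[tuple[int, str]]:
    """Return (index_in_actual, char) for every extra adjacent copy in `actual`.

    Hypothetical helper: assumes `actual` equals `expected` except for
    characters that are repeated immediately after themselves.
    """
    extras = []
    i = 0  # current position in `expected`
    for j, ch in enumerate(actual):
        if i < len(expected) and ch == expected[i]:
            # Character matches the expected text; advance both cursors.
            i += 1
        elif j > 0 and ch == actual[j - 1]:
            # Character repeats the previous one and is not expected here.
            extras.append((j, ch))
        else:
            raise ValueError(f"unexpected difference at actual[{j}]: {ch!r}")
    if i != len(expected):
        raise ValueError("actual text ends before expected text")
    return extras


# Sample strings copied from this report.
EXPECTED = "的開發者大會,發布了一個新的服務"
ACTUAL = "的開發發者大會會,發發布了一個個新的服務務"

if __name__ == "__main__":
    for index, char in find_doubled_chars(EXPECTED, ACTUAL):
        print(index, char)
```

For the sample above this reports five extra characters (發, 會, 發, 個, 務), which accounts for the whole difference in length between the 16-character expected text and the 21-character actual text.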
Steps to Reproduce
- Use a Markdown file (roughly 50 to 100 KB in size) that contains Chinese text.
- Process the file with Chunker version 0.97.250211.1.
- Observe that in subsequent chunks (e.g., the second chunk), some Chinese characters become duplicated.
- Switch to version 0.96.250120.1 using the same file and confirm that the duplication issue does not occur.
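When no expected text is at hand, a rough way to compare the output of the two versions is to count adjacent doubled CJK characters in each partition file. The sketch below is only a heuristic and rests on assumptions: `doubled_cjk` is a hypothetical helper name, the pattern covers only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), and legitimate reduplicated words such as 謝謝 will also match, so the useful signal is a large difference in counts between versions rather than any single match.

```python
import re

# Matches one ideograph from the basic CJK Unified Ideographs block
# followed immediately by the same character. Extension blocks are not covered.
DOUBLED_CJK = re.compile(r"([\u4e00-\u9fff])\1")


def doubled_cjk(text: str) -> list[str]:
    """Return the characters that appear twice in a row (heuristic only)."""
    return DOUBLED_CJK.findall(text)


if __name__ == "__main__":
    # Sample strings copied from this report.
    actual = "的開發發者大會會,發發布了一個個新的服務務"
    expected = "的開發者大會,發布了一個新的服務"
    print(len(doubled_cjk(actual)), len(doubled_cjk(expected)))
```

Reading the text of the same partition file produced by each Docker image and passing it to `doubled_cjk` should show a much higher count for the affected version if the duplication is present.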
Expected Behavior
Text should remain consistent without any duplication after being split into chunks, just as it does in the previous Docker image version.
Actual Behavior
Starting from the second chunk, some Chinese characters that follow a chunk boundary are duplicated, producing incorrect text output.
Please let me know if additional information or further examples are needed. Thank you!
What happened?
My test file: 2024-01-15-archview-llm.md
Importance
edge case
Platform, Language, Versions
Windows 11 24H2, running Docker in WSL2 with the image kernelmemory/service:0.97.250211.1-b-amd64
Relevant log output