[REQUEST] Are You Using a Max Length Chunking Strategy for All File Types?

Open QuangTQV opened this issue 1 year ago • 1 comment

Reference Issues

No response

Summary

It seems that a max-length chunking strategy is being used for all file types. I believe each file type should have its own chunking strategy to optimize accuracy.

Implementing customized chunking strategies based on file types could improve the overall precision of the system by taking into account the unique structure and content of each file type.

Basic Example

For example:

Markdown files could be chunked by header. DOCX files could be split into sections or paragraphs, with any paragraph that is too small merged into an adjacent one. Additionally, the semantic similarity between two adjacent chunks could be used to decide whether to combine them.
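To make the idea concrete, here is a minimal sketch of two of these strategies in plain Python. It is not Kotaemon's actual implementation: the function names `chunk_markdown_by_headers` and `merge_small_paragraphs` and the `min_chars` threshold are hypothetical, and the DOCX paragraphs are assumed to have already been extracted as plain strings by some loader.

```python
import re


def chunk_markdown_by_headers(text):
    """Split Markdown into chunks, starting a new chunk at each ATX header.

    Simplification: a line starting with '#' inside a fenced code block
    is also treated as a header.
    """
    chunks, current = [], []
    for line in text.splitlines():
        # A header line closes the chunk collected so far.
        if re.match(r"#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


def merge_small_paragraphs(paragraphs, min_chars=200):
    """Merge paragraphs shorter than min_chars (hypothetical threshold)
    into the paragraph that follows them."""
    merged, buffer = [], ""
    for para in paragraphs:
        buffer = f"{buffer}\n\n{para}" if buffer else para
        if len(buffer) >= min_chars:
            merged.append(buffer)
            buffer = ""
    if buffer:
        # A short trailing paragraph is attached to the previous chunk.
        if merged:
            merged[-1] = f"{merged[-1]}\n\n{buffer}"
        else:
            merged.append(buffer)
    return merged


if __name__ == "__main__":
    doc = "# Intro\nSome text.\n## Details\nMore text."
    for chunk in chunk_markdown_by_headers(doc):
        print(repr(chunk))
```

A dispatcher could then pick the strategy from the file extension and fall back to max-length chunking for unknown types. The semantic-similarity merge would additionally need an embedding model, so it is left out of this sketch.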

Drawbacks

None

Additional information

Optimizing chunking per file type is very important for improving accuracy. This adjustment would help create more meaningful chunks and enhance the overall performance.

QuangTQV avatar Oct 22 '24 09:10 QuangTQV

What is the chunking method of Kotaemon? Tokens, semantic, sentence...? Kind regards,

dromeuf avatar Apr 15 '25 12:04 dromeuf