[REQUEST] Are You Using a Max Length Chunking Strategy for All File Types?
Reference Issues
No response
Summary
It seems that a max length chunking strategy is being used for all file types. I believe that each file type should have its own chunking strategy to optimize accuracy.
Implementing customized chunking strategies based on file types could improve the overall precision of the system by taking into account the unique structure and content of each file type.
Basic Example
For example:
Markdown files could be chunked based on headers. DOCX files could be split into sections or paragraphs, and if a paragraph is too small, it can be merged with adjacent ones. Additionally, semantic similarity between two chunks could be used to decide whether they should be combined.
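To make the suggestion concrete, here is a minimal sketch of two of the strategies above: header-based splitting for Markdown, and merging of undersized paragraphs. The function names and the `min_chars` threshold are hypothetical, chosen only for illustration. This is not Kotaemon's actual implementation. The header splitter is deliberately simple and would also treat `#` lines inside fenced code blocks as headers. The semantic-similarity merge is not shown.

```python
import re


def split_markdown_by_headers(text):
    """Split Markdown text into chunks, starting a new chunk at each ATX header."""
    chunks, current = [], []
    for line in text.splitlines():
        # A line such as "# Title" or "## Section" opens a new chunk.
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


def merge_small_paragraphs(paragraphs, min_chars=200):
    """Merge paragraphs until each chunk has at least min_chars characters.

    min_chars is a hypothetical threshold; a token count could be used instead.
    """
    merged, buffer = [], ""
    for para in paragraphs:
        buffer = f"{buffer}\n\n{para}" if buffer else para
        if len(buffer) >= min_chars:
            merged.append(buffer)
            buffer = ""
    if buffer:
        # A short trailing remainder is attached to the previous chunk.
        if merged:
            merged[-1] = f"{merged[-1]}\n\n{buffer}"
        else:
            merged.append(buffer)
    return merged
```

The paragraph merger works on any list of strings, so for DOCX it could be fed the paragraph texts extracted by whichever reader is already in use.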
Drawbacks
None
Additional information
Optimizing chunking per file type is very important for improving accuracy. It would produce more meaningful chunks and improve overall performance.
What is Kotaemon's chunking method: token-based, semantic, sentence-based, or something else? Kind regards,