paper-qa
Chunking strategy and answer relevance vs. Docs() object size?
Hello,
I'm looking for an optimal chunking strategy to get relevant answers for my queries.
I tried the parameters provided in the "high quality config":
- chunk size: 7000
- chunk overlap: 250
It gives poor results...
This config seems to give better results:
- chunk size: 1024
- chunk overlap: 128
The problem is that I work on a large set of documents (more than 1,500), some of which are huge (more than 300 pages). A small chunk size with the recommended 10-20% overlap produces a very large Docs() object...
Are smaller chunk sizes better than bigger ones in terms of relevance?
Is there a way to avoid creating such huge objects?
Best regards.
Smaller chunk sizes can indeed improve answer relevance by providing more focused context, but they also multiply the number of chunks, which makes the Docs() object grow quickly across a large document set. The "high quality config" suggests a chunk size of 7000 with an overlap of 250 as a trade-off between per-chunk context and continuity between chunks [1].
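As a minimal sketch of comparing the two strategies you listed, assuming a paper-qa 5.x-style Settings API (the `ParsingSettings` field names `chunk_size`/`overlap` and the `settings=` keyword may differ in your installed version):

```python
# Sketch only: assumes paper-qa 5.x-style Settings / ParsingSettings.
from paperqa import Docs, Settings
from paperqa.settings import ParsingSettings

# Smaller chunks -> more focused retrieval, but many more chunks per document.
focused = Settings(parsing=ParsingSettings(chunk_size=1024, overlap=128))

# Larger chunks -> fewer, broader chunks, closer to the "high quality" preset.
broad = Settings(parsing=ParsingSettings(chunk_size=7000, overlap=250))

docs = Docs()
docs.add("my_paper.pdf", settings=focused)
answer = docs.query("What does this paper conclude?", settings=focused)
print(answer.formatted_answer)
```

Running the same query against both settings objects on a representative subset of your corpus is usually the quickest way to see which chunking regime your documents favor.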
To manage the size of the Docs() object while maintaining relevance, consider parallelizing document processing. paper-qa's gather_with_concurrency utility runs tasks with a bounded level of concurrency, which helps keep memory usage under control and improves throughput [2][3]. Adjusting the max_concurrent_requests setting can also help tune resource usage to your system's capabilities.
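Below is a hedged sketch of bounding concurrency yourself with an asyncio.Semaphore, similar in spirit to gather_with_concurrency. It assumes your paper-qa version exposes an async `Docs.aadd`; if it does not, the same pattern applies to whatever async add method is available.

```python
# Sketch only: limits how many documents are parsed/embedded at once.
# Docs.aadd is assumed to be the async counterpart of Docs.add in your version.
import asyncio
from pathlib import Path

from paperqa import Docs, Settings

async def add_all(docs: Docs, paths: list[Path], settings: Settings, limit: int = 4) -> None:
    sem = asyncio.Semaphore(limit)  # at most `limit` documents in flight

    async def add_one(path: Path) -> None:
        async with sem:
            await docs.aadd(path, settings=settings)

    await asyncio.gather(*(add_one(p) for p in paths))

# Usage (hypothetical paths): process a large corpus in bounded batches
# to keep peak memory in check.
# docs = Docs()
# asyncio.run(add_all(docs, sorted(Path("papers").glob("*.pdf")), Settings(), limit=4))
```

Keeping the in-flight count low trades some wall-clock time for a flatter memory profile, which matters most with 300+ page documents.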