
Chunking strategy and answer relevance vs. Docs() object size ?

Open Snikch63200 opened this issue 1 year ago • 1 comments

Hello,

I'm looking for an optimal chunking strategy to get relevant answers for my queries.

I tried these parameters provided in the "high quality config":

  • chunk size : 7000
  • chunk overlap : 250

It gives poor results...

This config seems to give better results:

  • chunk size : 1024
  • chunk overlap : 128

The problem is that I work on a large set of documents (more than 1500), some of which are huge (more than 300 pages). A small chunk size with the recommended 10-20% overlap produces a huge Docs() object...

Are smaller chunk sizes better than larger ones in terms of relevance?

Is there a way to avoid creating such huge objects?

Best regards.

Snikch63200 avatar Oct 18 '24 09:10 Snikch63200

Smaller chunk sizes can indeed improve the relevance of answers by providing more focused context, but they also produce more chunks and therefore larger Docs() objects, especially with a large number of documents. The "high quality config" uses a chunk size of 7000 with an overlap of 250, which is intended to balance context size and continuity between chunks [1].
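To get a feel for how much the two configurations differ in chunk count, here is a rough back-of-the-envelope sketch. It assumes chunk size and overlap are counted in characters and that chunks are cut with a simple sliding window; `estimate_chunks` and the figure of 3000 characters per page are hypothetical and only for illustration, not something taken from paper-qa.

```python
import math


def estimate_chunks(total_chars: int, chunk_size: int, overlap: int) -> int:
    """Approximate chunk count for a sliding window with a fixed overlap."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    if total_chars <= 0:
        return 0
    if total_chars <= chunk_size:
        return 1
    # Each chunk after the first advances by (chunk_size - overlap) characters.
    stride = chunk_size - overlap
    return 1 + math.ceil((total_chars - chunk_size) / stride)


# Assumption: a 300-page document at roughly 3000 characters per page.
doc_chars = 300 * 3000

large_chunks = estimate_chunks(doc_chars, chunk_size=7000, overlap=250)
small_chunks = estimate_chunks(doc_chars, chunk_size=1024, overlap=128)
print(large_chunks, small_chunks)
```

Under these assumptions the smaller setting yields several times as many chunks per document, and each chunk carries its own embedding and metadata, which is consistent with the growth in Docs() size described in the question.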

To manage the size of Docs() objects while maintaining relevance, consider parallelizing the processing of documents. The gather_with_concurrency function in paper-qa allows for concurrent processing, which can help manage memory usage and improve performance [2][3]. Adjusting the max_concurrent_requests setting can also help optimize resource usage based on your system's capabilities.
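As a generic illustration of bounded concurrency, the sketch below caps the number of coroutines running at once with an `asyncio.Semaphore`. It is not paper-qa's own implementation: `bounded_gather` and `fake_process` are hypothetical names, and the sleep stands in for whatever per-document work is actually being done.

```python
import asyncio
from collections.abc import Awaitable, Iterable
from typing import TypeVar

T = TypeVar("T")


async def bounded_gather(limit: int, coros: Iterable[Awaitable[T]]) -> list[T]:
    """Await all coroutines, with at most `limit` of them running at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def run(coro: Awaitable[T]) -> T:
        # Each task waits here until one of the `limit` slots is free.
        async with semaphore:
            return await coro

    # asyncio.gather returns results in the same order as its inputs.
    return await asyncio.gather(*(run(c) for c in coros))


async def fake_process(doc_id: int) -> int:
    """Hypothetical stand-in for per-document work such as parsing or embedding."""
    await asyncio.sleep(0.01)
    return doc_id * 2


async def main() -> list[int]:
    return await bounded_gather(3, [fake_process(i) for i in range(10)])


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Note that a limit of this kind only bounds how much work is in flight at once; it does not change how many chunks are ultimately stored.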


dosubot[bot] avatar Oct 18 '24 09:10 dosubot[bot]