rag-experiment-accelerator

Adding semantic HTML chunking using Unstructured.io

Open tarockey opened this issue 11 months ago • 1 comment

Still in progress. This adds semantic chunking for HTML documents.

The same strategy should also be applicable to the rest of the document chunkers.

Overall method:

  1. Use Unstructured to chunk the document by title.
  2. Use an embedding model to embed each split chunk. NOTE: this currently uses spaCy by default; adding a configurable embedding model is an open task.
  3. Compare the split chunks to find and combine semantically similar ones.
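
Steps 2 and 3 can be sketched roughly as below. This is a minimal illustration, not the code in this PR: it assumes the chunks have already been split by title, that only adjacent chunks are compared, and that similarity is cosine similarity. The names `bag_of_words`, `cosine_similarity`, and `merge_similar_chunks` are hypothetical, and a toy word-count embedding stands in for spaCy so the sketch runs without extra dependencies.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def bag_of_words(text: str) -> Vector:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return dict(Counter(text.lower().split()))


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge_similar_chunks(
    chunks: List[str],
    threshold: float,
    embed: Callable[[str], Vector] = bag_of_words,
) -> List[str]:
    """Merge each chunk into the previous one when they are similar enough.

    Assumption: only adjacent chunks are compared, and a merged chunk is
    re-embedded before it is compared with the next one.
    """
    merged: List[str] = []
    for chunk in chunks:
        if merged and cosine_similarity(embed(merged[-1]), embed(chunk)) >= threshold:
            merged[-1] = merged[-1] + "\n" + chunk
        else:
            merged.append(chunk)
    return merged


if __name__ == "__main__":
    sections = [
        "install the package with pip",
        "install the package with conda",
        "license and legal notice",
    ]
    # The first two sections share most of their words and are combined.
    print(merge_similar_chunks(sections, threshold=0.5))
```

Passing the embedding function as a parameter keeps the merge logic independent of the model, which is one way the open task of a configurable embedding model could be accommodated.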

Changes:

  • Added SEMANTIC as a chunking strategy to the configuration.
  • Added SEMANTIC_SIMILARITY_THRESHOLD to the configuration. This defines how semantically similar two chunks must be before they are combined.
  • Updated load_documents to accept a config as a parameter, so these settings can be passed without adding more arguments to load_documents.
  • Updated html_loader to include the semantic chunking logic.
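
As a rough illustration of the config-as-parameter approach, the sketch below bundles the two new settings into one object that a loader receives. Everything here is hypothetical and does not reflect the repository's actual config class or the real `load_documents` signature: `ChunkingStrategy.BASIC`, `ChunkingConfig`, `load_documents_sketch`, the default of 0.8, and the assumed 0 to 1 range for the threshold are all placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ChunkingStrategy(Enum):
    # BASIC is a hypothetical placeholder for the existing strategies.
    BASIC = "basic"
    SEMANTIC = "semantic"


@dataclass(frozen=True)
class ChunkingConfig:
    chunking_strategy: ChunkingStrategy = ChunkingStrategy.BASIC
    # Assumed to be a cosine-style score in [0, 1]; the default is illustrative.
    semantic_similarity_threshold: float = 0.8

    def __post_init__(self) -> None:
        if not 0.0 <= self.semantic_similarity_threshold <= 1.0:
            raise ValueError("semantic_similarity_threshold must be in [0, 1]")


def load_documents_sketch(paths: List[str], config: ChunkingConfig) -> str:
    """Hypothetical loader that only reports which chunking path it would take."""
    if config.chunking_strategy is ChunkingStrategy.SEMANTIC:
        return (
            f"semantic chunking of {len(paths)} file(s), "
            f"threshold={config.semantic_similarity_threshold}"
        )
    return f"basic chunking of {len(paths)} file(s)"
```

New chunking options can then be added as fields on the config object without changing the loader's parameter list.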

tarockey, Mar 22 '24 21:03

  • [x] Do not merge until prerelease is merged into development. #556

julia-meshcheryakova, Aug 14 '24 14:08