rag-experiment-accelerator
Adding semantic HTML chunking using Unstructured.io
This PR is still in progress. It adds semantic chunking for HTML documents.
The same strategy should also be applicable to the other document chunkers.
Overall method:
- use Unstructured to chunk by title
- use an embedding model to embed each split chunk (NOTE: this currently uses spaCy by default; adding a configurable embedding model is an open task)
- compare split chunks to find and combine semantically similar chunks
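The compare-and-combine step above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `bag_of_words` is a toy stand-in for the spaCy embedding, `merge_similar_chunks` is a hypothetical helper name, and merging only adjacent chunks is an assumption about how "combine" is applied.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def bag_of_words(text: str) -> Vector:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return dict(Counter(text.lower().split()))


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge_similar_chunks(
    chunks: List[str],
    threshold: float,
    embed: Callable[[str], Vector] = bag_of_words,
) -> List[str]:
    """Combine each chunk with the previous one when they are similar enough.

    Assumption: only adjacent chunks are compared, so document order is kept.
    """
    merged: List[str] = []
    for chunk in chunks:
        if merged and cosine_similarity(embed(merged[-1]), embed(chunk)) >= threshold:
            # Similar enough: fold this chunk into the previous one.
            merged[-1] = merged[-1] + "\n" + chunk
        else:
            merged.append(chunk)
    return merged


if __name__ == "__main__":
    title_chunks = ["cats purr softly", "cats purr loudly", "stock markets fell"]
    print(merge_similar_chunks(title_chunks, threshold=0.5))
```

In the real pipeline the input chunks would come from Unstructured's by-title chunking, and `embed` would be replaced by the configured embedding model.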
Changes:
- Added SEMANTIC as a chunking strategy to configuration
- Added SEMANTIC_SIMILARITY_THRESHOLD to configuration (this defines how semantically similar two chunks must be before they are combined)
- Updated load_documents to accept a config as a parameter, so these settings can be passed in without adding more arguments to load_documents
- Updated html_loader to include semantic chunking logic
- [x] Do not merge until prerelease is merged into development. #556
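
As a rough illustration of the configuration changes listed above, the sketch below shows one way a config object could carry the new settings into a loader. Only the names SEMANTIC and SEMANTIC_SIMILARITY_THRESHOLD come from this PR. Everything else is hypothetical: the `ChunkingStrategy` enum, the `BASIC` placeholder, the `CHUNKING_STRATEGY` field, the 0.8 default, and `load_documents_sketch`, which is not the real `load_documents` signature.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class ChunkingStrategy(Enum):
    BASIC = "basic"        # placeholder for the strategies that already exist
    SEMANTIC = "semantic"  # the strategy added in this PR


@dataclass
class ChunkingConfig:
    # Hypothetical field name; the PR only says SEMANTIC is a configurable strategy.
    CHUNKING_STRATEGY: ChunkingStrategy = ChunkingStrategy.BASIC
    # Illustrative default; the PR does not state one.
    SEMANTIC_SIMILARITY_THRESHOLD: float = 0.8

    def __post_init__(self) -> None:
        if not 0.0 <= self.SEMANTIC_SIMILARITY_THRESHOLD <= 1.0:
            raise ValueError("SEMANTIC_SIMILARITY_THRESHOLD must be in [0, 1]")


def load_documents_sketch(paths: List[str], config: ChunkingConfig) -> List[Dict[str, object]]:
    """Hypothetical loader: reads chunking options from config, not from extra arguments."""
    use_semantic = config.CHUNKING_STRATEGY is ChunkingStrategy.SEMANTIC
    return [
        {
            "path": path,
            "semantic": use_semantic,
            # The threshold only matters when semantic chunking is enabled.
            "threshold": config.SEMANTIC_SIMILARITY_THRESHOLD if use_semantic else None,
        }
        for path in paths
    ]
```

Passing a single config object keeps the loader's signature stable as further chunking options are added.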