private-gpt
Change chunking/splitting method from SentenceWindowNodeParser to JSONNodeParser
Question
I’m currently using PrivateGPT v0.6.1 with Llama-CPP support on a Windows machine, with a Qdrant DB. The LLM is Mistral-7B-Instruct-v0.3 and the embedding model is BAAI/bge-m3.
I need to ingest a large JSON file (say, a telephone directory) where each record should remain intact as a single node. With the SentenceWindowNodeParser, records often get split at improper places, leading to jumbled responses when querying the LLM, especially when matching users to their telephone numbers.
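To make the goal concrete, here is a stdlib-only sketch with a hypothetical directory (the field names and data are made up, not my real file): one chunk per record, never split mid-record.

```python
import json

# Hypothetical telephone-directory JSON (my real file is much larger).
directory_json = """
[
  {"name": "Alice Smith", "phone": "555-0100"},
  {"name": "Bob Jones",   "phone": "555-0101"}
]
"""

records = json.loads(directory_json)

# Desired behaviour: one chunk per record, so a query like
# "What is Bob's number?" always sees the whole entry.
chunks = [json.dumps(record, sort_keys=True) for record in records]

for chunk in chunks:
    print(chunk)
```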
I made the following changes to `ingest_service.py`:

- Replaced the import
  `from llama_index.core.node_parser import SentenceWindowNodeParser`
  with
  `from llama_index.core.node_parser import JSONNodeParser`
- Replaced
  `node_parser = SentenceWindowNodeParser.from_defaults()`
  with
  `node_parser = JSONNodeParser.from_defaults()`
After making these changes, I ingested the JSON file again. It didn’t throw any errors, but the console showed that the file was converted into 1 document, followed by the message: `private_gpt.components.ingest.ingest_component - Inserting count=0 nodes in the index`. As expected, I don’t see any nodes in Qdrant.
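In case it matters, here is a stdlib-only sanity check of the kind I can run (with stand-in data, since I can’t post the real directory): confirming that the text parses as JSON and noting its top-level type, in case the parser only handles one shape of document.

```python
import json

# Stand-in for the ingested file's text (illustrative, not my real data).
text = '[{"name": "Alice", "phone": "555-0100"}]'

# json.loads raises json.JSONDecodeError if the text is not valid JSON.
data = json.loads(text)

# The top-level type may matter: a parser written for JSON objects
# could yield zero nodes from a top-level array, and vice versa.
top_level = type(data).__name__
print(top_level, len(data))
```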
What am I missing? Your advice would be greatly appreciated!