private-gpt
Change chunking/splitting method from SentenceWindowNodeParser to JSONNodeParser
Question
I’m currently using PrivateGPT v0.6.1 with Llama-CPP support on a Windows machine, with a Qdrant DB. The LLM is Mistral-7B-Instruct-v0.3 and the embedding model is BAAI/bge-m3.
I need to ingest a large JSON file (say, a telephone directory) where each record should remain intact as a single node. With the SentenceWindowNodeParser, records often get split at improper places, leading to jumbled responses when querying the LLM, especially when matching users to their telephone numbers.
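To make the goal concrete, here is a stdlib-only sketch with a hypothetical directory (the field names and data are made up, not my real file): one chunk per record, never split mid-record.

```python
import json

# Hypothetical telephone-directory JSON (my real file is much larger).
directory_json = """
[
  {"name": "Alice Smith", "phone": "555-0100"},
  {"name": "Bob Jones",   "phone": "555-0101"}
]
"""

records = json.loads(directory_json)

# Desired behaviour: one chunk per record, so a query like
# "What is Bob's number?" always sees the whole entry.
chunks = [json.dumps(record, sort_keys=True) for record in records]

for chunk in chunks:
    print(chunk)
```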
I made the following changes to `ingest_service.py`:

- Replaced the import
  `from llama_index.core.node_parser import SentenceWindowNodeParser`
  with
  `from llama_index.core.node_parser import JSONNodeParser`
- Replaced
  `node_parser = SentenceWindowNodeParser.from_defaults()`
  with
  `node_parser = JSONNodeParser.from_defaults()`
After making these changes, I ingested the JSON file again. It didn’t throw any errors, but the console showed that the file was converted into 1 document, followed by the message: `private_gpt.components.ingest.ingest_component - Inserting count=0 nodes in the index`. As expected, I don’t see any nodes in Qdrant.
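In case it matters, here is a stdlib-only sanity check of the kind I can run (with stand-in data, since I can’t post the real directory): confirming that the text parses as JSON and noting its top-level type, in case the parser only handles one shape of document.

```python
import json

# Stand-in for the ingested file's text (illustrative, not my real data).
text = '[{"name": "Alice", "phone": "555-0100"}]'

# json.loads raises json.JSONDecodeError if the text is not valid JSON.
data = json.loads(text)

# The top-level type may matter: a parser written for JSON objects
# could yield zero nodes from a top-level array, and vice versa.
top_level = type(data).__name__
print(top_level, len(data))
```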
What am I missing? Your advice would be greatly appreciated!