Generate Synthetic Dataset from Pinecone index
As a RAG developer, I already have a large ingestion pipeline that loads my data into Pinecone. ETL processes are computationally expensive, so I would rather not rerun them. However, I now want to generate questions from the chunked documents already stored in my Pinecone index, but ragas accepts only LangChain documents or LlamaIndex documents/nodes. How should I approach this?
Hey @rnbokade, this is something we might introduce in the near future. There are some challenges associated with it, including bypassing the chunk size bias. For example, if we create questions out of predefined chunks, the questions will mostly be of poor quality and will be biased towards the chunk size you have defined. This will prevent ragas from forming high-quality QA pairs.
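One possible way to soften the chunk size bias is to stitch the stored chunks back into one text per source document before generating questions. The sketch below is not part of ragas; it assumes each record's metadata carries a source identifier, a chunk position, and the chunk text under the hypothetical keys `source`, `chunk_index`, and `text`, which your pipeline may name differently or not store at all.

```python
from collections import defaultdict


def regroup_chunks(records, source_key="source", order_key="chunk_index", text_key="text"):
    """Merge stored chunks back into one text per source document.

    `records` is a list of dicts shaped like {"id": ..., "metadata": {...}}.
    The metadata key names are assumptions, not a Pinecone or ragas convention.
    """
    grouped = defaultdict(list)
    for rec in records:
        meta = rec["metadata"]
        grouped[meta[source_key]].append((meta[order_key], meta[text_key]))
    # Sort each group by chunk position, then join the texts in that order.
    return {
        source: "\n".join(text for _, text in sorted(parts))
        for source, parts in grouped.items()
    }


records = [
    {"id": "a-1", "metadata": {"source": "a.txt", "chunk_index": 1, "text": "second"}},
    {"id": "a-0", "metadata": {"source": "a.txt", "chunk_index": 0, "text": "first"}},
    {"id": "b-0", "metadata": {"source": "b.txt", "chunk_index": 0, "text": "only"}},
]
merged = regroup_chunks(records)
```

If the chunks were created with overlap, the joined text will repeat the overlapping parts, so this only approximates the original documents.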
For now, I can create a preliminary docstore implementation with add, similarity search, fetch, and the other abstract methods. Maybe we can fine-tune it once people start using it.
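A rough idea of the shape such a docstore could take, as an in-memory sketch only. The names `Node`, `InMemoryDocStore`, and the method signatures are hypothetical and are not the ragas docstore interface. A Pinecone-backed version would presumably delegate these methods to the index's `upsert`, `query`, and `fetch` calls instead of a dict.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical record type: one stored chunk with its embedding.
    id: str
    text: str
    embedding: list
    metadata: dict = field(default_factory=dict)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryDocStore:
    """Stand-in for a vector-index-backed docstore (illustrative only)."""

    def __init__(self):
        self._nodes = {}

    def add(self, node):
        # Store or overwrite a node by its id.
        self._nodes[node.id] = node

    def fetch(self, node_id):
        # Return the node with this id; raises KeyError if it is missing.
        return self._nodes[node_id]

    def similarity_search(self, query_embedding, top_k=3):
        # Rank all nodes by cosine similarity to the query, highest first.
        ranked = sorted(
            self._nodes.values(),
            key=lambda n: cosine(query_embedding, n.embedding),
            reverse=True,
        )
        return ranked[:top_k]


store = InMemoryDocStore()
store.add(Node("a", "alpha", [1.0, 0.0]))
store.add(Node("b", "beta", [0.0, 1.0]))
store.add(Node("c", "gamma", [1.0, 1.0]))
hits = store.similarity_search([1.0, 0.0], top_k=2)
```

Keeping the interface this small means the backing store can be swapped without touching the question generation code.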
Hey @rnbokade that makes sense. Would love to see a PR if you are able to do it.