Generate Synthetic Dataset from Pinecone index

Open rnbokade opened this issue 1 year ago • 3 comments

As a RAG developer, I already have a large ingestion pipeline that loads my data into Pinecone. Since ETL processes are computationally expensive, I would rather not redo them. However, I now want to generate questions from the chunked documents stored in my Pinecone index, and ragas only accepts LangChain documents or LlamaIndex documents/nodes. How should I approach this?

rnbokade avatar Mar 21 '24 09:03 rnbokade

Hey @rnbokade, this is something we might introduce in the near future. There are some challenges associated with it, including working around chunk size bias. For example, if we create questions from predefined chunks, the questions will mostly be of poor quality and will be biased towards the chunk size you have defined. This prevents ragas from forming high-quality QA pairs.
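One possible way to soften the chunk size bias, sketched under assumptions rather than as anything ragas provides: stitch the stored chunks back into per-source documents before handing them to the generator. The sketch assumes the records have already been fetched from the index as dicts with an `id` and a `metadata` dict, and that the ingestion pipeline stored hypothetical `text`, `source`, and `chunk_index` keys there. The `Document` dataclass is only a stand-in that mirrors the `page_content`/`metadata` shape of a LangChain document.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in with the same two fields as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def merge_chunks(records):
    """Group chunk records by source and join them in chunk order.

    The metadata keys `text`, `source`, and `chunk_index` are assumptions
    about how the ingestion pipeline stored each chunk.
    """
    by_source = defaultdict(list)
    for record in records:
        by_source[record["metadata"]["source"]].append(record)

    documents = []
    for source, chunks in by_source.items():
        # Restore the original reading order within each source.
        chunks.sort(key=lambda r: r["metadata"]["chunk_index"])
        text = "\n".join(r["metadata"]["text"] for r in chunks)
        documents.append(
            Document(
                page_content=text,
                metadata={"source": source, "chunk_ids": [r["id"] for r in chunks]},
            )
        )
    return documents


# Made-up records, shaped as assumed above.
records = [
    {"id": "a-1", "metadata": {"source": "a.md", "chunk_index": 1, "text": "Second part."}},
    {"id": "a-0", "metadata": {"source": "a.md", "chunk_index": 0, "text": "First part."}},
    {"id": "b-0", "metadata": {"source": "b.md", "chunk_index": 0, "text": "Other file."}},
]
docs = merge_chunks(records)
```

If the chunks were stored with overlap, the joined text will repeat the overlapping spans, so some deduplication would be needed before using the merged documents.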

shahules786 avatar Mar 22 '24 16:03 shahules786

For now, I can create a preliminary docstore implementation with add, similarity search, fetch, and the other abstract methods. Maybe we can fine-tune it once people start using it.
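A rough sketch of what such a docstore could look like, purely as an illustration: the `DocStore` interface, its method signatures, and the `InMemoryDocStore` class are all hypothetical and not the ragas docstore API. The in-memory version ranks by cosine similarity in plain Python; a Pinecone-backed version would presumably delegate `fetch` and `similarity_search` to the index instead.

```python
import math
from abc import ABC, abstractmethod


class DocStore(ABC):
    """Hypothetical interface; method names follow the comment above."""

    @abstractmethod
    def add(self, doc_id, text, embedding): ...

    @abstractmethod
    def fetch(self, doc_id): ...

    @abstractmethod
    def similarity_search(self, query_embedding, top_k=3): ...


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryDocStore(DocStore):
    """Toy implementation that keeps everything in a dict."""

    def __init__(self):
        self._items = {}  # doc_id -> (text, embedding)

    def add(self, doc_id, text, embedding):
        self._items[doc_id] = (text, list(embedding))

    def fetch(self, doc_id):
        return self._items[doc_id][0]

    def similarity_search(self, query_embedding, top_k=3):
        # Rank stored ids by similarity to the query, highest first.
        scored = sorted(
            self._items.items(),
            key=lambda kv: cosine(query_embedding, kv[1][1]),
            reverse=True,
        )
        return [doc_id for doc_id, _ in scored[:top_k]]


# Made-up two-dimensional embeddings for illustration.
store = InMemoryDocStore()
store.add("x", "about cats", [1.0, 0.0])
store.add("y", "about dogs", [0.0, 1.0])
store.add("z", "cats and dogs", [1.0, 1.0])
nearest = store.similarity_search([1.0, 0.1], top_k=2)
```

Keeping the interface this small means the question generator only depends on three operations, so swapping the in-memory store for one backed by an existing index should not require touching the rest of the pipeline.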

rnbokade avatar Mar 22 '24 17:03 rnbokade

Hey @rnbokade, that makes sense. I would love to see a PR if you are able to do it.

shahules786 avatar Mar 22 '24 17:03 shahules786