Low quality generated synthetic dataset

Open WoutDeRijck opened this issue 1 year ago • 3 comments

[ ] I checked the documentation and related resources and couldn't find an answer to my question.

Your Question
When I try to generate a synthetic dataset from my docs, I mostly get the same questions/answers (out of 20 question/answer pairs, 12 are identical), which makes the dataset practically useless.

Code Examples
testset = generator.generate_with_langchain_docs(documents, test_size=20, distributions={simple: 0.5, reasoning: 0.5})

WoutDeRijck avatar Mar 20 '24 08:03 WoutDeRijck

Hey @WoutDeRijck Can you explain some of the issues you have seen with the generations?

shahules786 avatar Mar 21 '24 02:03 shahules786

Like I said, when I try to generate, say, 20 questions/answers, I get 12 copies of one question and 8 copies of another (all identical), so in practice I end up with only two different questions.

WoutDeRijck avatar Mar 22 '24 07:03 WoutDeRijck
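
The duplication described above (12 copies of one question plus 8 of another) can be quantified before sharing a reproduction. The snippet below is a minimal sketch, not part of ragas: `duplicate_report` is a hypothetical helper, and the sample questions are made-up placeholders that only mirror the reported 12 + 8 split.

```python
from collections import Counter


def duplicate_report(questions):
    """Summarize how many distinct questions a generated test set contains.

    `questions` is any iterable of question strings (hypothetical input;
    how you extract it depends on your ragas version).
    """
    # Normalize whitespace and case so trivially different copies still match.
    normalized = [" ".join(q.split()).lower() for q in questions]
    counts = Counter(normalized)
    return {
        "total": len(normalized),
        "distinct": len(counts),
        "most_common": counts.most_common(3),
    }


# Placeholder data mirroring the split reported in this issue.
questions = ["What is X?"] * 12 + ["How does Y work?"] * 8
report = duplicate_report(questions)
print(report["total"], report["distinct"])
```

With a real test set, the question strings would come from the generated data instead of the placeholder list. If your ragas version exposes something like a DataFrame export with a `question` column, that column could be passed in directly, but that is an assumption to check against the version you have installed.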

That is weird, @WoutDeRijck. Can you share your code so that I can try to reproduce it?

shahules786 avatar Mar 22 '24 16:03 shahules786

Hey @WoutDeRijck, were you able to solve this? I am also facing the same issue!

Hey @Nandakishore-Thekkadathu, since this is a duplicate, I'm closing it. Let's chat in the other one 🙂

jjmachan avatar Jun 04 '24 18:06 jjmachan