Low-quality generated synthetic dataset
[ ] I checked the documentation and related resources and couldn't find an answer to my question.
Your Question
When I try to generate a synthetic dataset from my docs, I mostly get the same questions/answers (out of 20 question/answer pairs, 12 are identical), which makes the dataset practically useless.
Code Examples
testset = generator.generate_with_langchain_docs(documents, test_size=20, distributions={simple: 0.5, reasoning: 0.5})
Hey @WoutDeRijck, can you explain some of the issues you have seen with the generations?
As I said, when I try to generate, say, 20 questions/answers, I get 12 copies of one question and 8 copies of another (all identical), so in practice I end up with only two different questions.
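To make a report like this easier to reproduce, the duplication can be quantified. The following is only a minimal sketch: it assumes the generated questions have already been pulled out into a plain list of strings, and `duplicate_report` and the sample `questions` list are hypothetical names and stand-in data, not part of the library or the actual output from this issue.

```python
from collections import Counter


def duplicate_report(questions):
    """Return (number of distinct questions, Counter of normalized questions)."""
    # Collapse whitespace and lowercase so trivially different copies
    # are counted as the same question.
    normalized = [" ".join(q.split()).lower() for q in questions]
    counts = Counter(normalized)
    return len(counts), counts


# Hypothetical stand-in data mirroring the 12 + 8 split described above.
questions = ["What is A?"] * 12 + ["What is B?"] * 8

unique_count, counts = duplicate_report(questions)
print(f"{unique_count} unique out of {len(questions)}")
for question, n in counts.most_common():
    print(n, question)
```

Sharing the counts together with the generation code and a sample of the documents should make it easier for someone else to reproduce the behavior.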
That is weird, @WoutDeRijck. Can you share your code so that I can try to reproduce it?
Hey @WoutDeRijck, were you able to solve this? I am also facing the same issue!
Hey @Nandakishore-Thekkadathu, since this is a duplicate, I'm closing it. Let's chat in the other one 🙂