
Feedback Request: Future of Testset Generation Module in Ragas v0.4

Open jjmachan opened this issue 4 months ago • 2 comments

Hello Ragas community! 👋

As we prepare for the Ragas v0.4 release, we'd like to gather your feedback on the future direction of our Testset Generation module. This module has been part of Ragas for some time now, and while we have several community PRs ready to merge for the upcoming release, we want to ensure we're aligned with the community's needs going forward.

We'd love your input on the following options:

  1. Keep it in core - Continue maintaining Testset Generation as a core part of the Ragas library
  2. Separate package - Extract it into its own standalone package outside the main Ragas repository
  3. Deprecate - Phase it out if it's not providing sufficient value to users

How to participate:

  • 📊 Vote by reacting to this issue with:

    • 👍 for Option 1 (Keep in core)
    • 🚀 for Option 2 (Separate package)
    • 👎 for Option 3 (Deprecate)
  • 💭 Comment with your reasoning - this is especially valuable! Please share:

    • How you currently use (or don't use) the Testset Generation module
    • What challenges or benefits you've experienced
    • Any suggestions for improvement

Your feedback will directly influence our decision-making process and help us build a better tool for everyone.


P.S. We're excited to share more about our v1.0 roadmap in the coming weeks, so stay tuned! ❤️

- Team Ragas

jjmachan avatar Aug 28 '25 12:08 jjmachan

This is where we use the TestsetGenerator to generate ground truth data for evaluations: https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/evals/generate_ground_truth.py It has produced better ground truth than any other technique I've tried. The main drawback is that it's also slower, especially as the number of documents increases. Fortunately, we don't have to run it very often.
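Roughly, the usage looks like this (a simplified sketch based on the Ragas 0.2-era documented API; exact import paths and parameter names have shifted across versions, and the model names and file paths here are just placeholders):

```python
from langchain_community.document_loaders import DirectoryLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.testset import TestsetGenerator

# Load the source documents the synthetic questions should be grounded in
docs = DirectoryLoader("data/", glob="**/*.md").load()

# Wrap the generator LLM and embeddings (model names are placeholders)
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

# Generate a small synthetic testset; runtime grows with the number of documents
testset = generator.generate_with_langchain_docs(docs, testset_size=10)
testset.to_pandas().to_csv("ground_truth.csv", index=False)
```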

I voted for a separate package, but I don't have a strong opinion about separate package vs. core.

pamelafox avatar Aug 28 '25 14:08 pamelafox

Synthetic testset generation is critically important to good evaluation, allowing teams to overcome a cold start problem when they do not have examples from production use that have been evaluated for ground-truth accuracy. To that end, I believe deprecation is a Very Bad Idea.

My use case is exactly the above: the synthetic testset provides a static benchmark over our documents that we can use to optimize our RAG pipeline and various hyperparameters. In my experience, testset generation scales somewhat poorly because documents, chunks, relationship builders, etc. are all held in memory, and I have submitted some PRs to make some of these processes more efficient (though they are still memory-constrained).

I have no preference for separation vs. inclusion in Ragas main; I think it should be whatever allows the core developers the easiest maintenance. Testset generation is not necessarily tightly coupled with the eval suite, but it is nice to have the evaluation dataset created in the same data structure that the evaluator uses. I also believe some underlying abstractions (e.g., PydanticPrompt) are used in both testset generation and evaluation, so splitting them into separate packages might mean duplicating code or taking on a cross-package dependency.

Suggestions for improvement:

  • use a local-disk solution to reduce system memory requirements (sqlite, tursodb); bonus if the solution provides optimized vector operations (see the sketch after this list)
  • reduce reliance on langchain (personal pet peeve)
  • improve lineage tracing for the document/chunk that was used to synthesize each question/answer
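To make the first suggestion concrete, here is a hypothetical sketch (not current Ragas code; the table layout and helper names are made up) of keeping chunks and their relationships in SQLite instead of in memory:

```python
import json
import sqlite3

# Hypothetical schema: chunks and chunk-to-chunk relationships live on disk,
# so knowledge-graph construction doesn't have to hold everything in RAM.
conn = sqlite3.connect("testset_kg.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS chunks (
    id TEXT PRIMARY KEY,
    doc_id TEXT,
    content TEXT,
    metadata TEXT          -- JSON blob
);
CREATE TABLE IF NOT EXISTS relationships (
    source_id TEXT,
    target_id TEXT,
    kind TEXT,             -- e.g. "entity_overlap", "cosine_similarity"
    score REAL
);
""")

def add_chunk(chunk_id, doc_id, content, metadata):
    conn.execute(
        "INSERT OR REPLACE INTO chunks VALUES (?, ?, ?, ?)",
        (chunk_id, doc_id, content, json.dumps(metadata)),
    )

def add_relationship(source_id, target_id, kind, score):
    conn.execute(
        "INSERT INTO relationships VALUES (?, ?, ?, ?)",
        (source_id, target_id, kind, score),
    )

conn.commit()
```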

ahgraber avatar Aug 28 '25 14:08 ahgraber