`evaluate` supporting replicates
Describe the Feature
It would be nice to do something like evaluate(..., num_replicates=30) so I can calculate mean/std dev of accuracy on a benchmark EvaluationDataset.
What I mean by replicates is running the task N times in parallel and computing aggregate metrics across the parallel runs.
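This is not part of the original request, but a minimal sketch of what such a wrapper could look like. The `run_once` callable and the helper name `evaluate_with_replicates` are hypothetical stand-ins; in practice `run_once` would wrap a single RAGAS `evaluate()` call:

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, stdev

def evaluate_with_replicates(run_once, num_replicates=30, max_workers=8):
    """Run a single-shot evaluation N times in parallel and aggregate.

    run_once: callable returning a dict of metric name -> float.
    (Hypothetical helper; stands in for one RAGAS evaluate() call.)
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        runs = list(pool.map(lambda _: run_once(), range(num_replicates)))
    # Aggregate each metric across the replicate runs.
    summary = {}
    for name in runs[0]:
        values = [r[name] for r in runs]
        summary[name] = {
            "mean": mean(values),
            "std": stdev(values) if len(values) > 1 else 0.0,
        }
    return summary
```

With a real LLM-backed metric, the variance across replicates comes from non-deterministic model calls, which is exactly what the mean/std is meant to capture.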
Why is the feature important for you?
Statistical significance is important: a single run gives only a point estimate of each metric, with no measure of variance across runs.
Additional context
I have a custom task, and am trying to compare trained models' performance on that task.
Hi, I've implemented support for evaluate(..., num_replicates=N) as requested in this issue.
It wraps the existing evaluate() from RAGAS and runs it N times in parallel. It then returns statistical summaries like mean, std, median, min, max, and 95% confidence intervals for each metric.
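For reference, here is a sketch of the kind of per-metric summary described above, using only the standard library. The normal-approximation 95% confidence interval (mean ± 1.96·std/√N) is one common choice, and the function name `summarize` is illustrative, not the actual implementation:

```python
from math import sqrt
from statistics import mean, median, stdev

def summarize(values):
    """Mean, std, median, min, max and a normal-approximation 95% CI
    for one metric's scores across replicate runs. (Illustrative helper.)"""
    n = len(values)
    m = mean(values)
    s = stdev(values) if n > 1 else 0.0
    half_width = 1.96 * s / sqrt(n)  # 95% CI assuming approximate normality
    return {
        "mean": m,
        "std": s,
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "ci95": (m - half_width, m + half_width),
    }
```

For small N (e.g. under ~30 replicates), a t-distribution critical value instead of 1.96 would give a slightly wider, more honest interval.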
Code: Vinisha-Projects/Evaluate_with_replicates-for-RAGAS
No API keys are hardcoded, and the implementation uses your existing dataset and metrics interface.
Let me know if you'd like me to submit a PR with this functionality!