`evaluate` supporting replicates
Describe the Feature
It would be nice to do something like evaluate(..., num_replicates=30) so I can calculate mean/std dev of accuracy on a benchmark EvaluationDataset.
What I mean by replicates is running the task N times in parallel and computing aggregate metrics across the parallel runs.
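This is not part of the original request, but a minimal sketch of what such a wrapper could look like. The `run_once` callable and the helper name `evaluate_with_replicates` are hypothetical stand-ins; in practice `run_once` would wrap a single RAGAS `evaluate()` call:

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, stdev

def evaluate_with_replicates(run_once, num_replicates=30, max_workers=8):
    """Run a single-shot evaluation N times in parallel and aggregate.

    run_once: callable returning a dict of metric name -> float.
    (Hypothetical helper; stands in for one RAGAS evaluate() call.)
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        runs = list(pool.map(lambda _: run_once(), range(num_replicates)))
    # Aggregate each metric across the replicate runs.
    summary = {}
    for name in runs[0]:
        values = [r[name] for r in runs]
        summary[name] = {
            "mean": mean(values),
            "std": stdev(values) if len(values) > 1 else 0.0,
        }
    return summary
```

With a real LLM-backed metric, the variance across replicates comes from non-deterministic model calls, which is exactly what the mean/std is meant to capture.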
Why is the feature important for you?
Statistical significance is important: a single run gives only a point estimate of each metric, with no measure of variance across runs.
Additional context
I have a custom task, and am trying to compare trained models' performance on that task.
Hi, I've implemented support for evaluate(..., num_replicates=N) as requested in this issue.
It wraps the existing evaluate() from RAGAS and runs it N times in parallel. It then returns statistical summaries like mean, std, median, min, max, and 95% confidence intervals for each metric.
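For reference, here is a sketch of the kind of per-metric summary described above, using only the standard library. The normal-approximation 95% confidence interval (mean ± 1.96·std/√N) is one common choice, and the function name `summarize` is illustrative, not the actual implementation:

```python
from math import sqrt
from statistics import mean, median, stdev

def summarize(values):
    """Mean, std, median, min, max and a normal-approximation 95% CI
    for one metric's scores across replicate runs. (Illustrative helper.)"""
    n = len(values)
    m = mean(values)
    s = stdev(values) if n > 1 else 0.0
    half_width = 1.96 * s / sqrt(n)  # 95% CI assuming approximate normality
    return {
        "mean": m,
        "std": s,
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "ci95": (m - half_width, m + half_width),
    }
```

For small N (e.g. under ~30 replicates), a t-distribution critical value instead of 1.96 would give a slightly wider, more honest interval.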
Code: Vinisha-Projects/Evaluate_with_replicates-for-RAGAS
No API keys are hardcoded, and the implementation uses your existing dataset and metrics interface.
Let me know if you'd like me to submit a PR with this functionality!