[R-259] Which is the best LLM for evaluation?

Open yadavshashank opened this issue 1 year ago • 0 comments

I checked the documentation and related resources and couldn't find an answer to my question.

Your Question

Do RAGAS prompts work equally well with other LLMs such as Claude 3 Sonnet and Llama 3? If not, which model should I choose? Also, is there a way to print and modify the prompts?

Additional context

I see a large variation in scores across different models.
GPT-3.5 gives me higher scores than the others on all metrics. Claude 3 Sonnet's scores are roughly the average of Llama 3's and GPT-4 Turbo's. Llama 3 and Cohere Command often return NaN for some metrics.
My evaluation set has 19 records.
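
With only 19 records and some NaN outputs, per-model averages may be computed over different numbers of records, which could account for part of the variation. Below is a minimal sketch of how scores could be summarized so that NaN counts are reported next to the mean. It is plain Python and does not use any RAGAS API; `summarize_scores`, the model names, and the values are hypothetical placeholders, not my actual results.

```python
import math

def summarize_scores(scores):
    """Return (mean of the non-NaN scores, number of NaN scores) for one metric."""
    valid = [s for s in scores if not math.isnan(s)]
    nan_count = len(scores) - len(valid)
    # If every score is NaN there is nothing to average, so report NaN.
    mean = sum(valid) / len(valid) if valid else math.nan
    return mean, nan_count

# Placeholder values for illustration only, not real evaluation results.
example = {
    "model_a": [0.5, 1.0, float("nan"), 0.75],
    "model_b": [0.25, 0.75, 1.0, 1.0],
}

for model, scores in example.items():
    mean, nan_count = summarize_scores(scores)
    used = len(scores) - nan_count
    print(f"{model}: mean={mean:.2f} over {used} records, NaN={nan_count}")
```

Reporting the NaN count alongside the mean makes it visible when two models with similar averages were actually scored on different subsets of the evaluation set.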

Radar chart comparison of scores: ragas_radar_model_comp

May 21 '24 12:05 yadavshashank