ragas
[R-259] Which is the best LLM for evaluation?
I checked the documentation and related resources and couldn't find an answer to my question.
Your Question
Do RAGAS prompts work equally well with other LLMs such as Claude 3 Sonnet and Llama 3? If not, which model should I choose? Also, is there a way to print and modify the prompts?
Additional context
- I can see a big variation in scores across different models.
- GPT-3.5 gives me higher scores than the other models on all metrics. Claude 3 Sonnet's scores are roughly the average of those from Llama 3 and GPT-4 Turbo. Llama 3 and Cohere Command often return NaN for some metrics.
- My evaluation set has 19 records.
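For reference, here is a minimal sketch of how per-model scores could be summarized so that NaN outputs are counted separately instead of silently distorting the average. It uses only the Python standard library and does not call any RAGAS API. The helper name `summarize`, the model labels, and the sample values are hypothetical placeholders, not results from my evaluation.

```python
import math
from statistics import mean


def summarize(scores):
    """Return (mean of the non-NaN scores, number of NaN entries) for one metric."""
    valid = [s for s in scores if not math.isnan(s)]
    nan_count = len(scores) - len(valid)
    # If every record produced NaN there is nothing to average.
    avg = mean(valid) if valid else float("nan")
    return avg, nan_count


# Hypothetical placeholder values for illustration only.
example_scores = {
    "model_a": [0.5, 1.0, 0.0],
    "model_b": [0.5, float("nan"), 1.0],
}

for model, scores in example_scores.items():
    avg, nan_count = summarize(scores)
    used = len(scores) - nan_count
    print(f"{model}: mean={avg:.2f} over {used} records, NaN={nan_count}")
```

With only 19 records, a few NaN rows change how many records each model's average is actually based on, so reporting the NaN count next to the mean makes the comparison across models easier to interpret.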
Radar chart comparison of scores: