# Port evaluation from 1.x and extend to LLMs
We need to extend the evaluation features, in particular for RAG pipelines, so that users can answer questions such as:
- Is this pipeline good enough?
- What should I focus on for optimization?
- Is pipeline A better than B? (performance, costs, latency)
This covers the components that typically appear in RAG pipelines:
- Retrievers, Rankers, DocumentJoiners:
  a) labels available => statistical metrics (see the first sketch after this list)
  b) no labels available => model-based heuristics / pseudo-label generator
- Generators:
  a) labels available => model-based metrics (SAS, answer correctness, ...; see the second sketch after this list)
  b) no labels available => model-based metrics (groundedness score)
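For the label-based statistical case, here is a minimal sketch of two common retrieval metrics, recall@k and MRR. The function names and data shapes are illustrative assumptions, not Haystack APIs:

```python
from typing import List


def recall_at_k(retrieved_ids: List[str], relevant_ids: List[str], k: int) -> float:
    """Fraction of labeled relevant documents that appear in the top-k retrieved list."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_reciprocal_rank(retrieved_ids: List[str], relevant_ids: List[str]) -> float:
    """Reciprocal rank of the first relevant document, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Example: one query with labeled relevant documents.
retrieved = ["doc_3", "doc_7", "doc_1", "doc_9"]
relevant = ["doc_1", "doc_4"]
print(recall_at_k(retrieved, relevant, k=3))      # 0.5
print(mean_reciprocal_rank(retrieved, relevant))  # 1/3
```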
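For the model-based Generator case, here is a sketch in the spirit of SAS (semantic answer similarity): embed the predicted and labeled answers and compare them with cosine similarity. The bi-encoder model choice is an assumption; 1.x also supported cross-encoder variants of SAS:

```python
from sentence_transformers import SentenceTransformer, util

# Assumed model; any sentence-embedding model works for this sketch.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

predicted_answer = "Berlin is the capital of Germany."
labeled_answer = "The capital of Germany is Berlin."

# Embed both answers and score them with cosine similarity in [-1, 1].
embeddings = model.encode([predicted_answer, labeled_answer])
sas_score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"SAS-style similarity: {sas_score:.3f}")  # close to 1.0 for paraphrases
```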
### Tasks
- [ ] https://github.com/deepset-ai/haystack/issues/6061
- [ ] https://github.com/deepset-ai/haystack/issues/6786