# Port evaluation from 1.x and extend to LLMs
We need to extend the evaluation features, in particular for RAG pipelines, so that users can answer questions such as:
- Is this pipeline good enough?
- What should I focus on for optimization?
- Is pipeline A better than B? (performance, costs, latency)
This covers the components that typically appear in RAG pipelines:
- Retrievers, Rankers, DocumentJoiners:
  a) labels available => statistical metrics (see the first sketch after this list)
  b) no labels available => model-based heuristics / pseudo-label generator
- Generators:
  a) labels available => model-based metrics (SAS, answer correctness, ...; see the second sketch after this list)
  b) no labels available => model-based metrics (groundedness score)
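For the label-based statistical case, here is a minimal sketch of two common retrieval metrics, recall@k and MRR. The function names and data shapes are illustrative assumptions, not Haystack APIs:

```python
from typing import List


def recall_at_k(retrieved_ids: List[str], relevant_ids: List[str], k: int) -> float:
    """Fraction of labeled relevant documents that appear in the top-k retrieved list."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_reciprocal_rank(retrieved_ids: List[str], relevant_ids: List[str]) -> float:
    """Reciprocal rank of the first relevant document, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Example: one query with labeled relevant documents.
retrieved = ["doc_3", "doc_7", "doc_1", "doc_9"]
relevant = ["doc_1", "doc_4"]
print(recall_at_k(retrieved, relevant, k=3))      # 0.5
print(mean_reciprocal_rank(retrieved, relevant))  # 1/3
```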
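For the model-based Generator case, here is a sketch in the spirit of SAS (semantic answer similarity): embed the predicted and labeled answers and compare them with cosine similarity. The bi-encoder model choice is an assumption; 1.x also supported cross-encoder variants of SAS:

```python
from sentence_transformers import SentenceTransformer, util

# Assumed model; any sentence-embedding model works for this sketch.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

predicted_answer = "Berlin is the capital of Germany."
labeled_answer = "The capital of Germany is Berlin."

# Embed both answers and score them with cosine similarity in [-1, 1].
embeddings = model.encode([predicted_answer, labeled_answer])
sas_score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"SAS-style similarity: {sas_score:.3f}")  # close to 1.0 for paraphrases
```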
### Tasks
- [ ] https://github.com/deepset-ai/haystack/issues/6061
- [ ] https://github.com/deepset-ai/haystack/issues/6786