How to interpret the combination of metrics: context_precision and the rest (real-world example)
I ran ragas to evaluate my LangChain-powered chatbot (it's basically a QA chain with document retrieval) and I got the following results.
| question | ground_truth | faithfulness | answer_relevancy | context_recall | context_precision | context_relevancy |
|---|---|---|---|---|---|---|
| Q1 | GT1 | 1 | 0.813637991 | 1 | 0 | 0.002824859 |
| Q2 | GT2 | 1 | 0.835290922 | 0 | 0 | 0.002890173 |
| Q3 | GT3 | 1 | 0.882307479 | 1 | 0 | 0.002659574 |
| Q4 | GT4 | 1 | 0.844765424 | 0 | 0 | 0.01953125 |
| Q5 | GT5 | 1 | 0.889618083 | 1 | 0 | 0.017857143 |
As you can see, the context_precision values (a variant of context_relevancy, which I think will be deprecated according to the docs) are very low, i.e. terrible. So I did some debugging to understand the intermediate calculations (I didn't grasp everything, but I got the general idea), and I'm wondering how this situation is possible. Here is how I interpret it; please correct me if I'm wrong:
- context_recall = 1.00: it can retrieve all the relevant information required to answer the question (YES)
- context_precision = 0.00: the signal-to-noise ratio of the retrieved context; (almost) everything retrieved is noise
For example, I checked one answer, and this is how the context_precision metric evaluated the two retrieved documents:

```
[[ChatGeneration(text='No.', generation_info={'finish_reason': 'stop'}, message=AIMessage(content='No.'))]]
```
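If I'm reading the intermediates correctly (this is my assumption from the debug output, not confirmed against the ragas source), context_precision turns each per-chunk LLM verdict into a 0/1 and then averages precision@k over the chunks judged useful, so two "No." verdicts force a 0 regardless of faithfulness. A minimal sketch:

```python
def context_precision_sketch(verdicts):
    """Average precision@k over chunks the LLM judged useful (0/1 verdicts).

    Mirrors my reading of the intermediate output; the real ragas
    implementation may differ in how it weights ranks.
    """
    relevant_so_far = 0
    precisions = []
    for k, verdict in enumerate(verdicts, start=1):
        if verdict:
            relevant_so_far += 1
            precisions.append(relevant_so_far / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Two retrieved chunks, both judged "No." by the LLM:
print(context_precision_sketch([0, 0]))  # -> 0.0
# Same two chunks, but both judged useful:
print(context_precision_sketch([1, 1]))  # -> 1.0
```

Under this reading, the score depends only on the per-chunk verdicts, which is why it can be 0 while faithfulness and answer_relevancy stay high.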
Yet the faithfulness is 1 and the answer_relevancy is 0.81... I'm really confused; maybe I'm missing something, but I'd like to understand how to interpret not only each metric independently, but also the combinations of their values and what they entail.
Thank you,
I'm also wondering if this is a side effect of the (relatively) long chunks in my docs (around 500 tokens each)? I don't know whether that affects the calculation.
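If chunk length does turn out to matter, one cheap experiment is re-indexing with smaller chunks before re-running the eval. A toy word-window splitter to illustrate the idea (the sizes are hypothetical; a real LangChain pipeline would use its built-in text splitters instead):

```python
def split_into_chunks(text, max_words=120, overlap=20):
    """Split text into overlapping word-window chunks (toy example)."""
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunk = words[start:start + max_words]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + max_words >= len(words):
            break
    return chunks

doc = "lorem " * 300  # stand-in for a ~500-token chunk
print(len(split_into_chunks(doc)))  # -> 3
```

Smaller chunks would let the per-chunk relevance verdicts separate the useful text from the padding, which is exactly what context_precision seems to be measuring.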
@shahules786: could you please take a look at this?
Hi @younes-io, this is an interesting but weird result. Would you be able to share a subset of your data so that I can understand what's going on?
@shahules786 I'm afraid I can't share that, since it's private data.
Basically, I have document chunks (say, 2) returned by OpenSearch that contain the answer to the question. The first document contains the answer; the second contains only a small portion of it, and is also the larger of the two.
I'm just wondering whether ragas takes the ratio of "relevance to the question / length of the context" into account when calculating context_precision.
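As far as I can tell from the intermediate output, ragas does not normalize by length; the verdicts are binary per chunk. Purely to illustrate the idea I'm asking about, a hypothetical length-aware variant could weight each chunk's verdict by the fraction of its text that is actually relevant:

```python
def length_aware_precision(chunks):
    """Hypothetical metric: weight each chunk by the share of its text
    that pertains to the question.

    `chunks` is a list of (relevant_chars, total_chars) pairs.
    This is NOT how ragas works; it only illustrates my question.
    """
    if not chunks:
        return 0.0
    return sum(rel / total for rel, total in chunks) / len(chunks)

# Doc 1: short, fully answers the question. Doc 2: long, mostly padding.
print(length_aware_precision([(200, 200), (50, 1000)]))  # -> 0.525
```

A metric like this would penalize my second (large, mostly irrelevant) document without zeroing out the first one.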
@shahules786: I tested with the example from the ragas docs.
So, I used this dataset:
```python
from datasets import load_dataset

fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")
fiqa_eval
```
and here's the result:
| | question | contexts | answer | ground_truths | context_precision | faithfulness | answer_relevancy | context_recall | context_relevancy |
|---|---|---|---|---|---|---|---|---|---|
| 0 | How to deposit a cheque issued to an associate... | [Just have the associate sign the back and the... | \nThe best way to deposit a cheque issued to a... | [Have the check reissued to the proper payee.J... | 0.0 | 1.0 | 0.938239 | 0.875 | 0.058824 |
| 1 | Can I send a money order from USPS as a business? | [Sure you can. You can fill in whatever you w... | \nYes, you can send a money order from USPS as... | [Sure you can. You can fill in whatever you w... | 0.0 | 0.8 | 0.885277 | 1.000 | 0.285714 |
| 2 | 1 EIN doing business under multiple business n... | [You're confusing a lot of things here. Compan... | \nYes, it is possible to have one EIN doing bu... | [You're confusing a lot of things here. Compan... | 0.0 | 0.8 | 0.924754 | 0.000 | 0.083333 |
| 3 | Applying for and receiving business credit | [Set up a meeting with the bank that handles y... | \nApplying for and receiving business credit c... | ["I'm afraid the great myth of limited liabili... | 0.0 | 1.0 | 0.899104 | 0.500 | 0.333333 |
| 4 | 401k Transfer After Business Closure | [The time horizon for your 401K/IRA is essenti... | \nIf your employer has closed and you need to ... | [You should probably consult an attorney. Howe... | 0.0 | 0.6 | 0.853572 | 0.000 | 0.043478 |
The context_precision is zero (or near-zero) on every row.
N.B.: in the docs, the context_precision column is not displayed.
@shahules786: sorry for bothering you; is someone from the team or community able to help with this, please? Thank you.
Hi @younes-io , apologies for the late reply. Can you share your ragas version and LLM used?
Also, can you try the same thing with the latest ragas from main? You can install from source with `pip install git+https://github.com/explodinggradients/ragas`.
@younes-io If you're open to a short call, I would love to help in person. Please book a slot here (early next week).
@shahules786 no worries, and I'm also very sorry for the very late reply... Sure, I'll book a slot!