Retrieval Metrics: Updating HitRate and MRR for Evaluation@K documents retrieved. Also adding RR as a separate metric
Side notes:
- Quick thanks to all the devs who've worked on LlamaIndex. It has been instrumental in supercharging my ability to build 💪
- I'd love any feedback that I can iterate on, to help get this merged
Description
HitRate edit (HitRate@K)
- Changed to allow for non-binary scoring, to widen the evaluation value of this metric (from my perspective)
- Example: 5/10 relevant docs retrieved scores 0.5 (instead of 1.0), and 2/10 would score 0.2 (instead of 1.0), allowing for a much more detailed analysis of the number of expected documents retrieved (see the sketch below)
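To make the scoring concrete, here is a minimal sketch of the non-binary idea based on the example above; the function name, signature, and empty-list check are illustrative and not necessarily the PR's exact implementation:

```python
from typing import List


def granular_hit_rate(expected_ids: List[str], retrieved_ids: List[str]) -> float:
    """Illustrative sketch: fraction of expected docs present in the retrieved list."""
    if not expected_ids or not retrieved_ids:
        raise ValueError("Both expected_ids and retrieved_ids must be non-empty")
    expected_set = set(expected_ids)  # set membership checks are O(1) vs. O(n) for a list
    hits = sum(1 for doc_id in retrieved_ids if doc_id in expected_set)
    return hits / len(expected_ids)


# 5 of 10 expected docs retrieved -> 0.5; the original binary HitRate would score 1.0
```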
RR edit
- RR: The original implementation of MRR returns after a single reciprocal rank calculation, so I renamed it RR for clarity
MRR edit (MRR@K)
- MRR: MRR now calculates the mean reciprocal rank score across all relevant retrieved docs within a single call
- Example: 2 of the 3 retrieved docs are relevant, at the 1st and 3rd positions, which scores (1 + 1/3) / 2 ≈ 0.67 (see the sketch below)
- Idea: a new name? It could more precisely be called single-query mean reciprocal rank (SQMRR) for clarity
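For comparison, a hedged sketch of the granular MRR idea described above; again, the function name and signature are illustrative:

```python
from typing import List


def granular_mrr(expected_ids: List[str], retrieved_ids: List[str]) -> float:
    """Illustrative sketch: mean of reciprocal ranks over all relevant retrieved docs."""
    if not expected_ids or not retrieved_ids:
        raise ValueError("Both expected_ids and retrieved_ids must be non-empty")
    expected_set = set(expected_ids)
    reciprocal_ranks = [
        1.0 / rank
        for rank, doc_id in enumerate(retrieved_ids, start=1)
        if doc_id in expected_set
    ]
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0


# relevant docs at positions 1 and 3 out of 3 retrieved -> (1 + 1/3) / 2 ≈ 0.67
```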
Other changes made
- Expected IDs are now stored in a set to speed up membership checks vs. a list (only in HitRate and MRR). Future implementation note: this should NOT be used in any metric where the order of the expected IDs matters, since casting a list to a set does not preserve element order
- Error handling now also catches empty lists passed for retrieved IDs or expected IDs (raises ValueError)
- Removed unused parameters
- Added RR to metric registry
Type of Change
- [x] New feature (non-breaking change which adds functionality)
- [x] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [x] This change requires a documentation update
How Has This Been Tested?
- [x] Added new unit/integration tests
- [ ] Added new notebook (that tests end-to-end)
- [x] I stared at the code and made sure it makes sense
Suggested Checklist:
- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] I have added Google Colab support for the newly added notebooks.
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my feature works
- [x] New and existing unit tests pass locally with my changes
- [x] I ran `make format; make lint` to appease the lint gods
Thanks for the PR @AgenP. I have left some comments. I agree with you on RR / MRR, but I don't think that change is worth the breaking changes. I like the new "HitRate" metric, but we should leave the original one as is and create a net new one.
Hey @nerdai, thank you very much for the feedback. I've sent in my responses to your comments.
Here is a quick summary of my proposed action steps:
- Revert any changes I've made to the `compute` method signature
- Create a net new SQMRR metric for additional flexibility
- Keep HitRate as the same metric, but with the new implementation (justification above)
- Remove RR from metrics (including metric registry)
Let me know if these are all good to be implemented, or if there is room for improvement.
I think it makes sense :). I left replies to your latest comments!
Hey @nerdai, the iterations have been made! Here is a summary of the changes I've now implemented:
MRR and HitRate changes
- `compute` method signatures are now the same as BaseRetrievalMetric's
- Both have a granular implementation option through usage of a kwarg
- Detailed docstrings added to enhance explainability with these new changes
RR removed
- The previously proposed separate RR metric has been removed (both the class and its entry in the metric registry)
Testing & Formatting
- New unit tests all pass, no additional warnings generated
- Formatting/Linting handled by pre-commit hooks
Lmk if there are any issues 💪
amazing thanks for the thorough summary @AgenP !
Looks very good! I just think that maybe instead of using kwargs on `compute` to learn if we should use granular compute or not, maybe we just make it an instance attribute.
Thanks @nerdai!
My thinking with this was that since the BaseRetrievalMetric superclass' `compute` method also has `**kwargs` as a parameter, using a kwarg to distinguish the two implementations keeps the method signature the same.
What do you think?
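For context, a rough sketch of what the kwarg-based toggle under discussion could look like; the import path, kwarg name, and values are assumptions for illustration, not code from the PR:

```python
# Assumed import path; it may differ depending on the llama_index version.
from llama_index.core.evaluation.retrieval.metrics import HitRate

# The compute signature stays identical to BaseRetrievalMetric's, but the
# option is tucked away inside **kwargs.
hit_rate = HitRate()
result = hit_rate.compute(
    query="example query",
    expected_ids=["id1", "id2", "id3"],
    retrieved_ids=["id1", "id4", "id3"],
    use_granular_hit_rate=True,  # hypothetical kwarg name
)
```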
Sorry for the late reply!
Yeah, I just think it's a bit odd that we have to tuck these args away in `kwargs`. I think it's harder for the user to know about this option. So, a way to not break the superclass' `compute` signature is just to create a class/instance attribute for these params instead:
```python
class HitRate(BaseRetrievalMetric):
    """Hit rate metric."""

    metric_name: str = "hit_rate"
    use_granular_mrr: bool = False

    ...

    def compute(...):
        if self.use_granular_mrr:
            ...
```
How about something like this?
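For illustration, a hypothetical usage of the instance-attribute approach sketched above; the attribute name follows that sketch, and the import path is an assumption:

```python
# Assumed import path; it may differ depending on the llama_index version.
from llama_index.core.evaluation.retrieval.metrics import HitRate

# The flag is visible at construction time instead of being hidden in **kwargs
# (assuming the attribute is declared as a field, as in the sketch above).
granular_hit_rate = HitRate(use_granular_mrr=True)
result = granular_hit_rate.compute(
    query="example query",
    expected_ids=["id1", "id2"],
    retrieved_ids=["id2", "id3"],
)
print(result.score)
```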
No worries whatsoever!
That makes complete sense. Thank you for highlighting this.
The new iteration has been committed!
@AgenP thanks again -- merging this for us now :)