Retrieval Metrics: Updating HitRate and MRR for Evaluation@K documents retrieved. Also adding RR as a separate metric
Side notes:
- Quick thanks to all the devs who've worked on LlamaIndex. It has been instrumental in supercharging my ability to build 💪
- I'd love any feedback that I can iterate on, to help get this merged
Description
HitRate edit (HitRate@K)
- Changed to allow for non-binary scoring, to widen the evaluation value of this metric (from my perspective)
- Example: 5/10 relevant docs retrieved scores 0.5 (instead of 1.0), and 2/10 would score 0.2 (instead of 1.0), allowing for a much more detailed analysis of the number of expected documents retrieved (see the sketch below)
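To make the scoring concrete, here is a minimal sketch of the non-binary idea based on the example above; the function name, signature, and empty-list check are illustrative and not necessarily the PR's exact implementation:

```python
from typing import List


def granular_hit_rate(expected_ids: List[str], retrieved_ids: List[str]) -> float:
    """Illustrative sketch: fraction of expected docs present in the retrieved list."""
    if not expected_ids or not retrieved_ids:
        raise ValueError("Both expected_ids and retrieved_ids must be non-empty")
    expected_set = set(expected_ids)  # set membership checks are O(1) vs. O(n) for a list
    hits = sum(1 for doc_id in retrieved_ids if doc_id in expected_set)
    return hits / len(expected_ids)


# 5 of 10 expected docs retrieved -> 0.5; the original binary HitRate would score 1.0
```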
RR edit
- RR: The original implementation of MRR returns after a single reciprocal rank calculation, so I renamed it RR for clarity
MRR edit (MRR@K)
- MRR: MRR now calculates the mean reciprocal rank score across all relevant retrieved docs within a single call
- Example: 2 of the 3 retrieved docs are relevant, at the 1st and 3rd positions, which scores (1 + 1/3) / 2 ≈ 0.67 (see the sketch below)
- Idea: a new name? It could more precisely be called single-query mean reciprocal rank (SQMRR) for clarity
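For comparison, a hedged sketch of the granular MRR idea described above; again, the function name and signature are illustrative:

```python
from typing import List


def granular_mrr(expected_ids: List[str], retrieved_ids: List[str]) -> float:
    """Illustrative sketch: mean of reciprocal ranks over all relevant retrieved docs."""
    if not expected_ids or not retrieved_ids:
        raise ValueError("Both expected_ids and retrieved_ids must be non-empty")
    expected_set = set(expected_ids)
    reciprocal_ranks = [
        1.0 / rank
        for rank, doc_id in enumerate(retrieved_ids, start=1)
        if doc_id in expected_set
    ]
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0


# relevant docs at positions 1 and 3 out of 3 retrieved -> (1 + 1/3) / 2 ≈ 0.67
```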
Other changes made
- Expected IDs are now stored in a set to speed up membership checks vs. a list (only in HitRate and MRR). Future implementation note: this should NOT be used in any metric where the order of the expected IDs matters, since casting a list to a set does not preserve element order
- Error handling now also catches empty lists passed for retrieved IDs or expected IDs (raises ValueError)
- Removed unused parameters
- Added RR to metric registry
Type of Change
- [x] New feature (non-breaking change which adds functionality)
- [x] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [x] This change requires a documentation update
How Has This Been Tested?
- [x] Added new unit/integration tests
- [ ] Added new notebook (that tests end-to-end)
- [x] I stared at the code and made sure it makes sense
Suggested Checklist:
- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] I have added Google Colab support for the newly added notebooks.
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my feature works
- [x] New and existing unit tests pass locally with my changes
- [x] I ran `make format; make lint` to appease the lint gods
Thanks for the PR @AgenP. I have left some comments. I agree with you on RR / MRR, but I don't think that change is worth the breaking changes. I like the new "HitRate" metric, but we should leave the original one as is and create a net new one.
Hey @nerdai, thank you very much for the feedback. I've sent in my responses to your comments.
Here is a quick summary of my proposed action steps:
- Revert any changes I've made to the `compute` method signature
- Create a net new SQMRR metric for additional flexibility
- Keep HitRate as the same metric, but with the new implementation (justification above)
- Remove RR from metrics (including metric registry)
Let me know if these are all good to be implemented, or if there is room for improvement.
I think it makes sense :). I left replies to your latest comments!
Hey @nerdai, the iterations have been made! Here is a summary of the changes I've now implemented:
MRR and HitRate changes
- `compute` method signatures are now the same as BaseRetrievalMetric's
- Both have a granular implementation option through usage of a kwarg
- Detailed docstrings added to enhance explainability with these new changes
RR removed
- The previously proposed separate RR metric has been removed (both the class and its entry in the metric registry)
Testing & Formatting
- New unit tests all pass, no additional warnings generated
- Formatting/Linting handled by pre-commit hooks
Lmk if there are any issues 💪
amazing thanks for the thorough summary @AgenP !
Looks very good! I just think that maybe instead of using kwargs on `compute` to learn if we should use granular compute or not, maybe we just make it an instance attribute.
Thanks @nerdai!
My thinking with this was that since the BaseRetrievalMetric superclass' `compute` method also has `**kwargs` as a parameter, using a kwarg to distinguish the two implementations keeps the method signature the same.
What do you think?
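For context, a rough sketch of what the kwarg-based toggle under discussion could look like; the import path, kwarg name, and values are assumptions for illustration, not code from the PR:

```python
# Assumed import path; it may differ depending on the llama_index version.
from llama_index.core.evaluation.retrieval.metrics import HitRate

# The compute signature stays identical to BaseRetrievalMetric's, but the
# option is tucked away inside **kwargs.
hit_rate = HitRate()
result = hit_rate.compute(
    query="example query",
    expected_ids=["id1", "id2", "id3"],
    retrieved_ids=["id1", "id4", "id3"],
    use_granular_hit_rate=True,  # hypothetical kwarg name
)
```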
Sorry for the late reply!
Yeah, I just think it's a bit odd that we have to tuck these args away in `kwargs`. I think it's harder for the user to know about this option. So, a way to not break the superclass' `compute` signature is just to create a class/instance attribute for these params instead:
```python
class HitRate(BaseRetrievalMetric):
    """Hit rate metric."""

    metric_name: str = "hit_rate"
    use_granular_mrr: bool = False

    ...

    def compute(...):
        if self.use_granular_mrr:
            ...
```
How about something like this?
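For illustration, a hypothetical usage of the instance-attribute approach sketched above; the attribute name follows that sketch, and the import path is an assumption:

```python
# Assumed import path; it may differ depending on the llama_index version.
from llama_index.core.evaluation.retrieval.metrics import HitRate

# The flag is visible at construction time instead of being hidden in **kwargs
# (assuming the attribute is declared as a field, as in the sketch above).
granular_hit_rate = HitRate(use_granular_mrr=True)
result = granular_hit_rate.compute(
    query="example query",
    expected_ids=["id1", "id2"],
    retrieved_ids=["id2", "id3"],
)
print(result.score)
```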
No worries whatsoever!
That makes complete sense. Thank you for highlighting this.
The new iteration has been committed!
@AgenP thanks again -- merging this for us now :)