
Retrieval Metrics: Updating HitRate and MRR for Evaluation@K over the retrieved documents; also adding RR as a separate metric

Open · AgenP opened this issue 10 months ago · 4 comments

Side notes:

  1. Quick thanks to all the devs who've worked on LlamaIndex. It has been instrumental in supercharging my ability to build 💪
  2. I'd love any feedback that I can iterate on, to help get this merged

Description

HitRate edit (HitRate@K)

  • Changed to allow for non-binary scoring, to widen this metric's evaluative value, from my perspective.

  • Example: retrieving 5 of 10 expected docs scores 0.5 (instead of 1.0), and 2 of 10 scores 0.2 (instead of 1.0), allowing for a much more detailed analysis of how many of the expected documents were retrieved (see the sketch below).
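A minimal sketch of the granular scoring described above (the function name and shape are illustrative, not the merged code):

def granular_hit_rate(expected_ids: list[str], retrieved_ids: list[str]) -> float:
    """Fraction of the expected docs that appear anywhere in the retrieved list."""
    if not expected_ids or not retrieved_ids:
        raise ValueError("Retrieved ids and expected ids must be provided")
    # Set intersection keeps each membership check O(1)
    hits = len(set(expected_ids) & set(retrieved_ids))
    return hits / len(expected_ids)

# 5 of 10 expected docs retrieved -> 0.5; 2 of 10 -> 0.2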

RR edit

  • RR: The original implementation of MRR breaks out after a single reciprocal rank calculation, so I renamed it RR for clarity (sketched below).
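For context, the pre-existing behavior being renamed looks roughly like this (a sketch, not the verbatim library code):

def reciprocal_rank(expected_ids: list[str], retrieved_ids: list[str]) -> float:
    """Reciprocal rank of the FIRST relevant doc; 0.0 if none is retrieved."""
    expected = set(expected_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in expected:
            return 1.0 / rank  # breaks out after the first hit
    return 0.0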

MRR edit (MRR@K)

  • MRR: MRR now calculates the mean reciprocal rank score across all relevant docs in the retrieved list, within a single call.
  • Example: 2 of 3 retrieved docs are relevant, sitting at the 1st and 3rd spots. This scores (1 + 1/3) / 2 ≈ 0.67 (see the sketch below).
  • Idea: a new name? More precisely, it could be called single-query mean reciprocal rank (SQMRR) for clarity.
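A sketch of that per-query averaging (helper name hypothetical):

def granular_mrr(expected_ids: list[str], retrieved_ids: list[str]) -> float:
    """Mean of the reciprocal ranks of ALL relevant docs in the retrieved list."""
    expected = set(expected_ids)
    reciprocal_ranks = [
        1.0 / rank
        for rank, doc_id in enumerate(retrieved_ids, start=1)
        if doc_id in expected
    ]
    if not reciprocal_ranks:
        return 0.0
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Relevant docs at ranks 1 and 3: (1/1 + 1/3) / 2 = 0.67
assert abs(granular_mrr(["a", "c"], ["a", "b", "c"]) - 2 / 3) < 1e-9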

Other changes made

  • Used a set data type for the expected IDs, to speed up membership checks vs. a list (only in HitRate and MRR). Future implementation note: this should NOT be used in any metric where the order of the expected IDs matters, since casting a list to a set may change the order of its elements (see the sketch after this list).
  • Error handling now also catches empty lists passed as args for the retrieved IDs or expected IDs (raises ValueError).
  • Removed unused parameters
  • Added RR to metric registry
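A sketch of the validation and set conversion described in the first two bullets (the method body is illustrative only):

def compute(self, expected_ids=None, retrieved_ids=None):
    # Empty or missing ID lists are treated as caller errors,
    # not silently scored as 0.0
    if not expected_ids or not retrieved_ids:
        raise ValueError("Retrieved ids and expected ids must be provided")
    # A set makes each membership check O(1) instead of O(n), but it
    # discards ordering; do NOT do this where expected-id order matters
    expected_set = set(expected_ids)
    hits = sum(1 for doc_id in retrieved_ids if doc_id in expected_set)
    return hits / len(expected_ids)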

Type of Change

  • [x] New feature (non-breaking change which adds functionality)
  • [x] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [x] This change requires a documentation update

How Has This Been Tested?

  • [x] Added new unit/integration tests
  • [ ] Added new notebook (that tests end-to-end)
  • [x] I stared at the code and made sure it makes sense

Suggested Checklist:

  • [x] I have performed a self-review of my own code
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have added Google Colab support for the newly added notebooks.
  • [x] My changes generate no new warnings
  • [x] I have added tests that prove my fix is effective or that my feature works
  • [x] New and existing unit tests pass locally with my changes
  • [x] I ran make format; make lint to appease the lint gods

AgenP · Apr 21 '24 13:04

Thanks for the PR @AgenP. I have left some comments. I agree with you on RR / MRR, but I don't think that change is worth the breaking changes. I like the new "HitRate" metric, but we should leave the original one as is and create a net new one.

Hey @nerdai, thank you very much for the feedback. I’ve sent in my responses to your comments.

Here is a quick summary of my proposed action steps

  1. Revert any changes I've made to the compute method signature
  2. Create a net new SQMRR metric for additional flexibility
  3. Keep HitRate as the same metric, but with the new implementation (justification above)
  4. Remove RR from metrics (including metric registry)

Let me know if these are all good to be implemented, or if there is room for improvement.

AgenP · Apr 23 '24 10:04

I think it makes sense :). I left replies to your latest comments!

nerdai · Apr 24 '24 03:04

Hey @nerdai, the iterations have been made! Here is a summary of the changes I've now implemented

MRR and HitRate changes

  • compute method signatures now match BaseRetrievalMetric's
  • Both have a granular implementation option, enabled via a kwarg
  • Detailed docstrings added to explain the new behavior

RR removed

  • The previously proposed separate RR metric has been removed (both the class and its entry in the metric registry)

Testing & Formatting

  • New unit tests all pass, no additional warnings generated
  • Formatting/Linting handled by pre-commit hooks

Lmk if there are any issues 💪

AgenP · Apr 26 '24 07:04

Amazing, thanks for the thorough summary @AgenP!

nerdai · Apr 26 '24 15:04

Looks very good! I just think that instead of using kwargs on compute to decide whether to use the granular compute, maybe we make it an instance attribute.

Thanks @nerdai!

My thinking was that since the BaseRetrievalMetric superclass' compute method also takes **kwargs, using a kwarg to distinguish the two implementations keeps the method signature the same (sketched below).

What do you think?
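The kwarg-based dispatch under discussion might have looked something like this (the flag and helper names are assumptions for illustration):

def compute(self, query=None, expected_ids=None, retrieved_ids=None, **kwargs):
    # The flag rides along in **kwargs, so the signature still matches
    # BaseRetrievalMetric -- but the option is easy for users to miss
    if kwargs.get("use_granular_mrr", False):
        return self._compute_granular(expected_ids, retrieved_ids)
    return self._compute_default(expected_ids, retrieved_ids)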

AgenP · Apr 29 '24 06:04

Sorry for the late reply!

Yeah, I just think it's a bit odd that we have to tuck these args away in kwargs; it's harder for the user to discover the option. So, a way to not break the superclass' compute signature is to create a class/instance attribute for these params instead:


class HitRate(BaseRetrievalMetric):
    """Hit rate metric."""

    metric_name: str = "hit_rate"
    use_granular_hit_rate: bool = False

    ...

    def compute(...):
        if self.use_granular_hit_rate:
            ...

How about something like this?
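Usage under that design might look like the following (a sketch: the use_granular_hit_rate name is taken from the snippet above, and the constructor kwarg assumes the flag can be set at instantiation like other attributes on these metric classes):

# Granular behavior is chosen once at construction time,
# so compute() keeps the superclass signature
metric = HitRate(use_granular_hit_rate=True)
score = metric.compute(
    query="what is llama_index?",
    expected_ids=["id1", "id2"],
    retrieved_ids=["id2", "id3"],
)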

nerdai · Apr 30 '24 16:04

No worries whatsoever!

That makes complete sense. Thank you for highlighting this.

The new iteration has been committed!

AgenP · May 01 '24 07:05

@AgenP thanks again -- merging this for us now :)

nerdai · May 01 '24 23:05