haystack feat: Add Lost In The Middle Ranker

Lost In The Middle Ranker

This ranker reorders documents into the "Lost in the Middle" order: the best documents (low index in the given list of documents) are placed at the beginning and the end of the resulting list, while the worst documents (high index in the given list of documents) are placed in the middle.
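
As a rough illustration of the ordering described above, here is a minimal sketch that works on any sequence already sorted by relevance. The function name lost_in_the_middle_order is hypothetical and this is not the code from the PR; it only shows one way to produce a "best at the edges, worst in the middle" layout.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def lost_in_the_middle_order(items: Sequence[T]) -> List[T]:
    """Reorder relevance-sorted items so the best sit at both ends.

    Hypothetical helper for illustration only: items at even ranks
    (0, 2, 4, ...) fill the front, items at odd ranks (1, 3, 5, ...)
    fill the back in reverse, so the least relevant end up in the middle.
    """
    front: List[T] = []
    back: List[T] = []
    for rank, item in enumerate(items):
        if rank % 2 == 0:
            front.append(item)
        else:
            back.append(item)
    # Reverse the back half so the second-best item is the very last one.
    return front + back[::-1]


if __name__ == "__main__":
    # Input is assumed to be sorted from most to least relevant.
    print(lost_in_the_middle_order([1, 2, 3, 4, 5]))  # [1, 3, 5, 4, 2]
```

With five inputs ranked 1 (best) to 5 (worst), the best two land at positions first and last, and the worst lands in the centre.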

The Lost in the Middle Ranker contains these methods:

def __init__(self, word_count_threshold: Optional[int] = None, top_k: Optional[int] = None):

If 'word_count_threshold' is specified, this ranker includes all documents up until the point where adding another document would exceed the 'word_count_threshold'. The last document that causes the threshold to be breached will be included in the resulting list of documents, but all subsequent documents will be discarded.
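
The threshold behaviour described above can be sketched roughly as follows. This is an illustration on plain strings, not the PR's implementation: the helper name truncate_by_word_count is hypothetical, words are counted by whitespace splitting, and treating a total that exactly equals the threshold as "not yet exceeded" is an assumption.

```python
from typing import List


def truncate_by_word_count(texts: List[str], word_count_threshold: int) -> List[str]:
    """Keep relevance-ordered texts until the word budget is exceeded.

    Hypothetical helper for illustration only. The text that pushes the
    running total over the threshold is still kept; everything after it
    is dropped.
    """
    if word_count_threshold <= 0:
        raise ValueError("word_count_threshold must be a positive integer")

    kept: List[str] = []
    word_count = 0
    for text in texts:
        kept.append(text)
        word_count += len(text.split())
        # Assumption: only a strictly larger total counts as a breach.
        if word_count > word_count_threshold:
            break
    return kept


if __name__ == "__main__":
    docs = ["a b c", "d e f", "g h i"]
    # 3 words fit, the second text brings the total to 6 > 4 and is kept.
    print(truncate_by_word_count(docs, 4))  # ['a b c', 'd e f']
```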

def reorder_documents(self, documents: List[Document]) -> List[Document]:

Ranks documents based on the "lost in the middle" order. Assumes that all documents are ordered by relevance.

def run(self, query: str, documents: List[Document], top_k: Optional[int] = None) -> List[Document]:

Reranks documents based on the "lost in the middle" order. Returns a list of Documents reordered based on the input query.

Unit tests were written for the following cases:

The Lost In The Middle Ranker works with an odd number of documents.
The Lost In The Middle Ranker works with an even number of documents.
The Lost In The Middle Ranker works with two documents.
The Lost In The Middle Ranker initializes with default values.
The Lost In The Middle Ranker raises an error when the word count threshold is <= 0.
The Lost In The Middle Ranker with word count threshold works as expected.
An empty list of documents returns an empty list.
A single document returns the same document.
Tests that merging a list of non-textual documents raises a ValueError.
Tests that the lost in the middle order works with an odd number of documents and a top_k parameter.
Tests that the lost in the middle order works with an odd number of documents and an invalid top_k parameter.
Ranker retrieval pipeline where a sparse retriever and the lost in the middle ranker are connected on three documents and the top 2 documents are retrieved.
RAG pipeline on three documents where the embeddings are created and then the retrieved documents are ranked.
RAG pipeline on the Wikipedia dataset where the top 3 documents are returned after ranking.

Feb 15 '24 06:02 vrunm

This will close issue https://github.com/deepset-ai/haystack/issues/7011

Feb 16 '24 10:02 sjrl

Pull Request Test Coverage Report for Build 7978166240

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

0 of 0 changed or added relevant lines in 0 files are covered.
53 unchanged lines in 10 files lost coverage.
Overall coverage increased (+0.4%) to 89.633%

Files with Coverage Reduction	New Missed Lines	%
components/others/multiplexer.py	1	88.46%
components/connectors/openapi_service.py	2	95.83%
components/embedders/sentence_transformers_document_embedder.py	2	95.56%
components/embedders/sentence_transformers_text_embedder.py	2	94.59%
components/evaluators/statistical_evaluator.py	3	96.74%
components/audio/whisper_local.py	6	91.3%
utils/auth.py	6	93.27%
components/rankers/transformers_similarity.py	8	91.11%
components/readers/extractive.py	8	95.71%
core/pipeline/pipeline.py	15	94.6%
Total:	53

Totals
Change from base Build 7920674498:	0.4%
Covered Lines:	5153
Relevant Lines:	5749

💛 - Coveralls

Feb 16 '24 16:02 coveralls

Hello @vrunm, first of all, thank you so much for working on this pull request! The test revealed two remaining mypy issues:

haystack/components/rankers/lost_in_the_middle.py:62: error: Item "None" of "Optional[str]" has no attribute "split"  [union-attr]
haystack/components/rankers/lost_in_the_middle.py:78: error: Item "None" of "Optional[str]" has no attribute "split"  [union-attr]

If you need help with those just let me know and I will provide the fix.

Feb 19 '24 08:02 julian-risch

@julian-risch It would greatly help if you could suggest a way to tackle the mypy issue. Thanks for your help.

Feb 19 '24 08:02 vrunm

@vrunm The mypy issue occurred because the code calls content.split(), but content can be either a string or None. I added a check that content is not None and, on one line, an additional # type: ignore[union-attr] comment. I also removed a few test cases because they are more like end-to-end tests that we don't need here; one of them could be a candidate for our e2e test folder. Thanks for contributing this PR! I will merge it next.
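
For readers unfamiliar with the union-attr error, the kind of None check mentioned above might look roughly like this. It is a sketch under assumptions, not the merged code: Doc is a hypothetical stand-in for the real Document class, count_words is a hypothetical helper, and returning 0 for missing content is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a document whose content may be missing."""

    content: Optional[str] = None


def count_words(doc: Doc) -> int:
    """Count whitespace-separated words, tolerating content=None."""
    # Without this check, mypy reports:
    #   Item "None" of "Optional[str]" has no attribute "split"  [union-attr]
    if doc.content is None:
        return 0  # assumption: a document without text contributes no words
    # After the check, mypy narrows doc.content from Optional[str] to str.
    return len(doc.content.split())


if __name__ == "__main__":
    print(count_words(Doc("lost in the middle")))  # 4
    print(count_words(Doc()))  # 0
```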

Feb 20 '24 16:02 julian-risch