Exception while running node 'Reader': pop index out of range + solution
When I split documents with the PreProcessor (currently a split length of 256 with an overlap of 160) and pass the resulting splits to the reader, I sometimes get the error `Exception: Exception while running node 'Reader': pop index out of range`. Looking into the stack trace, this happens in the `deduplicate_predictions` function in the reader/farm.py file when it finds the same answer twice due to the overlap.
After deeper digging I found the problem more precisely:
Here is your deduplicate_predictions function in farm.py:
I added 4 print statements, which print the following when the error occurs: 4 3 1 4 3 2 4 3 3 4 3 3 4 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3
Basically: if an answer has more than one relevant overlap, the function will try to pop the prediction at ans_idx more than once, which leads to the index-out-of-range error.
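To illustrate the failure mode described above, here is a minimal sketch (hypothetical data, not Haystack's actual predictions): after the first pop the list shrinks, so popping the same index a second time can fall out of range.

```python
# Hypothetical, simplified illustration of the double-pop failure.
predictions = ["pred_a", "pred_b", "pred_c"]
ans_idx = 2

predictions.pop(ans_idx)       # fine: the list had 3 items
try:
    predictions.pop(ans_idx)   # list now has 2 items, so index 2 is invalid
except IndexError as err:
    print(err)                 # "pop index out of range"
```
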
UPDATE:
Just adding the double break makes it work. I'm not sure this is how it is supposed to be, but at least there are no errors for me.
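For clarity, the "double break" workaround can be sketched like this (illustrative control flow with hypothetical names, not the actual farm.py code): a flag set in the inner loop triggers a second break in the outer loop.

```python
def deduplicate_sketch(predictions, overlap_indices):
    # overlap_indices maps a prediction's position to the indices of its
    # overlapping duplicates (a hypothetical structure, for illustration).
    for pos in range(len(predictions)):
        breaked = False
        for ans_idx in overlap_indices.get(pos, []):
            if ans_idx < len(predictions):
                predictions.pop(ans_idx)
                breaked = True
                break   # first break: leave the inner loop after one pop
        if breaked:
            break       # second break: leave the outer loop too
    return predictions

print(deduplicate_sketch(["a", "b", "c"], {0: [1, 2]}))  # -> ['a', 'c']
```
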
System:
- OS: using Google Colab
- GPU/CPU: T4
- Haystack version (commit or version number): latest
- DocumentStore: ElasticSearch
- Reader: deepset squad2
- Retriever: EmbeddingRetriever
@Koenlaermans Thank you for reporting this issue and investigating it! Your reasoning about how the reader's attempt to pop the prediction at ans_idx more than once leads to the error makes sense to me.
Regarding how to fix it: I think a better way to break out of the nested loop is to refactor this part of the code and refrain from using the if breaked: break.
To this end, we should refactor the code by creating a new function that contains the outer loop and everything in it. The line `breaked = True` can then become `return`; this way, we break out of the outer loop with the return statement. I would like to invite you to create a PR with this fix to contribute to Haystack! 🙂 A good starting point is our contributor guidelines here. If you don't have time to do it, just let us know.
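The suggested refactor can be sketched as follows (hypothetical, simplified names, not the real farm.py code): the nested loops move into a helper function, so a plain `return` exits both loops at once and replaces the `breaked = True` / `if breaked: break` pattern.

```python
def _deduplicate_once(predictions, overlap_indices):
    # The outer loop lives in its own function now: returning here exits
    # both loops in one step, so no `breaked` flag is needed.
    for pos in range(len(predictions)):
        for ans_idx in overlap_indices.get(pos, []):
            if ans_idx < len(predictions):
                predictions.pop(ans_idx)
                return predictions   # was: breaked = True; break; ...; break
    return predictions
```
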
Sorry, but I currently don't have time to figure out how to do PRs and such; I have a thesis deadline in 1.5 weeks. Thanks for the feedback and for helping, though!
Alright, good luck with your thesis! 🎓
I have written a minimal example to reproduce the problem.
from haystack import Document
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import FARMReader, BM25Retriever, PreProcessor
from haystack.pipelines import ExtractiveQAPipeline
from haystack.utils import print_answers

document_store = InMemoryDocumentStore(use_bm25=True)

preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=6,
    split_overlap=5,
    split_respect_sentence_boundary=False,
)

documents = [
    Document(content="the second document adresseses life and the third onw talks about the stars.", meta={"name": "doc1"}),
    Document(content="the second document adresseses life and the third one talks about the stars and the fifth is unseen.", meta={"name": "doc2"}),
]

processed_docs = preprocessor.process(documents)
document_store.write_documents(processed_docs)

# Initialize the reader using a pre-trained model
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)

# Initialize the retriever with the document store
retriever = BM25Retriever(document_store=document_store)

# Create the retriever-reader pipeline
pipe = ExtractiveQAPipeline(reader, retriever)

# Ask a question
prediction = pipe.run(
    query="What is the second document about?",
    params={
        "Retriever": {"top_k": 10},
        "Reader": {"top_k": 5},
    },
)

print_answers(
    prediction,
    details="minimum",  # Choose from `minimum`, `medium`, and `all`
)
When implementing the solution that Koenlaermans proposed, this is the output I get:
'Query: What is the second document about?'
'Answers:'
[ { 'answer': 'life and the',
'context': 'second document adresseses life and the'},
{ 'answer': 'life and',
'context': 'the second document adresseses life and'},
{'answer': 'the stars', 'context': 'one talks about the stars and'},
{'answer': 'the stars', 'context': 'about the stars and the fifth'},
{'answer': 'the', 'context': 'the third one talks about the'}]
Is this the expected output? To me, the answers look redundant.