Exception while running node 'Reader': pop index out of range + solution
When I split documents with the PreProcessor (currently a split length of 256 with an overlap of 160) and pass the resulting splits to the reader, I sometimes get the error `Exception: Exception while running node 'Reader': pop index out of range`. Looking into the stack trace, this happens in the `deduplicate_predictions` function in the reader/farm.py file when it finds the same answer twice due to the overlap.
After deeper digging I found the problem more precisely:
Here is your deduplicate_predictions function in farm.py:
I added 4 print statements, which print the following when the error occurs: 4 3 1 4 3 2 4 3 3 4 3 3 4 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3
Basically: if an answer has more than one relevant overlap, the function will try to pop the prediction at ans_idx more than once, which leads to the index-out-of-range error.
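To illustrate the failure mode described above, here is a minimal sketch (hypothetical data, not Haystack's actual predictions): after the first pop the list shrinks, so popping the same index a second time can fall out of range.

```python
# Hypothetical, simplified illustration of the double-pop failure.
predictions = ["pred_a", "pred_b", "pred_c"]
ans_idx = 2

predictions.pop(ans_idx)       # fine: the list had 3 items
try:
    predictions.pop(ans_idx)   # list now has 2 items, so index 2 is invalid
except IndexError as err:
    print(err)                 # "pop index out of range"
```
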
UPDATE:
Just adding the double break makes it work. I'm not sure this is how it is supposed to be, but at least there are no errors for me.
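For clarity, the "double break" workaround can be sketched like this (illustrative control flow with hypothetical names, not the actual farm.py code): a flag set in the inner loop triggers a second break in the outer loop.

```python
def deduplicate_sketch(predictions, overlap_indices):
    # overlap_indices maps a prediction's position to the indices of its
    # overlapping duplicates (a hypothetical structure, for illustration).
    for pos in range(len(predictions)):
        breaked = False
        for ans_idx in overlap_indices.get(pos, []):
            if ans_idx < len(predictions):
                predictions.pop(ans_idx)
                breaked = True
                break   # first break: leave the inner loop after one pop
        if breaked:
            break       # second break: leave the outer loop too
    return predictions

print(deduplicate_sketch(["a", "b", "c"], {0: [1, 2]}))  # -> ['a', 'c']
```
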
System:
- OS: using Google Colab
- GPU/CPU: T4
- Haystack version (commit or version number): latest
- DocumentStore: ElasticSearch
- Reader: deepset squad2
- Retriever: EmbeddingRetriever
@Koenlaermans Thank you for reporting this issue and investigating it! Your reasoning about how the reader's attempt to pop the prediction at ans_idx more than once leads to the error makes sense to me.
Regarding how to fix it: I think a better way to break out of the nested loop is to refactor this part of the code and refrain from using the if breaked: break.
To this end, we should refactor the code by creating a new function that contains the outer loop and everything in it. The line `breaked = True` can then become `return`; this way, we break out of the outer loop with the return statement. I would like to invite you to create a PR with this fix to contribute to Haystack! 🙂 A good starting point is our contributor guidelines here. If you don't have time to do it, just let us know.
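The suggested refactor can be sketched as follows (hypothetical, simplified names, not the real farm.py code): the nested loops move into a helper function, so a plain `return` exits both loops at once and replaces the `breaked = True` / `if breaked: break` pattern.

```python
def _deduplicate_once(predictions, overlap_indices):
    # The outer loop lives in its own function now: returning here exits
    # both loops in one step, so no `breaked` flag is needed.
    for pos in range(len(predictions)):
        for ans_idx in overlap_indices.get(pos, []):
            if ans_idx < len(predictions):
                predictions.pop(ans_idx)
                return predictions   # was: breaked = True; break; ...; break
    return predictions
```
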
Sorry, but I currently don't have time to figure out how to do PRs and such; I have a thesis deadline in 1.5 weeks. Thanks for the feedback and for helping, though!
Alright, good luck with your thesis! 🎓
I have written a minimal example to reproduce the problem.
from haystack import Document
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import FARMReader, BM25Retriever, PreProcessor
from haystack.pipelines import ExtractiveQAPipeline
from haystack.utils import print_answers

document_store = InMemoryDocumentStore(use_bm25=True)

preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=6,
    split_overlap=5,
    split_respect_sentence_boundary=False,
)

documents = [
    Document(content="the second document adresseses life and the third onw talks about the stars.", meta={"name": "doc1"}),
    Document(content="the second document adresseses life and the third one talks about the stars and the fifth is unseen.", meta={"name": "doc2"}),
]

processed_docs = preprocessor.process(documents)
document_store.write_documents(processed_docs)

# Initialize the reader using a pre-trained model
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)

# Initialize the retriever with the document store
retriever = BM25Retriever(document_store=document_store)

# Create the retriever-reader pipeline
pipe = ExtractiveQAPipeline(reader, retriever)

# Ask a question
prediction = pipe.run(
    query="What is the second document about?",
    params={
        "Retriever": {"top_k": 10},
        "Reader": {"top_k": 5},
    },
)

print_answers(
    prediction,
    details="minimum",  # Choose from `minimum`, `medium`, and `all`
)
When implementing the solution that Koenlaermans proposed, this is the output I get:
'Query: What is the second document about?'
'Answers:'
[ { 'answer': 'life and the',
'context': 'second document adresseses life and the'},
{ 'answer': 'life and',
'context': 'the second document adresseses life and'},
{'answer': 'the stars', 'context': 'one talks about the stars and'},
{'answer': 'the stars', 'context': 'about the stars and the fifth'},
{'answer': 'the', 'context': 'the third one talks about the'}]
Is this the expected output? To me, the answers look redundant.