haystack
`ExtractedAnswer` missing `page_number` meta
Describe the bug
When using extractive QA, the context and context_offset of the answer are not set. That is something we can work around by using the document content directly, as discussed below.
However, the page number meta is still not set, as specified in the comments below.
Error message none
Expected behavior
~I expect context and context_offset to be set.~ I expect the page number to be added to the answer meta.
Additional context We need this in deepset Cloud to migrate to Haystack 2
To Reproduce
- Run extractive answer pipeline from tutorial
- Answer doesn't include context/context_offset and also not the page number meta
FAQ Check
- [x] Have you had a look at our new FAQ page?
System:
- OS:
- GPU/CPU:
- Haystack version (commit or version number): 2.1
- DocumentStore:
- Reader:
- Retriever:
meta also shouldn't be empty, right?
Hey @wochinge thanks for raising this. As a heads up, the document provided in the answer contains the needed text. But you are right: I don't think a context is provided any longer, and for now the assumption was to just use the full text of the retrieved document.
Is it hard to get the context back? I think that would save us from having to send the entire document over for the preview (which would be a workaround anyway). Worst case we cut out the context from the document ourselves 🤔
There is the document and the document offset. I would indeed suggest to cut out the context and set the context offset then. Can be a separate component that gets a list of extracted answers and creates the context based on the desired length of the context.
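A minimal sketch of that idea, assuming the answer carries character offsets into the document content. The function name `cut_context` and its parameters are hypothetical and not part of Haystack; a real component would wrap something like this and loop over the list of extracted answers.

```python
from typing import Tuple


def cut_context(
    content: str, answer_start: int, answer_end: int, context_length: int = 100
) -> Tuple[str, int, int]:
    """Cut a window of roughly `context_length` characters around an answer.

    Hypothetical helper, not a Haystack API. Returns the context string,
    the offset of the context within the document, and the offset of the
    answer within the context.
    """
    answer_length = answer_end - answer_start
    # Never cut into the answer itself, even if it is longer than the window.
    window = max(context_length, answer_length)
    # Spread the remaining characters evenly on both sides of the answer.
    left_padding = (window - answer_length) // 2
    context_start = max(answer_start - left_padding, 0)
    context_end = min(context_start + window, len(content))
    # If the window hit the end of the document, shift it back to the left.
    context_start = max(context_end - window, 0)
    context = content[context_start:context_end]
    return context, context_start, answer_start - context_start


content = "a" * 50 + "ANSWER" + "b" * 50
context, context_start, offset_in_context = cut_context(content, 50, 56, context_length=20)
```

With this, only the short context and its offsets would need to be sent for the preview instead of the whole document.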
@julian-risch @sjrl And what about the meta field? That one should come from the retrieved document, no?
I see here that the 1.x reader added the document page number to the meta field of the Answer based on the document offset. We could add the same functionality to the 2.x reader if that helps with the migration. What do you think? https://github.com/deepset-ai/haystack/blob/6d320f6929713d4b7664c5f3ae97ddd5e4b60bf0/haystack/nodes/reader/farm.py#L960
I would suggest that we basically copy over _add_answer_page_number https://github.com/deepset-ai/haystack/blob/6d320f6929713d4b7664c5f3ae97ddd5e4b60bf0/haystack/nodes/reader/farm.py#L952C5-L971C22 and put it here: https://github.com/deepset-ai/haystack/blob/b12e0db134277b3bf6b22471433dff385488decd/haystack/components/readers/extractive.py#L346
That would be great 🙌🏻
Other than the meta info that @julian-risch suggested, I'd recommend getting everything else that is needed, e.g. document_id, from the document.meta object directly.
Also @wochinge and @julian-risch, as a heads up, we will also need to resolve https://github.com/deepset-ai/haystack/issues/6705 to be able to add the page_number to the answer.
Additionally, page number counting in general relies on page breaks (\f) being present in the Haystack Documents. In addition to the above issue about the DocumentSplitter, we would also need to make sure our File Converters (e.g. the PyPDF converter) add/preserve the page break information. For example, see this PR https://github.com/deepset-ai/haystack/pull/6755 where I added the page break information to the PyPDFToDocument converter.
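To illustrate why the page breaks matter, here is a rough sketch of the counting step, assuming the answer's start offset refers to the document content and that converters keep `\f` between pages. The function name `answer_page_number` and the `first_page` parameter are hypothetical; `first_page` stands in for whatever page number the document (or split) itself starts on.

```python
def answer_page_number(content: str, answer_start: int, first_page: int = 1) -> int:
    """Estimate the page an answer starts on by counting form feeds before it.

    Hypothetical helper, not a Haystack API. `first_page` is the page on
    which `content` begins, e.g. taken from the document meta of a split.
    """
    # Every "\f" before the answer start means the answer is one page further on.
    return first_page + content[:answer_start].count("\f")


content = "page one text\fpage two text\fpage three with the answer"
start = content.index("answer")
page = answer_page_number(content, start)
page_in_split = answer_page_number(content, start, first_page=5)
```

If a converter drops the `\f` characters, the count is always zero and every answer would be reported on the first page of its document, which is why the converters and the splitter need to preserve them.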