`ExtractedAnswer` missing `page_number` meta

Open wochinge opened this issue 1 year ago • 10 comments

Describe the bug

When using extractive QA, the context and context_offset of the answer are not set. We can work around that by using the document content directly, as discussed below.

However, the page number meta is still not set, as specified in the comments below.

Error message: none

Expected behavior: ~I expect context and context_offset to be set.~ I expect the page number to be added to the answer meta.

Additional context: We need this in deepset Cloud to migrate to Haystack 2.

To Reproduce

  1. Run extractive answer pipeline from tutorial
  2. The answer includes neither context / context_offset nor the page number meta

System:

  • OS:
  • GPU/CPU:
  • Haystack version (commit or version number): 2.1
  • DocumentStore:
  • Reader:
  • Retriever:

wochinge • Apr 22 '24 11:04

meta also shouldn't be empty, right?

wochinge • Apr 22 '24 12:04

Hey @wochinge, thanks for raising this. As a heads-up, the document provided in the answer contains the needed text. But you are right: I don't think a context is provided any longer, and for now the assumption was that users would just use the full text of the retrieved document.

sjrl • Apr 22 '24 12:04

Is it hard to get the context back? I think that would save us from sending the entire document over for the preview (which would be a workaround, though). Worst case, we cut the context out of the document ourselves 🤔

wochinge • Apr 22 '24 12:04

There is the document and the document offset. I would indeed suggest cutting out the context and then setting the context offset. This can be a separate component that takes a list of extracted answers and creates the context based on the desired context length.
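
A minimal sketch of that cutting step, in plain Python rather than as a Haystack component. The function name `cut_context` and the `window` parameter are hypothetical; `start` and `end` stand for the answer's character offsets in the document content.

```python
from typing import Tuple


def cut_context(content: str, start: int, end: int, window: int = 50) -> Tuple[str, int, int]:
    """Cut a context window around an answer span.

    Returns the context string plus the answer's start and end offsets
    relative to that context (hypothetical helper, not a Haystack API).
    """
    # Clamp the window to the document boundaries.
    ctx_begin = max(0, start - window)
    ctx_end = min(len(content), end + window)
    context = content[ctx_begin:ctx_end]
    # Shift the document offsets so they index into the context instead.
    return context, start - ctx_begin, end - ctx_begin


content = "a" * 100 + "ANSWER" + "b" * 100
context, ctx_start, ctx_stop = cut_context(content, start=100, end=106, window=10)
print(context[ctx_start:ctx_stop])
```

With this shape, only the short context and two integers would need to be sent for a preview instead of the entire document.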

julian-risch • Apr 22 '24 12:04

@julian-risch @sjrl And what about the meta field? That one should come from the retrieved document, no?

wochinge • Apr 22 '24 12:04

I see here that the 1.x reader added the document page number to the meta field of the Answer based on the document offset. We could add the same functionality to the 2.x reader if that helps with the migration. What do you think? https://github.com/deepset-ai/haystack/blob/6d320f6929713d4b7664c5f3ae97ddd5e4b60bf0/haystack/nodes/reader/farm.py#L960

I would suggest that we basically copy over _add_answer_page_number https://github.com/deepset-ai/haystack/blob/6d320f6929713d4b7664c5f3ae97ddd5e4b60bf0/haystack/nodes/reader/farm.py#L952C5-L971C22 and put it here: https://github.com/deepset-ai/haystack/blob/b12e0db134277b3bf6b22471433dff385488decd/haystack/components/readers/extractive.py#L346
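
As a rough illustration of the idea (not the code from either linked file), the page number can be derived by counting form-feed page breaks before the answer's start offset. The function name `answer_page_number` and the `first_page` parameter are hypothetical; `first_page` stands for the page on which the document itself starts.

```python
def answer_page_number(content: str, answer_start: int, first_page: int = 1) -> int:
    """Return the page on which an answer starts.

    Assumes pages in `content` are separated by form feeds ("\f") and that
    `first_page` is the page number of the document's first character.
    """
    # Every page break before the answer moves it one page further on.
    return first_page + content[:answer_start].count("\f")


content = "page one\fpage two\fpage three with the answer"
start = content.index("answer")
print(answer_page_number(content, start))
```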

julian-risch • Apr 22 '24 12:04

That would be great 🙌🏻

wochinge • Apr 22 '24 12:04

Other than the meta info that @julian-risch suggested, I'd recommend getting everything else that is needed, e.g. document_id, from the document.meta object directly.

sjrl • Apr 22 '24 13:04

Also @wochinge and @julian-risch, as a heads-up: we will also need to resolve https://github.com/deepset-ai/haystack/issues/6705 to be able to add the page_number to the answer.

sjrl • Apr 25 '24 06:04

Additionally, page number counting in general relies on page breaks \f being present in the Haystack Documents. In addition to the above issue about the DocumentSplitter, we would also need to make sure our File Converters (e.g. PyPDF Converter) add/preserve the page break information. For example, see this PR https://github.com/deepset-ai/haystack/pull/6755 where I added the page break information to the PyPDFToDocument converter.
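
To illustrate why both pieces matter, here is a small sketch of a splitter that records the starting page of each chunk, assuming the converter kept the \f page breaks in the text. This is not the DocumentSplitter implementation; `split_with_pages` and its fixed-size character chunking are hypothetical simplifications.

```python
from typing import List, Tuple


def split_with_pages(text: str, chunk_size: int) -> List[Tuple[str, int]]:
    """Split text into fixed-size chunks and pair each with its starting page.

    Only works if the text still contains form-feed ("\f") page breaks;
    without them every chunk would be reported as page 1.
    """
    chunks = []
    page = 1
    for i in range(0, len(text), chunk_size):
        chunk = text[i:i + chunk_size]
        chunks.append((chunk, page))
        # Page breaks inside this chunk push later chunks to later pages.
        page += chunk.count("\f")
    return chunks


for chunk, page in split_with_pages("aaaa\fbbbb\fcccc", chunk_size=5):
    print(page, repr(chunk))
```

If the converter drops the \f characters, the per-chunk starting page is lost and an answer's page number can no longer be recovered from offsets alone.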

sjrl • Apr 25 '24 06:04