feat: Add `failed_sources` to the output of converters

Open medsriha opened this issue 2 months ago • 5 comments

Is your feature request related to a problem? Please describe.

It would be helpful if we could access the list of failed files so we can send them to another converter, such as OCR or similar. Ideally, this new feature would work for both PyPDFToDocument and PDFMinerToDocument.

Describe the solution you'd like

Basically, when there is an exception, the failed files would be appended to a list, something like this:

  try:
      pdf_reader = PdfReader(io.BytesIO(bytestream.data))
      text = self._default_convert(pdf_reader)
  except Exception as e:
      logger.warning(
          "Could not read {source} and convert it to Document, skipping. {error}", source=source, error=e
      )
      failed_files.append(source)  # return this list along with `documents`
      continue
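
To make the proposal concrete outside the converter class, here is a minimal, self-contained sketch of the same pattern. It is not the Haystack implementation: `convert_collecting_failures` and `fake_pdf_to_text` are hypothetical names used only for illustration, and documents are plain strings rather than Haystack `Document` objects.

```python
import logging
from typing import Callable, List, Tuple

logger = logging.getLogger(__name__)


def convert_collecting_failures(
    sources: List[str],
    convert_one: Callable[[str], str],
) -> Tuple[List[str], List[str]]:
    """Convert each source and return (documents, failed_sources)."""
    documents: List[str] = []
    failed_sources: List[str] = []
    for source in sources:
        try:
            documents.append(convert_one(source))
        except Exception as error:  # broad on purpose, mirroring the snippet above
            logger.warning("Could not convert %s, skipping. %s", source, error)
            failed_sources.append(source)  # remember the source instead of dropping it
    return documents, failed_sources


def fake_pdf_to_text(source: str) -> str:
    """Hypothetical stand-in for a PDF parser; fails on names containing 'bad'."""
    if "bad" in source:
        raise ValueError("unreadable PDF")
    return f"text of {source}"


documents, failed_sources = convert_collecting_failures(
    ["a.pdf", "bad.pdf", "c.pdf"], fake_pdf_to_text
)
```

The caller gets both lists back and can pass `failed_sources` to a different converter.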

medsriha, Oct 03 '25 14:10

Hey @medsriha, could you provide some examples of the errors you are seeing? If the errors come from corrupted files, then I don't think this addition would necessarily help you.

sjrl, Oct 06 '25 07:10

@sjrl, probably erroneous files were a bad example, but it could still be a good idea to direct them to a different path for troubleshooting. The reason for this ticket is that we noticed PyPDFToDocument was able to extract some text from a PDF, while PDFMinerToDocument was not. When this happens, it would be helpful to route the PDF to another converter, for example. This could also apply to empty PDF content, which might be scanned documents that need to be routed to OCR.
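
A rough sketch of that routing idea, in plain Python rather than Haystack components: try converters in order and keep the first one that returns non-empty text. All names here (`convert_with_fallback`, `primary`, `ocr`) are hypothetical stand-ins, not real converter APIs.

```python
from typing import Callable, Optional, Sequence, Tuple


def convert_with_fallback(
    source: str,
    converters: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (converter_name, text) from the first converter that yields non-empty text."""
    for name, convert in converters:
        try:
            text = convert(source)
        except Exception:
            continue  # this converter failed outright; try the next one
        if text.strip():
            return name, text
    return None, None  # every converter failed or returned empty text


def primary(source: str) -> str:
    """Hypothetical first-choice parser that extracts nothing from scanned files."""
    return "" if source.startswith("scan") else f"parsed {source}"


def ocr(source: str) -> str:
    """Hypothetical OCR fallback."""
    return f"ocr text of {source}"


chain = [("primary", primary), ("ocr", ocr)]
result_text = convert_with_fallback("report.pdf", chain)
result_scan = convert_with_fallback("scan-001.pdf", chain)
```

This treats an exception and an empty result the same way, which is the behavior I am after for scanned PDFs.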

medsriha, Oct 06 '25 16:10

> When this happens, it would be helpful to route the PDF to another converter, for example. This could also apply to empty PDF content, which might be scanned documents that need to be routed to OCR.

Just as a heads-up, PDFs with empty content are still returned normally, so adding this additional output wouldn't help identify such cases. You can use our relatively new DocumentLengthRouter to route documents under a certain length to a different branch, for example one that runs OCR.
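
To illustrate the idea of length-based routing without depending on Haystack, here is a small sketch. It does not call DocumentLengthRouter; `Doc` and `route_by_length` are hypothetical names, and the exact threshold semantics of the real component may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical minimal document type, not Haystack's Document class."""
    source: str
    content: Optional[str]


def route_by_length(docs: List[Doc], threshold: int) -> Dict[str, List[Doc]]:
    """Split docs into 'short' (content missing or <= threshold chars) and 'long'."""
    routed: Dict[str, List[Doc]] = {"short": [], "long": []}
    for doc in docs:
        length = len(doc.content) if doc.content else 0
        routed["short" if length <= threshold else "long"].append(doc)
    return routed


routed = route_by_length(
    [Doc("scan.pdf", ""), Doc("report.pdf", "Quarterly results " * 10)],
    threshold=20,
)
```

Documents that land in the `short` branch are the candidates for OCR, since an empty or near-empty result often means a scanned PDF.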

> but it could still be a good idea to direct them to a different path for troubleshooting.

Yeah, that does sound like it could help in some specific scenarios. Could you provide some examples of the files that were causing problems?

sjrl, Oct 07 '25 06:10

@sjrl I'll give DocumentLengthRouter a try. Just curious, what's the reason for not enabling a second list of empty/failed files in the converters? Also, I will share these files internally.

medsriha, Oct 07 '25 13:10

> @sjrl I'll give DocumentLengthRouter a try. Just curious, what's the reason for not enabling a second list of empty/failed files in the converters? Also, I will share these files internally.

I could see an output of failed files being useful; I just want to better understand what the failure cases are. If we can fix the failure cases themselves, I'd also like to tackle that.

sjrl, Oct 07 '25 14:10