feat: Add `failed_sources` to the output of converters
Is your feature request related to a problem? Please describe.
It would be helpful if we could access the list of failed files so we can send them to another converter, such as OCR or similar. Ideally, this new feature would work for both PyPDFToDocument and PDFMinerToDocument.
Describe the solution you'd like
Basically, when an exception occurs, the failed file would be appended to a list, something like this:
    try:
        pdf_reader = PdfReader(io.BytesIO(bytestream.data))
        text = self._default_convert(pdf_reader)
    except Exception as e:
        logger.warning(
            "Could not read {source} and convert it to Document, skipping. {error}", source=source, error=e
        )
        failed_files.append(source)  # return this list along with `documents`
        continue
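To make the proposal concrete, here is a minimal self-contained sketch of the loop above returning both lists. It is an illustration only, not the actual converter code: `convert_sources`, `toy_parse`, and the `failed_sources` output key are hypothetical names, and the parser is a toy stand-in for the real PDF parsing step.

```python
import io
import logging
from typing import Callable, Dict, List

logger = logging.getLogger(__name__)


def convert_sources(
    sources: Dict[str, bytes],
    parse: Callable[[io.BytesIO], str],
) -> Dict[str, List[str]]:
    """Convert each source to text and collect the sources that raised.

    `parse` is a hypothetical stand-in for the PDF parsing step.
    """
    documents: List[str] = []
    failed_sources: List[str] = []
    for source, data in sources.items():
        try:
            text = parse(io.BytesIO(data))
        except Exception as e:
            logger.warning(
                "Could not read %s and convert it to Document, skipping. %s", source, e
            )
            failed_sources.append(source)  # returned alongside `documents`
            continue
        documents.append(text)
    return {"documents": documents, "failed_sources": failed_sources}


def toy_parse(stream: io.BytesIO) -> str:
    # Toy parser for illustration only: rejects anything without a PDF header.
    data = stream.read()
    if not data.startswith(b"%PDF"):
        raise ValueError("not a PDF")
    return data.decode("latin-1")


result = convert_sources(
    {"good.pdf": b"%PDF-1.7 hello", "bad.pdf": b"garbage"}, toy_parse
)
```

With an output shaped like this, the entries in `result["failed_sources"]` could be passed to a second converter (OCR, for example) while `result["documents"]` continues down the normal path.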
Hey @medsriha, could you provide some examples of the errors you are seeing? If the errors come from corrupted files, then I don't think this addition would necessarily help you.
@sjrl, probably erroneous files were a bad example, but it could still be a good idea to direct them to a different path for troubleshooting. The reason for this ticket is that we noticed PyPDFToDocument was able to extract some text from a PDF, while PDFMinerToDocument was not. When this happens, it would be helpful to route the PDF to another converter, for example. This could also apply to empty PDF content, which might be scanned documents that need to be routed to OCR.
> When this happens, it would be helpful to route the PDF to another converter, for example. This could also apply to empty PDF content, which might be scanned documents that need to be routed to OCR.
Just as a heads-up, PDFs with empty content are still returned normally, so adding this additional route wouldn't help identify such cases. You can use our relatively new DocumentLengthRouter to route documents under a certain length to a different branch, such as one for OCR.
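As a rough illustration of length-based routing, here is a plain-Python sketch of the idea. It does not use the DocumentLengthRouter API: `Doc`, `split_by_length`, the `threshold` default, and the output keys `needs_ocr` and `has_text` are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Doc:
    # Hypothetical stand-in for a Document; only the text content matters here.
    content: Optional[str]
    source: str


def split_by_length(docs: List[Doc], threshold: int = 10) -> Dict[str, List[Doc]]:
    """Send documents with little or no text to a separate branch."""
    needs_ocr: List[Doc] = []
    has_text: List[Doc] = []
    for doc in docs:
        # Treat missing content as length 0 and ignore surrounding whitespace.
        length = len(doc.content.strip()) if doc.content else 0
        if length <= threshold:
            needs_ocr.append(doc)
        else:
            has_text.append(doc)
    return {"needs_ocr": needs_ocr, "has_text": has_text}


docs = [
    Doc(content="", source="scanned.pdf"),
    Doc(content=None, source="empty.pdf"),
    Doc(content="A page with plenty of extractable text.", source="report.pdf"),
]
routed = split_by_length(docs, threshold=10)
```

Here the scanned and empty PDFs land in the `needs_ocr` branch even though the converter returned them without raising, which is the case a list of failed files alone would miss.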
> but it could still be a good idea to direct them to a different path for troubleshooting.
Yeah, that does sound like it could help in some specific scenarios. Could you provide some examples of the files that were causing problems?
@sjrl I'll give DocumentLengthRouter a try. Just curious, what's the reason for not enabling a second list of empty/failed files in the converters? Also, I will share these files internally.
> @sjrl I'll give DocumentLengthRouter a try. Just curious, what's the reason for not enabling a second list of empty/failed files in the converters? Also, I will share these files internally.
I could see an output of failed files being useful; I just want to better understand what the failure cases are. If we can fix the failure case, I'd also like to tackle that.