Handle vector-image-converted text in PDFs
Requested feature
Our users occasionally encounter documents in which the text is not encoded as text at all, but drawn as vector paths that merely look like text. Because these pages contain vector content rather than raster images, we do not automatically OCR them, and because there is no actual programmatic text on such PDF pages, the conversion output is usually empty.
Alternatives
As a solution, we need to reliably detect such cases, render the affected pages into raster images, and then run them through the standard OCR pipeline. The layout model output can help with detection: if the layout model predicts text blocks that have no programmatic text associated with them, this is a good indication that the text is drawn as vector paths.
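As a rough illustration of that heuristic, here is a minimal sketch (not the actual Docling implementation) that uses pypdfium2 to flag pages with no extractable characters; the helper name and the character-count threshold are illustrative assumptions:

```python
import pypdfium2 as pdfium


def pages_needing_ocr(pdf_path, min_chars=1):
    """Return indices of pages with (almost) no programmatic text."""
    pdf = pdfium.PdfDocument(pdf_path)
    candidates = []
    for i in range(len(pdf)):
        textpage = pdf[i].get_textpage()
        if textpage.count_chars() < min_chars:
            # No extractable characters: the text may be drawn as vector
            # paths, so this page is a candidate for rasterization + OCR.
            candidates.append(i)
    return candidates
```

In practice this check would be combined with the layout model output, so that only pages where text blocks are predicted but no programmatic text is found get rasterized and sent through OCR.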
Hello, I believe I am facing this exact challenge. I have a PDF file with some titles in some sort of "image" format that does not seem to be captured by the OCR and is therefore not present in the markdown text output. I'll try to find the file and post it here together with the extraction output.
Hi @maxmnemonic
I am facing a similar issue, but the behavior differs between the markdown export and the text export when converting the following document.
The markdown output repeats letters within words, making them wrong; screenshot attached below.
The text output does not contain the text at all; screenshot attached below.
The code used for conversion is as follows:

```python
import json
import time
from pathlib import Path

from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption


def pdf_converter(source):
    # PyPdfium backend with EasyOCR
    pipeline_options = PdfPipelineOptions()
    pipeline_options.generate_page_images = True
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True
    pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend
            )
        }
    )

    start_time = time.time()
    conv_result = doc_converter.convert(source)
    end_time = time.time() - start_time
    # _log.info(f"Document converted in {end_time:.2f} seconds.")

    # Export results
    output_dir = Path("scratch")
    output_dir.mkdir(parents=True, exist_ok=True)
    doc_filename = conv_result.input.file.stem
    # print("body", conv_result.document.body)

    # Export Deep Search document JSON format:
    with (output_dir / f"{doc_filename}.json").open("w", encoding="utf-8") as fp:
        fp.write(json.dumps(conv_result.document.export_to_dict()))

    # Export Text format:
    with (output_dir / f"{doc_filename}.txt").open("w", encoding="utf-8") as fp:
        fp.write(conv_result.document.export_to_text())

    # Export Markdown format:
    with (output_dir / f"{doc_filename}.md").open("w", encoding="utf-8") as fp:
        fp.write(conv_result.document.export_to_markdown())

    # Export Document Tags format:
    with (output_dir / f"{doc_filename}.doctags").open("w", encoding="utf-8") as fp:
        fp.write(conv_result.document.export_to_document_tokens())
```
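For completeness, a hypothetical invocation of the helper above; the file name is a placeholder, and the exports land in the `scratch` directory:

```python
# Hypothetical usage; "sample.pdf" is a placeholder file name.
if __name__ == "__main__":
    pdf_converter("sample.pdf")
```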
Could you please explain what might be causing the inconsistent results across the different export formats?
@deborah-drongoai, I believe this could be a different issue, but we can look into it. Are you converting it from a PDF? Any chance you could share the file with us?
Quick update: the issue of skipping over text converted to vector images described here can also be handled with forced full-page OCR, which is being prepared in this PR: feat(OCR): Introduce the OcrOptions.force_full_page_ocr parameter that forces a full page OCR scanning #290
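Once that PR is merged, enabling forced full-page OCR should look roughly like the sketch below; this is based on the parameter name from the PR title, and the exact option class and defaults may differ:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
# Force OCR over the full rasterized page, even where programmatic text exists.
pipeline_options.ocr_options = EasyOcrOptions(force_full_page_ocr=True)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert("sample.pdf")  # "sample.pdf" is a placeholder
```

This forces OCR over the whole rasterized page even where programmatic text is present, which works around vector-drawn text at the cost of slower conversion.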
@maxmnemonic Thank you for your quick response. Sure, I will attach the PDF file that I used with the module: rajesh_1026319_20240917055115_stationerypdf_oh_merged.pdf. It would be of great help if I could get some insights into this issue. Thanks a lot.