Docling having issue processing this font in pdf

Open Arslan-Mehmood1 opened this issue 1 year ago • 5 comments

Bug

PDF - font

image

Markdown Results image

...

Docling version

2.8.0 ...

Python version

... 3.10.12

Arslan-Mehmood1 avatar Nov 28 '24 18:11 Arslan-Mehmood1

@Arslan-Mehmood1 Some PDFs simply have a garbled text layer like this one, and it cannot be repaired from the PDF itself. Two strategies that could help:

  1. Check what you get when using our docling-parse-v2 or our pypdfium PDF backends
  2. Enable forced full-page OCR, so that the whole document is processed with OCR instead of relying on the PDF backend output

cau-git avatar Nov 29 '24 12:11 cau-git
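
A rough way to try the first strategy is sketched below. It assumes the backend classes `DoclingParseV2DocumentBackend` and `PyPdfiumDocumentBackend` and the `backend=` argument of `PdfFormatOption` as they exist in docling 2.x; module paths may differ in other versions. `suspicious_ratio` and `convert_with_backend` are hypothetical helpers written for this illustration, not part of docling, and the heuristic only flags private-use, unassigned, control, and replacement characters, so it will miss garbling that maps to ordinary letters.

```python
import unicodedata


def suspicious_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look like extraction debris.

    Hypothetical heuristic: counts the replacement character plus private-use,
    unassigned, control, and surrogate code points, which often show up when a
    font has no usable Unicode mapping.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(
        1
        for c in chars
        if c == "\ufffd" or unicodedata.category(c) in {"Co", "Cn", "Cc", "Cs"}
    )
    return bad / len(chars)


def convert_with_backend(pdf_path: str, backend_name: str) -> str:
    """Convert a PDF to Markdown using the named PDF backend, with OCR off."""
    # Imported lazily so the heuristic above can be used without docling.
    # Module paths are assumed from docling 2.x.
    from docling.backend.docling_parse_v2_backend import DoclingParseV2DocumentBackend
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    backends = {
        "docling-parse-v2": DoclingParseV2DocumentBackend,
        "pypdfium": PyPdfiumDocumentBackend,
    }
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = False  # look only at the backend's own text layer
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
                backend=backends[backend_name],
            )
        }
    )
    return converter.convert(pdf_path).document.export_to_markdown()
```

Calling `convert_with_backend` on the problem file once per backend name and comparing `suspicious_ratio` of the two results (or simply reading them) shows whether either backend recovers usable text. If both come back garbled, forced full-page OCR is the remaining option.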

@cau-git Thanks man. I'll test and report back here.

Arslan-Mehmood1 avatar Nov 29 '24 12:11 Arslan-Mehmood1

  1. Check what you get when using our docling-parse-v2 or our pypdfium PDF backends

@cau-git Is there a general recommendation on which of the two backends performs better in most cases? Is there any documentation that discusses the differences and tradeoffs between them?

simonschoe avatar Nov 30 '24 14:11 simonschoe

In case anyone needs it, here is the link to the documentation covering the different inference methods for docling: https://ds4sd.github.io/docling/examples/full_page_ocr/

Arslan-Mehmood1 avatar Dec 02 '24 07:12 Arslan-Mehmood1

@cau-git Thanks for the help. I used the following config for docling inference, and the issue was resolved.

from docling.datamodel.pipeline_options import (
    EasyOcrOptions,
    PdfPipelineOptions,
    TableFormerMode,
)

# Set up the pipeline options for PDF conversion
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
# Map the recognized table structure back to the page's text cells
# (False would use the text cells predicted by the table structure model)
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

# Any of the OCR options can be used: EasyOcrOptions, TesseractOcrOptions,
# TesseractCliOcrOptions, OcrMacOptions (Mac only), RapidOcrOptions.
# Import the one you use from docling.datamodel.pipeline_options.
ocr_options = EasyOcrOptions(force_full_page_ocr=True)
# ocr_options = TesseractOcrOptions(force_full_page_ocr=True)
# ocr_options = OcrMacOptions(force_full_page_ocr=True)
# ocr_options = RapidOcrOptions(force_full_page_ocr=True)
# ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)
pipeline_options.ocr_options = ocr_options

Arslan-Mehmood1 avatar Dec 02 '24 08:12 Arslan-Mehmood1