Text in PDF Recognized as Image Instead of Text During Parsing

Open kurekj opened this issue 1 year ago • 2 comments

Question

When parsing CVs with Docling on Ubuntu with Python 3.11, some regions of the PDF that contain text are incorrectly treated as images instead of being recognized as text. This happens even with OCR enabled, and after trying different OCR engines and settings.

Environment:
Docling version: 2.10.0
Docling-Parse version: 3.0.0
Docling-Core version: 2.9.0
Operating System: Ubuntu
Python version: 3.11

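Relevant Code:

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

IMAGE_RESOLUTION_SCALE = 10.0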

pipeline_options = PdfPipelineOptions()
#pipeline_options = PdfPipelineOptions(backend=DoclingParseV2DocumentBackend)
#pipeline_options = PdfPipelineOptions(backend=DoclingParseV2PageBackend)

pipeline_options.do_ocr = True
#pipeline_options.do_table_structure = True
#pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE  # use more accurate TableFormer model
#pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
pipeline_options.generate_page_images = True
pipeline_options.generate_picture_images = True
#pipeline_options.ocr_options.bitmap_area_threshold=0.05

# Any of the OCR options can be used: EasyOcrOptions, TesseractOcrOptions, TesseractCliOcrOptions, OcrMacOptions (Mac only), RapidOcrOptions
#ocr_options = EasyOcrOptions(force_full_page_ocr=True)
# ocr_options = TesseractOcrOptions(force_full_page_ocr=True)
# ocr_options = OcrMacOptions(force_full_page_ocr=True)
#ocr_options = RapidOcrOptions(force_full_page_ocr=True)
#ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)
#pipeline_options.ocr_options = ocr_options

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
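
One way to narrow this down might be to count which item types the converter actually produced, so that misclassified regions show up as an unexpectedly high number of picture items. The sketch below is only an assumption about how such a check could look: `summarize_items` is a hypothetical helper (not part of Docling), and it assumes that `document.iterate_items()` yields `(item, level)` pairs, as it does in Docling 2.x.

```python
from collections import Counter
from typing import Any, Iterable, Tuple


def summarize_items(items: Iterable[Tuple[Any, int]]) -> Counter:
    """Count document items by class name.

    `items` is expected to be an iterable of (item, level) pairs, which is
    the shape assumed for `result.document.iterate_items()` in Docling 2.x.
    This helper is hypothetical and only looks at the class name of each item.
    """
    return Counter(type(item).__name__ for item, _level in items)
```

With the converter above, this could be called as `summarize_items(doc_converter.convert("cv.pdf").document.iterate_items())`, where `cv.pdf` is a placeholder path. A high `PictureItem` count on a page that is visually all text may suggest that the regions are being classified as pictures during layout analysis, before OCR settings come into play.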

kurekj, Dec 11 '24 12:12

@kurekj Could you please check this again with docling 2.14.0 and report whether you still see it? A lot has changed in the layout processing since then.

cau-git, Dec 18 '24 10:12

@cau-git Unfortunately, still the same :(

Docling version: 2.14.0
Docling Core version: 2.12.1
Docling IBM Models version: 3.1.0
Docling Parse version: 3.0.0

kurekj, Dec 18 '24 12:12