
Arabic OCR is not working

Open allaabdella2 opened this issue 1 year ago • 3 comments

I used the code below to parse an Arabic document:

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True

options = TesseractOcrOptions()
options.lang = ['eng', 'ara']
pipeline_options.ocr_options = options

doc_converter = DocumentConverter(
    allowed_formats=[
        InputFormat.PDF,
    ],
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=PyPdfiumDocumentBackend,
        ),
    },
)

Here are the results, which are completely off:

image

allaabdella2 • Dec 16 '24 02:12

@Alla-Abdella Can you please re-check this after adding options.force_full_page_ocr = True? We need to be sure it is actually using OCR and not preferring content encoded in the PDF.

cau-git • Dec 16 '24 07:12

@Alla-Abdella Have you installed the Tesseract language packs? https://tesseract-ocr.github.io/tessdoc/Installation.html

nikos-livathinos • Dec 16 '24 08:12
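One way to check the language-pack question is to ask the Tesseract CLI which languages it can see. The sketch below is a minimal illustration, not part of docling: the helper names (`parse_list_langs`, `missing_langs`, `installed_tesseract_langs`) are hypothetical, and it assumes `tesseract` is on the PATH and prints one language code per line after a "List of available languages" header when called with `--list-langs`.

```python
import shutil
import subprocess


def parse_list_langs(output: str) -> set[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.add(line)
    return langs


def missing_langs(wanted, available):
    """Return the wanted codes that are not available, in the order given."""
    return [lang for lang in wanted if lang not in available]


def installed_tesseract_langs():
    """Return the set of installed language codes, or None if tesseract is not found."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some builds print the list to stderr, so both streams are parsed.
    # Any warning lines on stderr would be picked up as codes too; this is a rough check.
    return parse_list_langs(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    available = installed_tesseract_langs()
    if available is None:
        print("tesseract executable not found on PATH")
    else:
        print("missing language packs:", missing_langs(["eng", "ara"], available))
```

If `ara` shows up as missing, the Arabic traineddata file is probably not installed, which would explain garbled output regardless of the docling options.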

@cau-git ValueError: "TesseractOcrOptions" object has no field "force_full_page_ocr"

allaabdella2 • Dec 16 '24 17:12

@allaabdella2 do you still have this issue? If yes, please provide some input document(s) to reproduce this behavior.

Also, TesseractOcrOptions does have the force_full_page_ocr field; it is inherited from the base class OcrOptions.

nikos-livathinos • Jan 30 '25 09:01
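For reference, the suggestion about force_full_page_ocr could be applied roughly as below. This is a sketch under assumptions: it presumes an installed docling release in which OcrOptions defines `force_full_page_ocr` (the ValueError reported earlier suggests the version in use at the time may not have), and the wrapper function `build_pipeline_options` is a hypothetical name used only for this illustration.

```python
def build_pipeline_options():
    """Build PDF pipeline options that force Tesseract OCR on every page."""
    # Imported inside the function so this file can be loaded without docling installed.
    from docling.datamodel.pipeline_options import (
        PdfPipelineOptions,
        TesseractOcrOptions,
    )

    ocr_options = TesseractOcrOptions(
        lang=["eng", "ara"],       # Tesseract-style three-letter codes
        force_full_page_ocr=True,  # assumption: field exists in the installed version
    )

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.ocr_options = ocr_options
    return pipeline_options
```

The returned object can be passed as `pipeline_options` to `PdfFormatOption`, as in the original snippet. If constructing `TesseractOcrOptions` raises a validation error on `force_full_page_ocr`, upgrading docling would be the first thing to try.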

options.lang = ['eng', 'ara']

2

options.lang = ['en', 'ar']

werruww • May 11 '25 02:05
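The two spellings in the comment above look like two different naming conventions for the same languages. Tesseract traineddata files use three-letter codes (`eng`, `ara`), while EasyOCR uses two-letter codes (`en`, `ar`). The comment does not say which engine the second form is meant for, so reading it as EasyOCR-style is an assumption. A small lookup, with the hypothetical names `LANG_CODES` and `codes_for`, makes the difference explicit:

```python
# Language codes per OCR engine, for English and Arabic only.
# Assumption: the two-letter form in the thread is intended for an EasyOCR-style engine.
LANG_CODES = {
    "tesseract": {"english": "eng", "arabic": "ara"},
    "easyocr": {"english": "en", "arabic": "ar"},
}


def codes_for(engine, languages):
    """Translate language names into the codes the given engine expects."""
    table = LANG_CODES[engine]
    return [table[name] for name in languages]


if __name__ == "__main__":
    print(codes_for("tesseract", ["english", "arabic"]))
    print(codes_for("easyocr", ["english", "arabic"]))
```

Passing codes in the wrong convention for the selected engine is a likely source of errors or silently poor output, so it is worth matching `options.lang` to the options class actually in use.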

https://github.com/werruww/succ-docling-/blob/main/suc_docling%20(2).ipynb

werruww • May 11 '25 02:05

This should be working as suggested by @werruww. Please reopen if you still see issues.

cau-git • May 21 '25 14:05