Arabic OCR is not working
I used the code below to parse an Arabic document:
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
options = TesseractOcrOptions()
options.lang = ['eng', 'ara']
pipeline_options.ocr_options = options
doc_converter = DocumentConverter(
    allowed_formats=[
        InputFormat.PDF,
    ],
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=PyPdfiumDocumentBackend,
        ),
    },
)
Here are the results, which are completely off:
@Alla-Abdella Can you please re-check this after adding options.force_full_page_ocr = True? We need to be sure it is actually using OCR and not preferring content encoded in the PDF.
@Alla-Abdella Have you installed the Tesseract language packs? https://tesseract-ocr.github.io/tessdoc/Installation.html
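One way to check this from Python is to ask the `tesseract` binary which languages it can see. The following is a minimal sketch, assuming the Tesseract CLI is on `PATH`; the helper names `parse_list_langs`, `installed_langs`, and `missing_langs` are made up for illustration and are not part of docling or Tesseract.

```python
import shutil
import subprocess


def parse_list_langs(text):
    """Extract language codes from the output of `tesseract --list-langs`."""
    langs = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.append(line)
    return langs


def installed_langs():
    """Return the languages reported by the local tesseract binary, or [] if it is not found."""
    exe = shutil.which("tesseract")
    if exe is None:
        return []
    result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    # Some Tesseract versions print the list to stderr, so read both streams.
    # Any warning lines on stderr end up in the list too, which is harmless
    # for the membership check below.
    return parse_list_langs(result.stdout + "\n" + result.stderr)


def missing_langs(required, available):
    """Return the required codes that are absent from `available`, in order."""
    return [code for code in required if code not in available]


if __name__ == "__main__":
    # An empty list means both language packs were found.
    print(missing_langs(["eng", "ara"], installed_langs()))
```

If `ara` shows up as missing, installing the Arabic trained data as described on the installation page linked above would be the first thing to try.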
@cau-git ValueError: "TesseractOcrOptions" object has no field "force_full_page_ocr"
@allaabdella2 do you still have this issue? If yes, please provide some input document(s) to reproduce this behavior.
Also, TesseractOcrOptions does have the force_full_page_ocr field, as it is inherited from the base class OcrOptions.
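The ValueError quoted earlier is the kind of error pydantic raises when assigning to an undeclared field, so it may indicate that the installed docling version is older than the one that introduced the field. A rough sketch for checking this locally follows; the helper `declares_field` is hypothetical, and the import path is assumed from docling 2.x and may differ in other versions.

```python
def declares_field(model_cls, name):
    """Return True if a pydantic model class declares a field called `name`."""
    # pydantic v2 exposes `model_fields`; pydantic v1 exposes `__fields__`.
    fields = getattr(model_cls, "model_fields", None)
    if fields is None:
        fields = getattr(model_cls, "__fields__", {})
    return name in fields


try:
    # Assumed import path (docling 2.x); adjust if your version differs.
    from docling.datamodel.pipeline_options import TesseractOcrOptions
except ImportError:
    TesseractOcrOptions = None

if TesseractOcrOptions is not None:
    if declares_field(TesseractOcrOptions, "force_full_page_ocr"):
        # Passing the flag at construction time avoids the attribute assignment.
        options = TesseractOcrOptions(lang=["eng", "ara"], force_full_page_ocr=True)
        print("force_full_page_ocr is set to:", options.force_full_page_ocr)
    else:
        print("force_full_page_ocr is not declared by the installed docling version.")
```

If the second branch is hit, upgrading docling and retrying is probably the simplest next step.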
options.lang = ['eng', 'ara']
2
options.lang = ['en', 'ar']
https://github.com/werruww/succ-docling-/blob/main/suc_docling%20(2).ipynb
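The two spellings above differ because OCR engines use different language identifiers: Tesseract names its trained data with three-letter codes such as `eng` and `ara`, while EasyOCR uses two-letter codes such as `en` and `ar`. Below is a small sketch for keeping them apart, assuming the second variant is meant for an EasyOCR-based setup (the thread does not say so explicitly). The table `LANG_CODES` and the helper `lang_codes` are hypothetical and not part of docling.

```python
# Hypothetical lookup table covering only the two languages in this thread.
LANG_CODES = {
    "english": {"tesseract": "eng", "easyocr": "en"},
    "arabic": {"tesseract": "ara", "easyocr": "ar"},
}


def lang_codes(languages, engine):
    """Translate language names into the codes expected by the given OCR engine."""
    try:
        return [LANG_CODES[name][engine] for name in languages]
    except KeyError as exc:
        # Raised for a language or engine that is not in the table above.
        raise ValueError(f"unknown language or engine: {exc}") from exc


if __name__ == "__main__":
    print(lang_codes(["english", "arabic"], "tesseract"))
    print(lang_codes(["english", "arabic"], "easyocr"))
```

The resulting list is what would go into `options.lang` for the matching options class; mixing the two conventions is an easy way to end up with a language the engine cannot load.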
This should be working as suggested by @werruww. Please reopen if you still see issues.