docling: Standalone EasyOCR gives much better results than EasyOCR inside docling [tested with Vietnamese]

Bug

It's weird: when I use EasyOCR on Hugging Face or on the demo site, the result is much better than with docling. I don't understand what is happening. I am trying to debug the code, but I would also like to hear from the developers.

Reproduce

Here is my code

from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.ocr_options.use_gpu = False
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.ocr_options.lang = ["vi"]

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend
        ),
        InputFormat.IMAGE: PdfFormatOption(
            pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend
        )
    }
)

test2 = doc_converter.convert("/home/jonas/Pictures/Screenshots/1.png")
print(test2.document.export_to_dict())

I/O:

Input
Output in docling: https://justpaste.it/gmk92

Sample of the extracted text (inaccurate): 'orig': 'BỌ LAO ĐỘNG THƯƠNG BINH VÀ XÃ HỌI', 'text': 'BỌ LAO ĐỘNG THƯƠNG BINH VÀ XÃ HỌI', 'level': 1},

EasyOCR demo: https://www.jaided.ai/easyocr The text it returns is more accurate.

Docling version

2.7.0

Python version

3.10

EasyOCR version

1.7.2

Nov 25 '24 11:11 jonaskahn

@jonaskahn could you please try the input image again with the latest version of Docling and highlight where you think there are discrepancies between the output of Docling and the output of EasyOCR.

Dec 11 '24 14:12 nikos-livathinos

I tried with the latest version, and the result is still the same.
LEFT SIDE: DOCLING (less correct)
RIGHT SIDE: EASYOCR (more correct)

Dec 16 '24 11:12 jonaskahn

@jonaskahn I re-checked this, and I can see that many of the text cells predicted by EasyOCR come out with very low confidence. Can you please provide a minimal code snippet that runs the image through EasyOCR natively?
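
A native EasyOCR run of the kind requested here could be sketched roughly as follows. This is only an illustration, not code from the thread: it assumes the `easyocr` package is installed and uses its `Reader` and `readtext` calls, which return `(bounding box, text, confidence)` entries by default. The helper names `run_easyocr` and `low_confidence` and the 0.5 threshold are hypothetical choices made for this example.

```python
from typing import List, Sequence, Tuple

# One EasyOCR detection: (bounding box, recognized text, confidence in [0, 1]).
Detection = Tuple[object, str, float]


def run_easyocr(image_path: str, langs: Sequence[str] = ("vi",)) -> List[Detection]:
    """Run EasyOCR directly on an image, without going through docling."""
    import easyocr  # imported lazily so low_confidence() works without it

    # gpu=False mirrors the CPU-only setting used in the docling configuration.
    reader = easyocr.Reader(list(langs), gpu=False)
    return reader.readtext(image_path)


def low_confidence(
    results: Sequence[Detection], threshold: float = 0.5
) -> List[Tuple[str, float]]:
    """Return (text, confidence) pairs whose confidence is below the threshold.

    Hypothetical helper: the threshold is an arbitrary value for illustration.
    """
    return [(text, conf) for _bbox, text, conf in results if conf < threshold]
```

Calling `low_confidence(run_easyocr("test.png"))` would then list the text cells that EasyOCR itself is unsure about, which can be compared against the cells that look wrong in the docling output.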

In the meantime, I checked with another OCR engine supported in docling (ocrmac, which works only on macOS), and I get this result:

docling --to html --to json --ocr-lang "vi-VT" --ocr-engine ocrmac test.png

Dec 18 '24 11:12 cau-git

Closing this because of inactivity. Please feel free to reopen if there is further demand.

May 20 '25 18:05 cau-git

Please check this again. After all this time, the results still contain many unknown words, for example as follows:

May 21 '25 10:05 Hunglmc