
Standalone EasyOCR gives much better results than EasyOCR in docling [tested with Vietnamese]

Open jonaskahn opened this issue 1 year ago • 3 comments

Bug

It's weird: when I use EasyOCR on Hugging Face or the demo site, the result is much better than with docling. I do not understand what is happening. I am trying to debug the code, and I would also like to hear an answer from the developers.

Reproduce

Here is my code

from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True  # run OCR on the page image
pipeline_options.do_table_structure = True
pipeline_options.ocr_options.use_gpu = False  # CPU only
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.ocr_options.lang = ["vi"]  # Vietnamese

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend
        ),
        InputFormat.IMAGE: PdfFormatOption(
            pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend
        )
    }
)

test2 = doc_converter.convert("/home/jonas/Pictures/Screenshots/1.png")
print(test2.document.export_to_dict())

I/O:

  • Input: Test-image
  • Output in docling: https://justpaste.it/gmk92

Sample of the text docling returned (not accurate): 'orig': 'BỌ LAO ĐỘNG THƯƠNG BINH VÀ XÃ HỌI', 'text': 'BỌ LAO ĐỘNG THƯƠNG BINH VÀ XÃ HỌI', 'level': 1},

  • EasyOCR demo: https://www.jaided.ai/easyocr The text it returns is more accurate.

Docling version

2.7.0

Python version

3.10

EasyOCR version

1.7.2


jonaskahn avatar Nov 25 '24 11:11 jonaskahn

@jonaskahn could you please try the input image again with the latest version of Docling and highlight where you think there are discrepancies between the output of Docling and the output of EasyOCR.

nikos-livathinos avatar Dec 11 '24 14:12 nikos-livathinos

I tried with the latest version; the result is still the same. Left side: Docling (less correct). Right side: EasyOCR (more correct). image

jonaskahn avatar Dec 16 '24 11:12 jonaskahn

@jonaskahn I re-checked this, and I can see that many of the predicted text cells in EasyOCR come out with very low confidence. Can you please provide a minimal code snippet to run it through EasyOCR natively?

In the meantime, I checked with another OCR engine supported in docling (ocrmac, which works only on macOS) and got this result: image

docling --to html --to json --ocr-lang "vi-VT" --ocr-engine ocrmac test.png
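For reference, a minimal sketch of what running the image through EasyOCR natively might look like, so the per-line confidence values can be compared with what docling reports. It assumes the `easyocr` package is installed and uses its `Reader` / `readtext` API. The helper names `run_easyocr` and `split_by_confidence` are hypothetical, and the 0.5 threshold is an arbitrary illustrative value, not a setting taken from docling.

```python
from typing import List, Sequence, Tuple

# One EasyOCR detection: (bounding box, recognized text, confidence in [0, 1]).
Detection = Tuple[Sequence, str, float]


def run_easyocr(image_path: str, langs=("vi",), use_gpu: bool = False) -> List[Detection]:
    """Run EasyOCR directly on an image, without going through docling."""
    # Imported lazily so split_by_confidence can be used without easyocr installed.
    import easyocr

    reader = easyocr.Reader(list(langs), gpu=use_gpu)
    # readtext returns a list of (bbox, text, confidence) entries.
    return reader.readtext(image_path)


def split_by_confidence(
    detections: List[Detection], threshold: float = 0.5
) -> Tuple[List[Detection], List[Detection]]:
    """Split detections into (kept, dropped) around a confidence threshold.

    The default threshold is an illustrative assumption only.
    """
    kept = [d for d in detections if d[2] >= threshold]
    dropped = [d for d in detections if d[2] < threshold]
    return kept, dropped


def format_detections(detections: List[Detection]) -> str:
    """Render one 'confidence  text' line per detection for easy diffing."""
    return "\n".join(f"{conf:.2f}  {text}" for _, text, conf in detections)
```

Calling `run_easyocr("/home/jonas/Pictures/Screenshots/1.png")` and printing `format_detections(...)` on the result would list every recognized line with its confidence. Passing the same result to `split_by_confidence` shows which lines fall below a chosen cutoff, which may help narrow down where the two outputs diverge.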

cau-git avatar Dec 18 '24 11:12 cau-git

Closing this because of inactivity. Please feel free to reopen if there is further demand.

cau-git avatar May 20 '25 18:05 cau-git

Please check this again. After a long time, the results still contain many unknown words, for example:

Image

Hunglmc avatar May 21 '25 10:05 Hunglmc