[improvement] `.render()` isn't that robust - wrong ordered results

Open kripper opened this issue 2 months ago • 12 comments

Bug description

The default OCR model works very well, but the render() algorithm which converts coordinates to text positions is very buggy. This causes lines originally placed at the top to be positioned between other lines at the bottom, making the overall result unusable for LLM inference.

I wonder if you have considered reusing the algorithm implemented in Tesseract. They probably solved the same problem many years ago. And I also wonder why the Tesseract team is not integrating the doctr engine into Tesseract :-)

Good job! You are leading the OCR leaderboard.

I attached a sample .PDF file and a snippet to reproduce the problem. I checked other similar inactive issues, so I'm afraid rendering to text is currently not a hot topic :-( ...but how are we supposed to feed our hungry LLMs?

Code snippet to reproduce the bug

import argparse
import os
import json

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

def convert_pdf_to_txt(input_pdf, output_txt):
  """
  Converts a PDF file to a text file using DocTR OCR.

  Args:
      input_pdf (str): Path to the input PDF file.
      output_txt (str): Path to the output text file.
  """

  print("Load pre-trained OCR model")
  model = ocr_predictor(pretrained=True)

  # Ensure input PDF exists
  if not os.path.exists(input_pdf):
    raise ValueError(f"Input PDF file '{input_pdf}' does not exist.")

  # Load the PDF document
  try:
    doc = DocumentFile.from_pdf(input_pdf)
  except Exception as e:
    raise ValueError(f"Error loading PDF '{input_pdf}': {e}")

  # Perform OCR and extract text
  try:
    result = model(doc)
    #exp = result.export()
    #text = json.dumps(exp)
    text = result.render()
  except Exception as e:
    raise ValueError(f"Error performing OCR on '{input_pdf}': {e}")

  # Write extracted text to output file
  with open(output_txt, 'w', encoding='utf-8') as f:
    f.write(text)

  print(f"PDF '{input_pdf}' converted to text file '{output_txt}'.")

if __name__ == "__main__":
  parser = argparse.ArgumentParser(description="Convert PDF to text using DocTR OCR")
  parser.add_argument("input_pdf", help="Path to the input PDF file")
  parser.add_argument("output_txt", help="Path to the output text file")
  args = parser.parse_args()

  convert_pdf_to_txt(args.input_pdf, args.output_txt)

Error traceback

No error

Environment

Linux, conda, python 3.9

Deep Learning backend

Default model. test-ocr.pdf

kripper avatar May 06 '24 15:05 kripper

Hi @kripper :wave:,

Thanks for reporting :)

The issue here is that pages 2 and 3 contain small rotations. Could you try passing assume_straight_pages=False to the ocr_predictor instance? :)

felixdittrich92 avatar May 08 '24 06:05 felixdittrich92

Predictor initiated with:

model = ocr_predictor(pretrained=True, assume_straight_pages=False)

But the problem persists on page 1:

Notario y Conservador de Bienes Raices Licanten Vilma Beatriz Navarro
<--- "Reyes" SHOULD GO HERE
Certifico que el presente documento electronico es copia fiel e integra de
CERTIFICADO otorgado el 26 de Abril de 2024 reproducido en las siguientes

Reyes <-------- BUT WAS PLACED HERE

paginas.

Also note that the OCR'ed page (page 1) is a clean PDF page. The second page is an image and assume_straight_pages could help here.

kripper avatar May 08 '24 08:05 kripper

From what I have also seen, the models sometimes predict lines in the wrong block, even though their coordinates are correct. This is why the render() method returns the text mixed up: it is only a set of nested for loops going over all the pages, blocks, lines and words. To work around it I did the following. It somewhat messes up the line breaks, but it preserves the order:

def sort_by_coordinates(element):
    return (element.geometry[0][1], element.geometry[0][0]) 

result = model(doc)
text = ""
 
for page in result.pages:
    line_list = []
    
    for block in page.blocks:
        line_list.extend(block.lines)
        
    sorted_lines = sorted(line_list, key=sort_by_coordinates)
    
    for line in sorted_lines:
        for word in line.words:
            text += word.text + " "
        text += "\n"
        
    text += "\n"
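
A possible refinement of the sort above, offered as a rough sketch rather than anything doctr does itself: a strict (y, x) sort can interleave two boxes that sit on the same visual row when their top edges differ slightly. The helper below groups boxes into rows when their vertical centres fall within a tolerance, then orders each row left to right. `group_into_rows` and `y_tol` are hypothetical names, and the boxes are assumed to be in the `((xmin, ymin), (xmax, ymax))` form used by `element.geometry` on straight pages.

```python
from typing import List, Tuple

Box = Tuple[Tuple[float, float], Tuple[float, float]]  # ((xmin, ymin), (xmax, ymax))


def group_into_rows(items: List[Tuple[str, Box]], y_tol: float = 0.01) -> List[List[str]]:
    """Group (text, box) pairs into rows, top to bottom, left to right.

    Hypothetical helper: `y_tol` is the largest allowed distance between an
    item's vertical centre and the mean centre of the current row, expressed
    in the same units as the boxes.
    """
    def y_center(box: Box) -> float:
        return (box[0][1] + box[1][1]) / 2

    rows = []  # each entry: [sum of centres, [(xmin, text), ...]]
    for text, box in sorted(items, key=lambda item: y_center(item[1])):
        yc = y_center(box)
        if rows and abs(yc - rows[-1][0] / len(rows[-1][1])) <= y_tol:
            # Close enough to the current row: append and update the running sum.
            rows[-1][0] += yc
            rows[-1][1].append((box[0][0], text))
        else:
            # Too far below the current row: start a new one.
            rows.append([yc, [(box[0][0], text)]])
    # Within each row, order by xmin and keep only the text.
    return [[text for _, text in sorted(members)] for _, members in rows]
```

The pairs could be built from the lines collected above, for example `(" ".join(w.text for w in line.words), line.geometry)`. The tolerance would need tuning per document, and multi-column layouts would still need extra handling.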

Cata400 avatar May 08 '24 14:05 Cata400

@kripper Have you already tried to disable block and/or line resolving ? https://mindee.github.io/doctr/using_doctr/using_models.html#two-stage-approaches

resolve_blocks=False resolve_lines=False
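
With block and line resolution switched off, the words can still be collected from the exported dictionary and ordered by hand. A minimal sketch, assuming the layout returned by `result.export()` (pages, then blocks, then lines, then words, each word carrying `value` and `geometry`). `iter_words` is a hypothetical helper, not part of doctr. The geometry is reduced to an axis-aligned box so that both the two-point form and the four-point polygons expected with `assume_straight_pages=False` are handled.

```python
def iter_words(export):
    """Yield (page_idx, text, (xmin, ymin, xmax, ymax)) for every word.

    Operates on the plain dict assumed to come from `result.export()`.
    Block and line grouping is deliberately ignored.
    """
    for page_idx, page in enumerate(export["pages"]):
        for block in page["blocks"]:
            for line in block["lines"]:
                for word in line["words"]:
                    # geometry is a sequence of (x, y) points: 2 for a straight
                    # box, 4 for a rotated polygon (assumption, see lead-in).
                    xs = [point[0] for point in word["geometry"]]
                    ys = [point[1] for point in word["geometry"]]
                    yield page_idx, word["value"], (min(xs), min(ys), max(xs), max(ys))


# Intended usage (requires doctr, not executed here):
# model = ocr_predictor(pretrained=True, resolve_blocks=False, resolve_lines=False)
# words = list(iter_words(model(doc).export()))
```

The resulting tuples can then be sorted or clustered by coordinates independently of whatever grouping the predictor produced.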

felixdittrich92 avatar May 08 '24 14:05 felixdittrich92

> @kripper Have you already tried to disable block and/or line resolving ? https://mindee.github.io/doctr/using_doctr/using_models.html#two-stage-approaches
>
> resolve_blocks=False resolve_lines=False

It's now mixing blocks multiple times per line.

What about taking a look at Tesseract's implementation?

kripper avatar May 08 '24 15:05 kripper

> > @kripper Have you already tried to disable block and/or line resolving ? https://mindee.github.io/doctr/using_doctr/using_models.html#two-stage-approaches resolve_blocks=False resolve_lines=False
>
> It's now mixing blocks multiple times per line.
>
> What about taking a look at Tesseract's implementation?

Sure :) Do you have a direct reference to the code or algorithm ?

felixdittrich92 avatar May 08 '24 15:05 felixdittrich92

> Do you have a direct reference to the code or algorithm ?

No, but I will research tomorrow.

kripper avatar May 08 '24 15:05 kripper

Have you tried existing tools to convert doctr's HOCR output to text? There are many. Tesseract probably is also using some of them.

kripper avatar May 08 '24 16:05 kripper

> Have you tried existing tools to convert doctr's HOCR output to text? There are many. Tesseract probably is also using some of them.

Yeah you can use doctr's XML/hocr output to create PDF/A files for example with OCRmyPDF
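
For plain text rather than a PDF, the hOCR can also be parsed directly. A small sketch using only the standard library, assuming standard hOCR conventions: words are elements with `class="ocrx_word"` and a `title` attribute containing `bbox x0 y0 x1 y1` with integer pixel coordinates. `hocr_words` is a hypothetical helper. The per-page XML is assumed to come from `result.export_as_xml()`, which to my knowledge returns one `(bytes, ElementTree)` pair per page.

```python
import re
import xml.etree.ElementTree as ET

# Matches the integer bounding box in an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def hocr_words(hocr):
    """Return [(text, (x0, y0, x1, y1)), ...] for each `ocrx_word` element.

    `hocr` is a str or bytes holding one hOCR page. Elements without a
    parsable bbox are skipped.
    """
    root = ET.fromstring(hocr)
    words = []
    for element in root.iter():
        # Attributes are not namespaced, so this works with or without xmlns.
        if element.get("class") != "ocrx_word":
            continue
        match = BBOX_RE.search(element.get("title", ""))
        if match:
            text = "".join(element.itertext()).strip()
            words.append((text, tuple(int(value) for value in match.groups())))
    return words


# Intended usage (requires doctr, not executed here):
# xml_bytes, _tree = result.export_as_xml()[0]
# words = hocr_words(xml_bytes)
```

The word boxes obtained this way can be ordered with any coordinate-based strategy, independently of the block and line structure in the hOCR.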

felixdittrich92 avatar May 08 '24 16:05 felixdittrich92

> sometimes the models predict lines in the wrong block

The synthesized page looks fine. Identifying lines shouldn't be that difficult IMO.

[image: out]

kripper avatar May 08 '24 17:05 kripper

> > sometimes the models predict lines in the wrong block
>
> The synthesized page looks fine. Identifying lines shouldn't be that difficult IMO.
>
> [image: out]

Depends on the document's layout ^^ And layouts differ a lot (rotated, block text, etc.)

felixdittrich92 avatar May 08 '24 18:05 felixdittrich92