
return_word_box parameter with unexpected behavior

Open denpawy opened this issue 1 month ago • 4 comments

🔎 Search before asking

  • [x] I have searched the PaddleOCR Docs and found no similar bug report.
  • [x] I have searched the PaddleOCR Issues and found no similar bug report.
  • [x] I have searched the PaddleOCR Discussions and found no similar bug report.

🐛 Bug (Problem description)

I am writing a processing pipeline and use PaddleOCR as one of its steps because of the excellent quality and results I have seen so far. I would now like to use the return_word_box parameter of the PaddleOCR Python package (https://github.com/PaddlePaddle/PaddleOCR).

I created a test PDF; for now I am focusing on German documents.

The issue is that, during word segmentation, PaddleOCR splits email addresses and words containing German diacritics into separate pieces.

The following are examples from the first page's result, obtained with PaddleOCR(use_angle_cls=True, lang="de", use_doc_unwarping=False, det_limit_side_len=4096, det_limit_type="max", return_word_box=True).ocr(img_path):

Value in rec_texts (index 7): 'Mit freundlichen Grüßen'
Value in text_word (index 7): ['Mit', ' ', 'freundlichen', ' ', 'Gr', 'üß', 'en']

Value in rec_texts (index 32): 'Manchmal begegnet man auch ungewöhnlichen Adressen wie alex.k-93@devmail.io oder'
Value in text_word (index 32): ['Manchmal', ' ', 'begegnet', ' ', 'man', ' ', 'auch', ' ', 'ungew', 'ö', 'hnlichen', ' ', 'Adressen', ' ', 'wie', ' ', 'alex', '.', 'k-93', '@', 'devmail', '.', 'io', ' ', 'oder']

I consider this a bug, which is why I did not file it as a Q&A. Please correct me if I am wrong.

Here is my PDF, which I converted into one image per page and then passed to the ocr method of PaddleOCR: email_t_3.pdf

🏃‍♂️ Environment (Runtime environment)

OS: Windows 11 Enterprise
OS build: 26200.7171
Environment: FastAPI
CPU: 13th Gen Intel(R) Core(TM) i9-13980HX (2.20 GHz)
RAM: 64 GB
CUDA: None
Install: Poetry/pip
Python: 3.12
PaddleOCR: 3.3.1

🌰 Minimal Reproducible Example (Minimal demo that reproduces the problem)


import fitz
import tempfile
from paddleocr import PaddleOCR

def test_word_splitting(pdf_path):
    """Render each PDF page to a temporary image, run OCR, and print rec_texts next to text_word."""
    temp_images = []
    pdf_document = fitz.open(pdf_path)

    ocr = PaddleOCR(
        use_angle_cls=True,
        lang="de",
        use_doc_unwarping=False,
        det_limit_side_len=4096,
        det_limit_type="max",
        return_word_box=True
    )

    for page_num in range(pdf_document.page_count):
        page = pdf_document[page_num]
        pix = page.get_pixmap()

        with tempfile.NamedTemporaryFile(suffix='.png') as temp_file:
            pix.save(temp_file.name)
            temp_images.append(temp_file.name)

            result = ocr.predict(temp_file.name)
            for page_result in result:
                if page_result is None:
                    continue

                for i, rec_text in enumerate(page_result["rec_texts"]):
                    word_boxes = page_result["text_word"][i]  # Word box information

                    print(f"Index:         {i}")
                    print(f"Original text: {rec_text}")
                    print(f"Word boxes:    {word_boxes}")
                    print("-" * 50)

    pdf_document.close()
    return temp_images

if __name__ == "__main__":
    pdf_path = "./email_t_3.pdf"
    test_word_splitting(pdf_path)

denpawy avatar Nov 21 '25 13:11 denpawy

Thanks for your feedback. We’ll look into this issue shortly.

scyyh11 avatar Nov 24 '25 07:11 scyyh11

The split comes directly from the way return_word_box groups characters. In ppocr/postprocess/rec_postprocess.py, BaseRecLabelDecode.get_word_info classifies each decoded character:

  • letters/digits matching [a-zA-Z0-9] get the state "en&num"
  • Chinese characters get "cn"
  • everything else becomes "splitter"

Only characters inside a single state are grouped into one entry. For email addresses, the sequences alex, k-93, devmail, io stay in "en&num" because they are alphanumeric (and - is explicitly allowed when the previous state is "en&num"). However, symbols such as . and @ do not match [a-zA-Z0-9], so they switch the state to "splitter". Each state change closes the current group, so the email breaks into alex . k-93 @ devmail . io.
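The grouping rule described above can be illustrated with a small standalone sketch. This is not the library's code: the function names `classify` and `group_by_state` are made up for illustration, the Chinese range `\u4e00`-`\u9fff` is an assumption, and any further special cases the real implementation may have are left out. It only mirrors the three states and the `-` exception as described in this comment.

```python
import re


def classify(char):
    """Return the state for one character, following the three rules above."""
    # Assumption: "Chinese characters" means the CJK Unified Ideographs block.
    if "\u4e00" <= char <= "\u9fff":
        return "cn"
    if re.fullmatch(r"[a-zA-Z0-9]", char):
        return "en&num"
    return "splitter"


def group_by_state(text):
    """Group consecutive characters that share the same state."""
    groups, current, state = [], "", None
    for char in text:
        char_state = classify(char)
        # '-' stays inside an alphanumeric word, as described above.
        if char == "-" and state == "en&num":
            char_state = "en&num"
        # A state change closes the current group.
        if current and char_state != state:
            groups.append(current)
            current = ""
        current += char
        state = char_state
    if current:
        groups.append(current)
    return groups


print(group_by_state("Mit freundlichen Grüßen"))
# ['Mit', ' ', 'freundlichen', ' ', 'Gr', 'üß', 'en']
print(group_by_state("alex.k-93@devmail.io"))
# ['alex', '.', 'k-93', '@', 'devmail', '.', 'io']
```

Under these assumptions the sketch reproduces both reported splits: `ü`, `ß` and `ö` do not match `[a-zA-Z0-9]`, so they land in the "splitter" state just like `.` and `@`.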

In other words, return_word_box currently treats punctuation as separators, so email components are reported separately even though the full string is correct in rec_texts. To keep the whole address together you’d need to extend the grouping logic so that characters like ., @, _, +, etc., can remain in the "en&num" state when they appear inside an alphanumeric word.
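As a rough illustration of what such a relaxed rule might look like, here is a standalone sketch that works on plain strings only. It is not a patch for PaddleOCR: `group_relaxed` and `INNER_PUNCT` are hypothetical names, the sketch uses Python's `str.isalnum()` so that letters such as `ö`, `ü` and `ß` count as word characters, and it keeps a punctuation character inside a word only when it sits directly between two alphanumeric characters. It also ignores the separate "cn" state, so Chinese characters would be merged with adjacent letters.

```python
# Hypothetical set of characters allowed inside a word (e.g. in email addresses).
INNER_PUNCT = set(".@_+-")


def group_relaxed(text):
    """Group characters into word and non-word runs with a relaxed word rule."""
    groups, current, current_is_word = [], "", None
    for i, char in enumerate(text):
        prev_ok = i > 0 and text[i - 1].isalnum()
        next_ok = i + 1 < len(text) and text[i + 1].isalnum()
        # Unicode letters/digits are word characters; selected punctuation
        # counts too, but only when surrounded by alphanumerics.
        is_word = char.isalnum() or (char in INNER_PUNCT and prev_ok and next_ok)
        if current and is_word != current_is_word:
            groups.append(current)
            current = ""
        current += char
        current_is_word = is_word
    if current:
        groups.append(current)
    return groups


print(group_relaxed("Mit freundlichen Grüßen"))
# ['Mit', ' ', 'freundlichen', ' ', 'Grüßen']
print(group_relaxed("wie alex.k-93@devmail.io oder"))
# ['wie', ' ', 'alex.k-93@devmail.io', ' ', 'oder']
```

A trailing period, as in `Ende.`, is still split off because no alphanumeric character follows it.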

We’ll discuss this behavior internally and consider relaxing the grouping rule for email-like patterns in a future update.

scyyh11 avatar Nov 24 '25 07:11 scyyh11

So I'd be better off writing a custom aggregation that splits the rec_texts strings on spaces and merges the matching text_word entries and their coordinates?

And regarding diacritics like ä, ö, ü: is there any option to pass a custom regex to the splitter? I'd rather not fork, but I'm open to any other algorithmic approach.
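A minimal sketch of the aggregation idea, under stated assumptions: each text_word entry is assumed to come with one axis-aligned box `(x_min, y_min, x_max, y_max)`, and the `boxes` argument and the function name `merge_on_spaces` are hypothetical. The actual structure of the word boxes returned by PaddleOCR should be checked before adapting this.

```python
def merge_on_spaces(tokens, boxes):
    """Merge tokens between whitespace entries and union their boxes.

    tokens: list of strings, e.g. the text_word entry for one line.
    boxes:  assumed list of (x_min, y_min, x_max, y_max), one per token.
    """
    merged = []
    text, box = "", None
    for token, (x0, y0, x1, y1) in zip(tokens, boxes):
        if token.strip() == "":
            # A whitespace token ends the current word.
            if text:
                merged.append((text, box))
            text, box = "", None
            continue
        text += token
        if box is None:
            box = (x0, y0, x1, y1)
        else:
            # Grow the box so it covers every piece of the word.
            box = (min(box[0], x0), min(box[1], y0),
                   max(box[2], x1), max(box[3], y1))
    if text:
        merged.append((text, box))
    return merged


# Made-up coordinates, for illustration only.
tokens = ["Mit", " ", "Gr", "üß", "en"]
boxes = [(0, 0, 30, 10), (30, 0, 35, 10), (35, 0, 50, 10), (50, 0, 65, 11), (65, 1, 80, 10)]
print(merge_on_spaces(tokens, boxes))
# [('Mit', (0, 0, 30, 10)), ('Grüßen', (35, 0, 80, 11))]
```

This treats every whitespace token as a word boundary, so pieces such as `Gr`, `üß`, `en` or the parts of an email address end up in one entry, while trailing punctuation stays attached to the preceding word.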

denpawy avatar Nov 24 '25 07:11 denpawy