PaddleOCR Smoothing color images and black and white text images for OCR

I have implemented OCR to recognize numbers in documents, and I then mask sensitive numbers such as a National Security Number or Resident Registration Number. The input folder contains a mix of color and black-and-white images (passports, business documents).

The problem is that the OCR sometimes fails to detect a number and skips it, as in the image "Resident Registration Number". The OCR skips the Resident Registration Number even though it is clearly legible.

How can I solve this problem?

Code

import glob
import os

import cv2
import regex
from paddleocr import PaddleOCR

pattern_8 = r'-\d{8}'
pattern_4 = r'-\d{4}'
my_path = "/media/cvpr/CM_1/COREMAX/testing/"


# The keyword is use_angle_cls; an unknown keyword such as use_angle is silently ignored.
ocr = PaddleOCR(rec=True, use_angle_cls=True, lang='korean', use_gpu=True)
face_cascade = cv2.CascadeClassifier('/media/cvpr/CM_1/pytesseract/haarcascade_frontalface_default.xml')

save_path = '/media/cvpr/CM_1/COREMAX/paddle/'

# Regex
passport_pattern = '^[A-Z0-9<]{9}[0-9]{1}[A-Z]{3}[0-9]{7}[A-Z]{1}[0-9]{7}[A-Z0-9<]{14}[0-9]{2}$'

for img in glob.glob(my_path + '*.*'):
    img_bgr_rgb = cv2.imread(img)
    file_Name = os.path.basename(img)
    #image = img_bgr_rgb[:, :, ::-1]

    # Not Good results
    #thresh, im_bw = cv2.threshold(image, 210, 230, cv2.THRESH_BINARY)
    #cv2.imwrite("bw_image.jpg", im_bw)

    face_data = face_cascade.detectMultiScale(img_bgr_rgb, 1.3, 5)
    for (x, y, w, h) in face_data:
        roi = img_bgr_rgb[y:y + h, x:x + w]
        # The third positional argument is sigmaX; cv2.BORDER_ISOLATED equals 16, so 16 keeps the same blur.
        roi = cv2.GaussianBlur(roi, (25, 25), 16)
        img_bgr_rgb[y:y + roi.shape[0], x:x + roi.shape[1]] = roi

    result = ocr.ocr(img_bgr_rgb, cls=True)

    for box, (text, _score) in result:
        text = str(text)
        if regex.search(r'\.[0-9.]+', text):
            pass  # dotted number: mask it without printing
        elif '-' in text:
            print(text)
        else:
            continue
        # Mask the line using the top-left and bottom-right corners of its box.
        x1, y1 = int(box[0][0]), int(box[0][1])
        x2, y2 = int(box[2][0]), int(box[2][1])
        cv2.rectangle(img_bgr_rgb, (x1, y1), (x2, y2), (255, 255, 224), cv2.FILLED)
        cv2.imwrite(os.path.join(save_path, file_Name), img_bgr_rgb)

Output of OCR

5318-864-2206-293
21i-87-50168
서울득벌시 강납구 논현로 149 길 67-7
57-7 Noihyeou-r0149-Bi
Gangnam-gu Seoull Korea
니1sd O -ilg s- [iSlticl 7a8 0016

Output Image


Aug 03 '22 04:08 khawar-islam

The numbers are missed because PaddleOCR's text recognition model performs poorly on digits.

Current text detection result:

Current text recognition result for the Resident Registration Number. The recognition quality for this number is clearly poor.

Recognition errors can also be reduced by adjusting some parameters. In the case you provided, setting use_dilation=True to expand the text detection box allows the Resident Registration Number to be recognized.

To set use_dilation to True:

from paddleocr import PaddleOCR, draw_ocr

ocr = PaddleOCR(lang='korean', use_gpu=True, use_angle=True, use_dilation=True)

However, the Korean recognition model is trained on synthetic data. Without real data in the training set, its recognition quality is limited. I recommend labeling real Korean data and retraining the Korean recognition model.
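For intuition about what use_dilation does: as far as I understand, it grows the binary text mask a little before boxes are fitted, so the resulting box is slightly larger. The toy example below shows the effect of a binary dilation on a bounding box. It is only an illustration in plain Python, not PaddleOCR's implementation; the helper names and the 3x3 neighbourhood are assumptions.

```python
def dilate(mask):
    """Binary dilation with a 3x3 neighbourhood: a pixel becomes 1 if it or any neighbour is 1."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx]:
                        out[y][x] = 1
    return out


def bounding_box(mask):
    """Axis-aligned bounding box (x1, y1, x2, y2) of all non-zero pixels."""
    ys = [y for y, row in enumerate(mask) for v in row if v]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    return min(xs), min(ys), max(xs), max(ys)


# Toy 5x7 mask: the 1s stand for pixels the detector marked as text.
mask = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]

print(bounding_box(mask))          # box before dilation
print(bounding_box(dilate(mask)))  # box after dilation, one pixel larger on each side
```

A slightly larger box gives the recognition model more margin around the characters, which may be why the number is picked up in this case.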

Aug 03 '22 07:08 LDOUBLEV

What is your PaddleOCR version? I tried your code, but it did not work for me; maybe there is a version problem.
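A quick way to check the installed version from Python, as a minimal sketch using only the standard library (the helper name installed_version is just illustrative):

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string, or None when the package is not installed.
print(installed_version("paddleocr"))
```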

Aug 03 '22 07:08 khawar-islam

paddleocr 2.5.0.3

the code

from PIL import Image
from paddleocr import PaddleOCR, draw_ocr

# Needs to run only once to download and load the model into memory.
ocr = PaddleOCR(lang='korean', use_gpu=True, use_angle=True, use_dilation=True)
img_path = './182526678-562dcba8-af71-4e40-9f69-7f44d9848c9f.png'
result = ocr.ocr(img_path, cls=False)
for line in result:
    print(line)

# Draw the detected boxes, texts and scores onto the image.
image = Image.open(img_path).convert('RGB')
boxes = [line[0] for line in result]
txts = [line[1][0] for line in result]
scores = [line[1][1] for line in result]
im_show = draw_ocr(image, boxes, txts, scores, font_path='doc/fonts/korean.ttf')
im_show = Image.fromarray(im_show)
im_show.save('result.jpg')
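Once the number is recognized, the masking step from the question could be narrowed to lines that look like a registration number, instead of any line that contains a hyphen. The sketch below assumes the 6-digit, hyphen, 7-digit layout of a Korean resident registration number and the [box, (text, score)] result layout used in this thread. The names RRN_PATTERN, box_to_rect and find_rrn_rects, and the sample data, are made up for illustration.

```python
import re

# 6 digits, a hyphen, then 7 digits; OCR output often has stray spaces around the hyphen.
RRN_PATTERN = re.compile(r'(?<!\d)\d{6}\s*-\s*\d{7}(?!\d)')


def box_to_rect(box):
    """Convert a 4-point quadrilateral into an axis-aligned (x1, y1, x2, y2) rectangle."""
    xs = [int(p[0]) for p in box]
    ys = [int(p[1]) for p in box]
    return min(xs), min(ys), max(xs), max(ys)


def find_rrn_rects(result):
    """Return a rectangle for every OCR line whose text contains an RRN-like number."""
    rects = []
    for box, (text, _score) in result:
        if RRN_PATTERN.search(text):
            rects.append(box_to_rect(box))
    return rects


# Hypothetical OCR lines (fake number, fake coordinates) in the [box, (text, score)] layout.
sample_result = [
    [[[10.0, 12.0], [210.0, 10.0], [211.0, 40.0], [11.0, 42.0]], ('901231 - 1234567', 0.93)],
    [[[10.0, 60.0], [300.0, 60.0], [300.0, 90.0], [10.0, 90.0]], ('Gangnam-gu Seoul Korea', 0.88)],
]

print(find_rrn_rects(sample_result))
```

Taking the min and max over all four corners, rather than only the first and third points, keeps the rectangle correct when the detected box is slightly rotated. Each returned rectangle can be passed to cv2.rectangle as in the question.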

Aug 03 '22 08:08 LDOUBLEV

I tested it on 2.5.0, but it did not work for me. Did you try it with my code?

Aug 03 '22 08:08 khawar-islam
