
Smoothing color images and black and white text images for OCR

Open khawar-islam opened this issue 2 years ago • 4 comments

I have implemented OCR to recognize numbers in documents; later I will mask numbers such as the National Security Number or Resident Registration Number. The folder contains different types of color and black-and-white images (passports, business documents).

The problem is that the OCR sometimes fails to detect a number and skips it, as with the "Resident Registration Number" in the image. The OCR skips the Resident Registration Number even though it is clearly legible.

How can I solve this problem?

Code

import glob
import re
import string
from itertools import chain

# import self as self

from paddleocr import PaddleOCR, draw_ocr, PPStructure
import cv2
import os
import regex

pattern_8 = r'-\d{8}'
pattern_4 = r'-\d{4}'
my_path = "/media/cvpr/CM_1/COREMAX/testing/"


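# use_angle_cls (not use_angle) is the keyword that loads the angle classifier needed by cls=True below
ocr = PaddleOCR(rec=True, use_angle_cls=True, lang='korean', use_gpu=True)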
face_cascade = cv2.CascadeClassifier('/media/cvpr/CM_1/pytesseract/haarcascade_frontalface_default.xml')

save_path = '/media/cvpr/CM_1/COREMAX/paddle/'

# Regex
passport_pattern = '^[A-Z0-9<]{9}[0-9]{1}[A-Z]{3}[0-9]{7}[A-Z]{1}[0-9]{7}[A-Z0-9<]{14}[0-9]{2}$'

for img in glob.glob(my_path + '*.*'):
    img_bgr_rgb = cv2.imread(img)
    file_Name = os.path.basename(img)
    #image = img_bgr_rgb[:, :, ::-1]

    # Not Good results
    #thresh, im_bw = cv2.threshold(image, 210, 230, cv2.THRESH_BINARY)
    #cv2.imwrite("bw_image.jpg", im_bw)

    face_data = face_cascade.detectMultiScale(img_bgr_rgb, 1.3, 5)
    for (x, y, w, h) in face_data:
        roi = img_bgr_rgb[y:y + h, x:x + w]
        roi = cv2.GaussianBlur(roi, (25, 25), cv2.BORDER_ISOLATED)
        img_bgr_rgb[y:y + roi.shape[0], x:x + roi.shape[1]] = roi

    result = ocr.ocr(img_bgr_rgb, cls=True)

    for x in result:
        if regex.search(r'\.[0-9.]+', str(x[1][0])):
            #print(x[1][0])
            x1 = int(x[0][0][0])
            y1 = int(x[0][0][1])
            x2 = int(x[0][2][0])
            y2 = int(x[0][2][1])
            cv2.rectangle(img_bgr_rgb, (x1, y1), (x2, y2), (255, 255, 224), cv2.FILLED)
            cv2.imwrite(os.path.join(save_path, file_Name), img_bgr_rgb)
        elif '-' in str(x[1][0]):
            print(x[1][0])
            x1 = int(x[0][0][0])
            y1 = int(x[0][0][1])
            x2 = int(x[0][2][0])
            y2 = int(x[0][2][1])
            cv2.rectangle(img_bgr_rgb, (x1, y1), (x2, y2), (255, 255, 224), cv2.FILLED)
            cv2.imwrite(os.path.join(save_path, file_Name), img_bgr_rgb)

Output of OCR

5318-864-2206-293
21i-87-50168
서울득벌시 강납구 논현로 149 길 67-7
57-7 Noihyeou-r0149-Bi
Gangnam-gu Seoull Korea
니1sd O -ilg s- [iSlticl 7a8 0016

Output Image image

khawar-islam avatar Aug 03 '22 04:08 khawar-islam

The numbers are missed because PaddleOCR's text recognition model performs poorly on digits.

Current text detection result: image

Current text recognition result for the Resident Registration Number. It is easy to see that the number is recognized poorly.

image

That said, adjusting some parameters can reduce recognition errors. In the case you provided, setting use_dilation=True enlarges the text detection box, and the Resident Registration Number is then recognized.

image
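
For intuition, here is a toy sketch of what dilating a detection mask does to the resulting box. It is not PaddleOCR's implementation: the `dilate` and `bounding_box` helpers and the mask are invented for illustration, and the kernel size PaddleOCR really uses may differ. The point is only that growing the text mask by a pixel or so also grows the box cut out for recognition, so digits sitting at the edge of a tight box are less likely to be clipped.

```python
def dilate(mask, radius=1):
    """Grow every foreground pixel of a 0/1 mask by `radius` pixels in each direction."""
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                # Mark the whole neighbourhood, clamped to the image borders.
                for yy in range(max(0, y - radius), min(height, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(width, x + radius + 1)):
                        out[yy][xx] = 1
    return out


def bounding_box(mask):
    """Return (x1, y1, x2, y2) of the foreground pixels, inclusive."""
    points = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


# Toy mask (made up): one thin line of "text" pixels in a 5 x 8 image.
mask = [[0] * 8 for _ in range(5)]
mask[2][2:6] = [1, 1, 1, 1]

box_before = bounding_box(mask)
box_after = bounding_box(dilate(mask))
print(box_before, box_after)
```

The dilated box is one pixel larger on every side than the original one.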

To set use_dilation to True:

from paddleocr import PaddleOCR, draw_ocr

ocr = PaddleOCR(lang='korean', use_gpu=True, use_angle=True, use_dilation=True)

However, the Korean recognition model was trained on synthetic data only. Without real data in training, recognition quality is limited. I recommend labeling real Korean data and retraining the Korean recognition model.

LDOUBLEV avatar Aug 03 '22 07:08 LDOUBLEV

What is your PaddleOCR version? I tried your code but it didn't work for me; maybe it is a version problem.

khawar-islam avatar Aug 03 '22 07:08 khawar-islam

paddleocr 2.5.0.3

The code:

from paddleocr import PaddleOCR, draw_ocr

ocr = PaddleOCR(lang='korean', use_gpu=True, use_angle=True, use_dilation=True)  # need to run only once to download and load model into memory
img_path = './182526678-562dcba8-af71-4e40-9f69-7f44d9848c9f.png'
result = ocr.ocr(img_path, cls=False)
for line in result:
    print(line)

from PIL import Image

image = Image.open(img_path).convert('RGB')
boxes = [line[0] for line in result]
txts = [line[1][0] for line in result]
scores = [line[1][1] for line in result]
im_show = draw_ocr(image, boxes, txts, scores, font_path='doc/fonts/korean.ttf')
im_show = Image.fromarray(im_show)
im_show.save('result.jpg')
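
Once the number is recognized, choosing which boxes to mask can be sketched roughly as follows. This assumes the result shape used above, where each line is [box, (text, score)] (other PaddleOCR versions may nest results differently), and it assumes the Resident Registration Number is written as 6 digits, a hyphen, and 7 digits. `RRN_PATTERN`, `box_to_rect`, `find_sensitive_rects`, and the sample data are all made up for illustration.

```python
import re

# Assumption: a Resident Registration Number looks like 6 digits, a hyphen, 7 digits.
# Optional spaces around the hyphen tolerate small OCR spacing errors.
RRN_PATTERN = re.compile(r'\d{6}\s*-\s*\d{7}')


def box_to_rect(box):
    """Axis-aligned rectangle (x1, y1, x2, y2) that encloses a 4-point OCR box."""
    xs = [int(p[0]) for p in box]
    ys = [int(p[1]) for p in box]
    # Using min/max over all four corners also covers slightly rotated boxes.
    return min(xs), min(ys), max(xs), max(ys)


def find_sensitive_rects(result, pattern=RRN_PATTERN):
    """Collect rectangles for every OCR line whose text matches `pattern`."""
    rects = []
    for box, (text, _score) in result:
        if pattern.search(text):
            rects.append(box_to_rect(box))
    return rects


# Hypothetical OCR output, not taken from a real run.
sample = [
    [[[10.0, 20.0], [210.0, 22.0], [209.0, 52.0], [9.0, 50.0]], ('900101-1234567', 0.93)],
    [[[10.0, 60.0], [300.0, 60.0], [300.0, 90.0], [10.0, 90.0]], ('Gangnam-gu Seoul Korea', 0.88)],
]

rects = find_sensitive_rects(sample)
print(rects)
```

Each returned tuple can be passed to cv2.rectangle as (x1, y1), (x2, y2). Matching on a digit pattern rather than on any text containing '-' keeps address lines such as "Gangnam-gu" from being masked.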

LDOUBLEV avatar Aug 03 '22 08:08 LDOUBLEV

I tried it on 2.5.0 but it did not work for me. Did you test with my code?

khawar-islam avatar Aug 03 '22 08:08 khawar-islam