Wrong coordinates with .box when chi_tra_vert*.traineddata is used
Hi, many thanks for this fantastic work to all of you! I am here to report some weird behavior of the coordinates when chi_tra_vert_*.traineddata is used.
tesseract 4.1.0
leptonica-1.78.0
libgif 5.1.4 : libjpeg 9c : libpng 1.6.37 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.2 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found SSE
Found libarchive 3.3.3 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.6
ProductName: Mac OS X
ProductVersion: 10.13.6
BuildVersion: 17G65
- tesseract with makebox sets every character's X coordinates and width to zero:
tesseract [--oem 1] chi_tra_vert_test_1.jpg chi_tra_vert_1_test -l chi_tra_vert makebox
- tesseract with lstmbox failed:
tesseract [--oem 1] chi_tra_vert_test_2.jpg chi_tra_vert_2_test -l chi_tra_vert lstmbox
And here are my questions:
- Why do I get wrong coordinates?
- Why are the recognized characters right while their coordinates are wrong?
- Unrelated to the weird cases above: I noticed that vertical Chinese text is only supported by the 4.x versions, and that the 4.x versions only have line-level bounding boxes as their labeled data. How can tesseract recognize the individual characters within a line?
- I noticed that there is no GPU training method. It is a little disturbing to train an LSTM-based neural network on CPUs, so any experience (dataset size, training time, etc.) would really help!
Best!
tesseract -v
tesseract 5.0.0-alpha-20201231-111-ge1b9
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found NEON
Found OpenMP 201511
Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
Found libcurl/7.58.0 NSS/3.35 zlib/1.2.11 libidn2/2.0.4 libpsl/0.19.1 (+libidn2/2.0.4) nghttp2/1.30.0 librtmp/2.3
Cropped version of first image above and results of makebox, lstmbox and wordstrbox
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract chi_tra.jpg chi_tra -l chi_tra_vert wordstrbox
Tesseract Open Source OCR Engine v5.0.0-alpha-20201231-111-ge1b9 with Leptonica
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract chi_tra.png - -l chi_tra_vert wordstrbox
WordStr 75 2 118 245 0 #保定 易 州 查 學
119 2 123 245 0
WordStr 6 0 47 332 0 #前 半球 後 十 八 日 即
48 0 52 332 0
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract chi_tra.png - -l chi_tra_vert lstmbox
保 75 2 123 245 0
定 75 2 123 245 0
75 2 123 245 0
易 75 2 123 245 0
75 2 123 245 0
州 75 2 123 245 0
75 2 123 245 0
查 75 2 123 245 0
75 2 123 245 0
學 75 2 123 245 0
75 2 123 245 0
前 6 0 52 332 0
6 0 52 332 0
半 6 0 52 332 0
球 6 0 52 332 0
6 0 52 332 0
後 6 0 52 332 0
6 0 52 332 0
十 6 0 52 332 0
6 0 52 332 0
八 6 0 52 332 0
6 0 52 332 0
日 6 0 52 332 0
6 0 52 332 0
即 6 0 52 332 0
6 0 52 332 0
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract chi_tra.png - -l chi_tra_vert makebox
保 0 76 0 118 0
定 0 75 0 116 0
易 0 66 0 122 0
州 0 66 0 122 0
查 0 66 0 122 0
學 0 66 0 122 0
前 0 9 0 47 0
半 0 6 0 47 0
球 0 8 0 47 0
後 0 0 0 51 0
十 0 0 0 51 0
八 0 0 0 51 0
日 0 0 0 51 0
即 0 0 0 51 0
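For anyone triaging similar output, here is a minimal sketch that parses box lines and flags zero-width boxes. It assumes the usual box-file column order (symbol, left, bottom, right, top, page) and only handles makebox/lstmbox style lines, not WordStr lines. The names Box, parse_box_lines and zero_width are hypothetical helpers, not part of tesseract.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(text: str) -> List[Box]:
    """Parse 'symbol left bottom right top page' lines; skip anything else."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 5:
            # The symbol was whitespace (space/tab) and was consumed by split().
            parts = [" "] + parts
        if len(parts) != 6:
            continue  # e.g. WordStr lines or banner text
        try:
            left, bottom, right, top, page = (int(p) for p in parts[1:])
        except ValueError:
            continue
        boxes.append(Box(parts[0], left, bottom, right, top, page))
    return boxes


def zero_width(boxes: List[Box]) -> List[Box]:
    """Return boxes whose horizontal extent is empty."""
    return [b for b in boxes if b.right - b.left <= 0]


# Three lines copied from the makebox output above.
sample = "保 0 76 0 118 0\n定 0 75 0 116 0\n前 0 9 0 47 0"
boxes = parse_box_lines(sample)
print(len(zero_width(boxes)), "of", len(boxes), "boxes have zero width")
```

Run against the full makebox output above, every box is flagged, which matches the symptom reported in the first post.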
I ran into the same issue using the pytesseract wrapper.
tesseract -v:
tesseract 4.1.3
leptonica-1.81.1
libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.0) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2
Found AVX
Found SSE
pip freeze:
pytesseract==0.3.9
Input img:
Code:
import pytesseract
import cv2
from PIL import Image

img = cv2.imread("img.png")
# OpenCV loads images as BGR; convert to RGB before handing them to PIL
img_conv = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
img_pil = Image.fromarray(img_conv)
height = img.shape[0]
width = img.shape[1]
d = pytesseract.image_to_boxes(img_pil, lang="chi_tra_vert", config=' -c tessedit_create_boxfile=1 --dpi 100 --tessdata-dir ./', output_type=pytesseract.Output.DICT)
print(d)
for i in range(0, len(d["left"])):
    (text, x1, y2, x2, y1) = (d['char'][i], d['left'][i], d['top'][i], d['right'][i], d['bottom'][i])
    # Box coordinates are measured from the bottom-left corner, so flip y for OpenCV
    cv2.rectangle(img, (x1, height - y1), (x2, height - y2), (0, 255, 0), 2)
cv2.imshow('img', img)
cv2.waitKey(0)
output:
{'char': ['國', '之', '章', '、', '藍', '英', '國', '下', '繁', '始', '比', '坊', '和', '好', '疏', '策', '、', '不', '過', '對', '於', '新', '本'], 'left': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'bottom': [70, 70, 62, 62, 62, 70, 70, 62, 39, 37, 42, 30, 40, 40, 39, 7, 15, 11, 10, 9, 7, 0, 8], 'right': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'top': [93, 95, 99, 99, 99, 95, 95, 99, 60, 64, 61, 70, 60, 60, 60, 30, 26, 31, 33, 31, 31, 38, 32], 'page': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
pytesseract.image_to_boxes uses makebox as well. Any idea why left and right end up being all-zero? Also, top and bottom coordinates are obviously incorrect as well. I tested the same code for an input image with English text and lang="eng" and it worked perfectly fine.
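In case it helps with reproducing this, below is a small sketch for checking the returned dict before drawing anything. It assumes the dict layout shown in the output above (keys char, left, bottom, right, top) and that box coordinates are measured from the bottom-left corner of the image. degenerate_indices and to_cv_rect are hypothetical helper names, not pytesseract functions, and the image height of 100 is made up for the example.

```python
def degenerate_indices(d):
    """Indices of boxes with no horizontal or no vertical extent."""
    return [
        i
        for i in range(len(d["char"]))
        if d["right"][i] <= d["left"][i] or d["top"][i] <= d["bottom"][i]
    ]


def to_cv_rect(d, i, img_height):
    """Convert box i from bottom-left-origin coordinates to OpenCV corner points."""
    top_left = (d["left"][i], img_height - d["top"][i])
    bottom_right = (d["right"][i], img_height - d["bottom"][i])
    return top_left, bottom_right


# First three entries of the output above.
d = {
    "char": ["國", "之", "章"],
    "left": [0, 0, 0],
    "bottom": [70, 70, 62],
    "right": [0, 0, 0],
    "top": [93, 95, 99],
    "page": [0, 0, 0],
}

bad = degenerate_indices(d)
print(f"{len(bad)} of {len(d['char'])} boxes are degenerate")
print(to_cv_rect(d, 0, img_height=100))  # img_height is a made-up value
```

With the chi_tra_vert output every index is reported, while a healthy result (such as the lang="eng" run) should give an empty list.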