Better results with lang 'OCRB'

Open uwolfer opened this issue 6 years ago • 12 comments

I have played around a bit with this tool and got much better results with lang 'OCRB' which can be fetched from here: https://github.com/Exteris/tesseract-mrz/tree/master/lang

After adding it to the tesseract data dir, I added -l OCRB to the following line: https://github.com/konstantint/PassportEye/blob/master/passporteye/util/ocr.py#L30

Just wanted to let you know - you probably want to include this by default into this tool or add a hint to the README.

uwolfer avatar Mar 09 '18 23:03 uwolfer

Thanks. I've long wanted to try an OCRA/B font-specific recognizer but never found the time to train one. It's natural that it should work better.

konstantint avatar Mar 10 '18 15:03 konstantint

I am getting mixed results. I see an average 40% speedup on top of a substantial increase in correctness.

However, it seems to be more prone to hallucinating 3 | S | Z characters instead of the < filler characters. The current recognizer, on the other hand, seems to have trouble with H vs M, so I would still consider it an improvement, since the filler characters are easier to eliminate using some heuristics.

Also, there are obviously some (different) outliers, as some data dropped from valid=100 to completely invalid. Unfortunately this is production data, so I cannot post it.

lauri-elevant avatar Mar 23 '18 12:03 lauri-elevant
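One possible heuristic of the kind mentioned above, as a minimal sketch and not something PassportEye does itself. It assumes the line being repaired is the name line (first line) of a passport-style MRZ, where name parts are separated by at most two `<` characters, so a run of three fillers can only be trailing padding. The function name `repair_name_padding` is made up for illustration. It should not be applied to the second MRZ line, where data and check digits can legitimately follow a run of fillers.

```python
def repair_name_padding(line: str, filler: str = "<") -> str:
    """Force everything after the first run of three fillers to be filler.

    Assumption: `line` is the name line of the MRZ, so anything after
    '<<<' is padding and stray '3', 'S' or 'Z' characters are misreads.
    """
    start = line.find(filler * 3)
    if start == -1:
        return line  # no padding run found, leave the line untouched
    # Keep the part before the padding and rebuild the padding itself,
    # preserving the original line length.
    return line[:start] + filler * (len(line) - start)


name_part = "P<UTOERIKSSON<<ANNA<MARIA"
noisy = name_part + "<<<<S<<<3<<<<<Z<<<<"
print(repair_name_padding(noisy))
```

A misread that lands before the first `<<<` (for example `MARIA<S<<<<`) is not caught by this, so it only removes the easy cases.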

We are also changing the -l parameter (in our case, we use eng+spa+OCRB). Since this seems a common need, maybe it should be an extra parameter so the user can choose what to use there.

albertvaka avatar Jul 12 '18 10:07 albertvaka

Hello. It would be good to add the option to add extra parameters to the ocr.py

ionspicica avatar Oct 19 '18 12:10 ionspicica

Makes sense, indeed. I'll try not to forget to add it by tomorrow (or feel free to PR - it's an easy option for a hacktoberfest contribution after all, DO still has those t-shirts, right?)

konstantint avatar Oct 19 '18 12:10 konstantint

Has somebody already tried using this new traineddata? https://github.com/Shreeshrii/tessdata_ocrb

uwolfer avatar Jun 10 '19 19:06 uwolfer

@uwolfer I have tried the data mentioned above, but I am getting poorer results than before. Previously read_mrz gave more accurate results; with this data it often returns incorrect values.

hunaidkhan2000 avatar Jun 30 '20 09:06 hunaidkhan2000

How do I try a .traineddata file with PassportEye? I am completely lost; I want to read the MRZ region. I am a newbie with Tesseract OCR. Please help, a code snippet would work like a charm. I am using a Mac. This is my current code. I have downloaded the .traineddata file and put it in the tessdata folder along with the default .traineddata files. How do I use it in this code?

import os
from passporteye import read_mrz

pr_path = os.getcwd()
file_path = os.path.join(pr_path,'my_app', 'data')
mrz = read_mrz(file_path + '/test1.jpg') 

print(mrz)

kmanadkat avatar Aug 11 '20 05:08 kmanadkat

Hi @kmanadkat
Say you have downloaded the OCRB pretrained data: you just need to specify it when calling read_mrz, like this:

mrz = read_mrz('abu_2.jpg', extra_cmdline_params='-l OCRB')

You can change OCRB to the name of your own pretrained data. Also make sure the .traineddata file is in Tesseract's tessdata folder.

hunaidkhan2000 avatar Aug 11 '20 06:08 hunaidkhan2000
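For combined languages such as the eng+spa+OCRB setup mentioned earlier in the thread, the same `extra_cmdline_params` argument can be built from a list. This is only a sketch: the helper names `lang_params` and `read_mrz_with_langs` are hypothetical and not part of PassportEye, and every language listed is assumed to have a matching .traineddata file in the tessdata folder.

```python
def lang_params(langs):
    """Build a Tesseract '-l' argument such as '-l eng+spa+OCRB'."""
    if not langs:
        raise ValueError("at least one language is required")
    return "-l " + "+".join(langs)


def read_mrz_with_langs(image_path, langs):
    """Call PassportEye's read_mrz with the given Tesseract languages."""
    # Imported lazily so lang_params can be used without PassportEye installed.
    from passporteye import read_mrz
    return read_mrz(image_path, extra_cmdline_params=lang_params(langs))


print(lang_params(["eng", "spa", "OCRB"]))
```

Usage would then look like `read_mrz_with_langs('abu_2.jpg', ['OCRB'])`, which is equivalent to the call shown above.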

@hunaidkhan2000 Thanks for helping out, but I started getting bad results with this. I am now using pytesseract directly and extracting the MRZ info with it; in that case I am getting good accuracy. Something like this:

text = pytesseract.image_to_string(img_path, lang="OCRB")

PassportEye needs a lot of improvement. PS: @kmanadkat is my other git account ;)

xi1570-krupeshanadkat avatar Aug 13 '20 02:08 xi1570-krupeshanadkat

@krupeshxebia Yes, I commented the same thing: the results get worse if I use OCRB. The data needs to be retrained, or tessdata needs to be updated by Google, to give us better results.

hunaidkhan2000 avatar Aug 13 '20 10:08 hunaidkhan2000

> I have played around a bit with this tool and got much better results with lang 'OCRB' which can be fetched from here: https://github.com/Exteris/tesseract-mrz/tree/master/lang
>
> After adding it to tesseract data dir, I have added to the following line -l OCRB: https://github.com/konstantint/PassportEye/blob/master/passporteye/util/ocr.py#L30
>
> Just wanted to let you know - you probably want to include this by default into this tool or add a hint to the README.

I am struggling to do this - can you provide more explicit instructions on how to extract this to the tesseract folder?

leedrake5 avatar Nov 25 '21 17:11 leedrake5
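A rough sketch of the "adding it to the tesseract data dir" step, assuming the file has already been downloaded (for example OCRB.traineddata from the repository linked in the first post). The location of the tessdata folder differs between installations, so it is passed in explicitly here; the helper name `install_traineddata` and the paths in the usage comment are placeholders, not real defaults.

```python
import shutil
from pathlib import Path


def install_traineddata(src, tessdata_dir):
    """Copy a .traineddata file into an existing tessdata folder."""
    src = Path(src)
    tessdata_dir = Path(tessdata_dir)
    if src.suffix != ".traineddata":
        raise ValueError(f"not a .traineddata file: {src}")
    if not tessdata_dir.is_dir():
        raise FileNotFoundError(f"tessdata folder not found: {tessdata_dir}")
    dest = tessdata_dir / src.name
    shutil.copy(src, dest)  # may need elevated permissions for system folders
    return dest


# Example call with placeholder paths; adjust both to your own system:
# install_traineddata("Downloads/OCRB.traineddata", "/path/to/tessdata")
```

Afterwards, running `tesseract --list-langs` should show OCRB in the list, and `-l OCRB` can be passed through `extra_cmdline_params` as described earlier in the thread.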