tesseract
Differences between command line and wrapper
Hi, I use tesseract.net to extract text from PDFs and pull out some invoice information. However, recognition sometimes makes mistakes in numbers, even when the PDF quality is fine. So I saved the bitmap produced by the PDF => bitmap conversion and ran command-line tesseract on that same file, and the command-line result is correct. I installed the command-line version with tesseract-ocr-w64-setup-v4.1.0.20190314.exe. Version:
>tesseract --version
tesseract v4.0.0.20190314
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.2.0
Found AVX2
Found AVX
Found SSE
When I run tesseract 20210617170848-1.bmp output1.txt -l fra, the numbers in the txt file are correct, but with this code:
using (var engine = new TesseractEngine(Server.MapPath(@"~/tessdata"), "fra", EngineMode.Default))
{
    // have to load Pix via a bitmap since Pix doesn't support loading a stream.
    using (var image = new System.Drawing.Bitmap(img))
    {
        using (var pix = PixConverter.ToPix(image))
        {
            using (var page = engine.Process(pix))
            {
                content = page.GetText();
            }
        }
    }
}
I get different recognized content, with some mistakes. How can I fix this? Thanks.
Looks like you're comparing different versions of tesseract (we use 4.1, not 4.0). The version of the library we use also has AVX instructions etc. disabled for compatibility with older machines. Not sure whether this would affect the results, though.
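One way to narrow this down, as a rough sketch rather than a confirmed fix: print the version the wrapper reports, and feed the engine the exact .bmp file that the command line read, loaded with Pix.LoadFromFile instead of going through System.Drawing.Bitmap and PixConverter. If the text then matches the command-line output, the bitmap conversion is the likely culprit; if it still differs, the engine version or the fra.traineddata file in use are more plausible causes. The class name OcrComparison, the helper RecognizeFromFile, and the two command-line arguments are hypothetical names used only for this illustration.

```csharp
using System;
using Tesseract;

static class OcrComparison
{
    // Hypothetical helper (not part of the wrapper): runs OCR on an image file
    // that Leptonica loads directly from disk, bypassing System.Drawing.Bitmap.
    static string RecognizeFromFile(string tessdataPath, string imagePath)
    {
        using (var engine = new TesseractEngine(tessdataPath, "fra", EngineMode.Default))
        {
            // Native library version, for comparison with `tesseract --version`.
            Console.WriteLine("Wrapper engine version: " + engine.Version);

            using (var pix = Pix.LoadFromFile(imagePath))
            {
                // Bit depth and resolution are worth comparing against the Pix
                // produced by PixConverter.ToPix for the same image.
                Console.WriteLine("Depth: {0} bpp, resolution: {1}x{2}",
                    pix.Depth, pix.XRes, pix.YRes);

                using (var page = engine.Process(pix))
                {
                    return page.GetText();
                }
            }
        }
    }

    static void Main(string[] args)
    {
        // Assumed usage: args[0] = tessdata folder, args[1] = the saved .bmp file.
        Console.WriteLine(RecognizeFromFile(args[0], args[1]));
    }
}
```

Diffing this output against the command-line output1.txt for the same file should show whether the discrepancy survives once the image-loading path is identical.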