
Differences between command line and wrapper

Open nelero opened this issue 4 years ago • 1 comments

Hi, I use tesseract.net to extract text from PDFs and pull out some invoice information. However, recognition sometimes makes mistakes in numbers, even when the PDF quality is OK. So I saved the bitmap produced by the PDF => bitmap conversion and ran tesseract on the same file from the command line, and the command line gets it right. I used tesseract-ocr-w64-setup-v4.1.0.20190314.exe for the command line. Version:

>tesseract --version
tesseract v4.0.0.20190314
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.2.0
 Found AVX2
 Found AVX
 Found SSE

When I run tesseract 20210617170848-1.bmp output1.txt -l fra, the numbers in the txt file are correct, but with this code

using (var engine = new TesseractEngine(Server.MapPath(@"~/tessdata"), "fra", EngineMode.Default))
{
    // have to load Pix via a bitmap since Pix doesn't support loading a stream.
    using (var image = new System.Drawing.Bitmap(img))
    {
        using (var pix = PixConverter.ToPix(image))
        {
            using (var page = engine.Process(pix))
            {
                content = page.GetText();
            }
        }
    }
}

I get different recognition results, with some mistakes. How can I fix this? Thanks.

nelero avatar Jun 17 '21 15:06 nelero

Looks like you're comparing different versions of tesseract (we use 4.1, not 4.0). The build of the library we use also has AVX instructions etc. disabled for compatibility with older machines. I'm not sure whether this would affect the results, though.
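
To confirm the version difference on your side, something like the following could print the engine version the wrapper reports, for comparison with the `tesseract --version` output from the command line. This is only a minimal sketch: it assumes the wrapper exposes the native version string through a `Version` property on `TesseractEngine`, and the `VersionCheck` class name and the `./tessdata` default path are placeholders, not anything from the original code.

```csharp
using System;
using Tesseract;

class VersionCheck
{
    static void Main(string[] args)
    {
        // Placeholder path: pass the real tessdata folder as the first argument.
        string tessdataPath = args.Length > 0 ? args[0] : "./tessdata";

        // Same language and engine mode as in the snippet from the question.
        using (var engine = new TesseractEngine(tessdataPath, "fra", EngineMode.Default))
        {
            // Assumption: Version returns the version string of the native library
            // the wrapper loaded. Compare it with `tesseract --version`.
            Console.WriteLine("Wrapper engine version: " + engine.Version);
        }
    }
}
```

If the two versions differ, the results are not directly comparable until both sides run the same tesseract version.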


charlesw avatar Jun 17 '21 20:06 charlesw