
different result with tesseract cli and leptess wrapper

Open tcastelly opened this issue 3 years ago • 5 comments

Hello,

Thank you for this work!

I'm seeing some curious behavior. When I try to extract the text from the image below on the command line:

time tesseract image.jpg output  

I get this result:

Coco Adel

But when I use the wrapper:

fn main() {
    let mut lt = leptess::LepTess::new(Some("./tests"), "eng").unwrap();
    // let mut lt = leptess::LepTess::new(None, "eng").unwrap();
    lt.set_image("image.jpg");
    println!("{}", lt.get_utf8_text().unwrap());
}

I get:

rh

I've tried using the traineddata from this repository, and also passing no data path at all, but the result is the same.

Maybe the command line uses different default parameters.

Thanks in advance

image

tcastelly avatar Jan 25 '22 10:01 tcastelly

How did you install tesseract and libtesseract? What version of tesseract do you have?

houqp avatar Jan 25 '22 16:01 houqp

Thank you for your answer.

I'm on Arch Linux. I installed:

pacman -S tesseract leptonica tesseract-data-eng
tesseract 4.1.1
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.5.2 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.5.0

tcastelly avatar Jan 25 '22 20:01 tcastelly

My tesseract was installed through Fedora's dnf install tesseract command

tesseract 4.1.3
 leptonica-1.81.1
  libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.0) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2
 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found FMA
 Found SSE

My tesseract command gives the expected Coco Adel output. Through leptess, I also get rh\n.

Converting the image to a PNG changed leptess's output slightly, to "Nr\n".

I created a new image with the same resolution and similarly sized text, and leptess was able to parse it correctly. issue_41

I don't know why the command and API have different behaviour on your image. It may be worth checking to see if the command sets any additional options.

ccouzens avatar Jan 30 '22 17:01 ccouzens

Yeah, most likely the command line uses a different set of default options :(

houqp avatar Jan 30 '22 18:01 houqp

The default page seg mode for leptess is 6 (single block), whereas the default for the tesseract CLI is 3 (auto).

Setting this variable manually gives the same result as the CLI:

lt.set_variable(Variable::TesseditPagesegMode, "3").unwrap();

~~So maybe the default page seg mode for leptess should be set to 3, to be consistent with tesseract and to keep people from getting unexpected results.~~

FYI, the CLI sets the default page seg mode to PSM_AUTO:

https://github.com/tesseract-ocr/tesseract/blob/be15b46c609e6d50f1665345d6e6fc128462593c/src/tesseract.cpp#L650

But the library defaults to PSM_SINGLE_BLOCK:

https://github.com/tesseract-ocr/tesseract/blob/be15b46c609e6d50f1665345d6e6fc128462593c/include/tesseract/publictypes.h#L166
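To make the magic numbers less opaque, here is a minimal std-only sketch that gives names to the page seg modes discussed above. The `PageSegMode` enum and the `as_variable_value` helper are hypothetical names made up for illustration and are not part of leptess. The numeric values follow Tesseract's `PageSegMode` enum, and the leptess call appears only in a comment, assuming `set_variable` takes the value as a string, as in the snippet earlier in this comment.

```rust
/// Hypothetical helper type (not part of leptess): a subset of Tesseract's
/// page segmentation modes with the numeric values that the
/// `tessedit_pageseg_mode` variable expects.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum PageSegMode {
    /// PSM_AUTO: fully automatic page segmentation (the CLI default).
    Auto = 3,
    /// PSM_SINGLE_BLOCK: treat the image as one uniform block of text
    /// (the library default).
    SingleBlock = 6,
    /// PSM_SINGLE_LINE: treat the image as a single text line.
    SingleLine = 7,
}

impl PageSegMode {
    /// Render the mode as the decimal string a string-valued setter expects.
    fn as_variable_value(self) -> String {
        (self as i32).to_string()
    }
}

fn main() {
    let modes = [
        PageSegMode::Auto,
        PageSegMode::SingleBlock,
        PageSegMode::SingleLine,
    ];
    for mode in modes.iter() {
        println!("{:?} -> \"{}\"", mode, mode.as_variable_value());
    }

    // Assumed usage with leptess, mirroring the call shown in this comment:
    // lt.set_variable(
    //     Variable::TesseditPagesegMode,
    //     &PageSegMode::Auto.as_variable_value(),
    // ).unwrap();
}
```

Going the other way, running the CLI with `--psm 6` should reproduce what the library does by default, which can help confirm that the page seg mode is the only difference.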

ongchi avatar May 18 '22 02:05 ongchi