Uwe Hartwig

Results 61 comments of Uwe Hartwig

@stweil Guess what! The original image works, too, if I scale it down 1:4! Sorry, I didn't realize the link also expires after a *single* download. I'm uploading again...

@stweil Strange indeed. I cropped *only the header* of the page and left the left, right and bottom margins as they are, and this version works fine. [0046-headless.zip](https://send.firefox.com/download/8fcf921e62888b27/#3DwmaFmvDpK229oYA_CEsg)

@zdenop Since the original image itself looks dark somehow, I called ImageMagick 6.9 to enhance contrast: `convert 0046.tif -brightness-contrast 25x50 -compress none -colorspace Gray 0046-convert.tif` This version, without taking care...
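
As a minimal sketch, the ImageMagick call quoted above could be wrapped for batch use as follows. The helper name `build_contrast_command` and its parameters are hypothetical; only the command-line options themselves are taken from the comment.

```python
import shlex


def build_contrast_command(src, dst, brightness=25, contrast=50):
    """Build the argv for the ImageMagick call quoted in the comment.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    return [
        "convert", src,
        "-brightness-contrast", f"{brightness}x{contrast}",
        "-compress", "none",
        "-colorspace", "Gray",
        dst,
    ]


cmd = build_contrast_command("0046.tif", "0046-convert.tif")
# Print the command instead of running it; to execute it, pass `cmd`
# to subprocess.run(cmd, check=True) on a machine with ImageMagick 6.
print(shlex.join(cmd))
```

Building the argument list separately from executing it makes it easy to check the exact command before running it over a few hundred scans.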

@zdenop This is what exiftool outputs. Original: Megapixels 79.1, image size 7477x10584, file size 79151624 bytes. Convert: Megapixels 79.1, image size 7477x10584, file size 79151434 bytes. So the raw file sizes differ slightly.
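
The figures reported by exiftool can be cross-checked with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, and reading the small gap between file size and pixel count as "one byte per pixel plus metadata" is an assumption, not something the comment states.

```python
def megapixels(width, height):
    """Pixel count in millions, as exiftool reports it."""
    return width * height / 1_000_000


width, height = 7477, 10584
original_bytes = 79151624   # file size of the original scan
converted_bytes = 79151434  # file size after the ImageMagick conversion

pixel_count = width * height
size_delta = original_bytes - converted_bytes
# Bytes left over if every pixel took exactly one byte (assumption:
# uncompressed 8-bit grayscale); the remainder would be headers and tags.
overhead_bytes = original_bytes - pixel_count

print(round(megapixels(width, height), 1))  # 79.1
print(size_delta)                           # 190
print(overhead_bytes)                       # 15056
```

The two files differ by only 190 bytes, so under that assumption the difference would lie in the metadata rather than in the amount of pixel data.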

@zdenop Did you also run Tesseract with the original file (without cropping or other types of preprocessing)? What was the outcome? [0046-convert.zip](https://send.firefox.com/download/b0f41cbf7c10ddb2/#6XX1xd0GpfhP2z_4W7Fimg)

@stweil Many thanks! By now I've already detected 200+ scans that Tesseract considers empty. Therefore I'll try your suggestion in our ULB fork and report back, hopefully next week!

@stweil Sorry for the delay! I just took a quick shot at a single page and it did produce textlines which is per se good but forget about the quality....

For one of the problematic images I got: ``` /data/ocr-staging/ocr/1667524704_J_0190/0655.tif => 1667524704_J_0190_0655 => /data/ocr-staging/ocr/empty-pages/1667524704_J_0190_0655 Tesseract Open Source OCR Engine v5.0.0-alpha-754-g0838 with Leptonica Page 1 Detected 7102 diacritics index >= 0...

I've run some larger tests with the patch @stweil provided, with the following results: From 133 images * 6 images produce the mentioned assertion error (`index >= 0 && index <...

@stweil I will run the patch against the test set of 130+ images and report back early next week.