jbarlow
@DEEPAK-KESWANI I just did `ocrmypdf -f ~searchable-text-issue1-1-1.pdf _.pdf`. I was just replicating what you did, to see how Tesseract 4 changed the results. `ocrmypdf --redo-ocr`... will produce a smaller file. That's...
@ajab21 Tesseract does indeed have a known issue with background shading. The underlying issue is that it converts all inputs to 1-bit monochrome first using a thresholder, and currently this...
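To illustrate why a 1-bit conversion struggles with background shading, here is a minimal sketch using a fixed global cutoff. This is not Tesseract's actual thresholder; the function name `threshold_global`, the cutoff of 128, and the pixel values are all assumptions chosen for illustration.

```python
def threshold_global(pixels, cutoff=128):
    """Convert 8-bit grayscale rows to 1-bit rows: 1 = white, 0 = black.

    Illustrative only: a single fixed cutoff, not Tesseract's algorithm.
    """
    return [[1 if value >= cutoff else 0 for value in row] for row in pixels]


# Hypothetical scanlines: one dark text pixel (40) between background pixels.
clean_row = [250, 40, 250]   # white background
shaded_row = [110, 40, 110]  # gray shaded background

clean_result = threshold_global([clean_row])
shaded_result = threshold_global([shaded_row])

print(clean_result)   # text pixel differs from the background
print(shaded_result)  # shading falls below the cutoff, so text and background merge
```

In the shaded case every pixel ends up black, so the text is no longer distinguishable from its background once the image is monochrome.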
Thank you for your contribution, it is very much appreciated. > Regarding the dilemma with Page 4, we're not having the same issues when running OCR using a different program like...
Use --optimize 0 and --output-type pdf to disable and decompression. Image resolution never changes by default but recompression can occur. On Sun., Apr. 26, 2020, 13:30 Laurent Meyer, wrote: >...
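For anyone scripting this, a minimal sketch that assembles a command line with the two flags named in the comment above. The helper name `build_no_optimize_command` and the file names `input.pdf` and `output.pdf` are hypothetical; the sketch only builds and prints the command and does not run it.

```python
import shlex


def build_no_optimize_command(input_pdf, output_pdf):
    """Return an ocrmypdf argument list with --optimize 0 and --output-type pdf."""
    return [
        "ocrmypdf",
        "--optimize", "0",        # flag taken from the comment above
        "--output-type", "pdf",   # flag taken from the comment above
        input_pdf,
        output_pdf,
    ]


# Hypothetical file names, for illustration only.
command = build_no_optimize_command("input.pdf", "output.pdf")
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run(command, check=True)` on a machine where `ocrmypdf` is installed.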
I'm close to releasing a new version, most of which is in the `api` branch, which could (should) hopefully make this sort of thing easier since there will be a...
@enterframe That message simply says that too few characters were recognized on a particular page, so Tesseract assumed that none of them were valid. It did not stop the process, it...
I agree - digital blank has significant advantages in most cases. If only there were a reliable algorithm for blank page detection.... I think it may be a machine learning...
The imgur link didn't work unfortunately. Please upload a PDF. It's very unlikely I will be able to investigate this issue without an example input PDF.
The alternative renderer `--pdf-renderer hocr` does a better job here. Note that it has issues with non-Latin text. I believe this is a regression in Tesseract 4.1.0. Related past issue:...
Inserting spaces between words is the job of the PDF viewer due to unfortunate design decisions in the early days of PDF. macOS Preview also does a particularly poor job...
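As a rough illustration of what "the viewer inserts the spaces" can mean, here is a sketch of one possible gap heuristic. It is an assumption for illustration, not the rule used by any particular viewer; the function name `join_glyphs`, the `gap_ratio` value, and the glyph coordinates are all hypothetical.

```python
def join_glyphs(glyphs, gap_ratio=0.3):
    """Rebuild a text line from (char, x_start, x_end) glyphs sorted left to right.

    A space is inserted when the horizontal gap between two neighbouring
    glyphs exceeds gap_ratio times the average glyph width. The ratio is an
    arbitrary illustrative value.
    """
    if not glyphs:
        return ""
    average_width = sum(x_end - x_start for _, x_start, x_end in glyphs) / len(glyphs)
    text = glyphs[0][0]
    for previous, current in zip(glyphs, glyphs[1:]):
        gap = current[1] - previous[2]
        if gap > gap_ratio * average_width:
            text += " "
        text += current[0]
    return text


# Hypothetical glyph boxes: a wide gap between "i" and "y", narrow gaps elsewhere.
spaced = [("h", 0, 10), ("i", 11, 15), ("y", 25, 35), ("o", 36, 46), ("u", 47, 57)]
tight = [("h", 0, 10), ("i", 11, 15), ("y", 17, 27), ("o", 28, 38), ("u", 39, 49)]

print(join_glyphs(spaced))
print(join_glyphs(tight))
```

With the same characters, the result depends entirely on glyph positions and the chosen ratio, which is one reason different viewers can produce different spacing when text is copied or searched.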