jbarlow comments

Results: 491 comments of jbarlow

Output PDF is getting distorted on each ocrmypdf command.

@DEEPAK-KESWANI I just did `ocrmypdf -f ~searchable-text-issue1-1-1.pdf _.pdf`. I was just replicating what you did, to see how Tesseract 4 changed the results. `ocrmypdf --redo-ocr`... will produce a smaller file. That's...

Output PDF is getting distorted on each ocrmypdf command.

@ajab21 Tesseract does indeed have a known issue with background shading. The underlying issue is that it converts all inputs to 1-bit monochrome first using a thresholder, and currently this...
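To illustrate why background shading can hurt a thresholder, here is a minimal sketch in plain Python. It assumes a fixed global cutoff of 128 and made-up pixel values; it is not Tesseract's actual thresholding code, and the function name `threshold_global` is hypothetical.

```python
def threshold_global(gray_rows, cutoff=128):
    """Map each 0-255 grayscale pixel to 1 (ink) if darker than cutoff, else 0."""
    return [[1 if px < cutoff else 0 for px in row] for row in gray_rows]


# Dark text (value 40) on a near-white background (value 250).
clean = [[250, 40, 250, 40, 250]]

# The same text on a shaded background (value 110, an assumed example).
shaded = [[110, 40, 110, 40, 110]]

clean_bits = threshold_global(clean)
shaded_bits = threshold_global(shaded)

print(clean_bits)   # text pixels are separated from the background
print(shaded_bits)  # background and text both become ink
```

With these assumed values the shaded row collapses to solid ink, so the text is no longer distinguishable after conversion to 1-bit monochrome.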

Output PDF is getting distorted on each ocrmypdf command.

Thank you for your contribution; it is very much appreciated. > Regarding the dilemma with Page 4, we're not having the same issues when running OCR using a different program like...

Output PDF is getting distorted on each ocrmypdf command.

Use `--optimize 0` and `--output-type pdf` to disable recompression and decompression. Image resolution never changes by default, but recompression can occur. On Sun., Apr. 26, 2020, 13:30 Laurent Meyer, wrote: >...
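For anyone scripting this, the two flags mentioned above could be assembled into a command line as sketched below. The helper name `build_ocrmypdf_command` and the file names `input.pdf` and `output.pdf` are placeholders, not anything from the original report; only the flags themselves come from the comment.

```python
import shlex


def build_ocrmypdf_command(input_pdf, output_pdf):
    """Return an argument list that passes --optimize 0 and --output-type pdf."""
    return [
        "ocrmypdf",
        "--optimize", "0",        # flag taken from the comment above
        "--output-type", "pdf",   # flag taken from the comment above
        input_pdf,
        output_pdf,
    ]


# Placeholder file names, assumed for illustration only.
cmd = build_ocrmypdf_command("input.pdf", "output.pdf")
print(shlex.join(cmd))

# To actually run it (requires ocrmypdf to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the arguments as a list rather than a single string avoids shell quoting problems if a file name contains spaces.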

Option to remove blank pages

I'm close to releasing a new version most of which is in the `api` branch which could (should) hopefully make this sort of thing easier since there will be a...

Option to remove blank pages

@enterframe That message simply says that too few characters were recognized on a particular page, so Tesseract assumed that none of them were valid. It did not stop the process, it...

Option to remove blank pages

I agree - digital blank has significant advantages in most cases. If only there were a reliable algorithm for blank page detection.... I think it may be a machine learning...
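As a rough illustration of why blank page detection is hard to make reliable, here is a naive ink-coverage heuristic in plain Python. The function names, the 1% ink limit, and the synthetic pages are all assumptions for the sake of the example; this is not an OCRmyPDF feature.

```python
def ink_fraction(gray_rows, cutoff=128):
    """Fraction of pixels darker than cutoff in a 0-255 grayscale page."""
    total = sum(len(row) for row in gray_rows)
    dark = sum(1 for row in gray_rows for px in row if px < cutoff)
    return dark / total if total else 0.0


def looks_blank(gray_rows, max_ink=0.01, cutoff=128):
    """Naive rule: call the page blank if at most max_ink of it is dark."""
    return ink_fraction(gray_rows, cutoff) <= max_ink


def white_page(size=100):
    return [[255] * size for _ in range(size)]


blank_page = white_page()

# Ten rows with 80 dark pixels each: 800 of 10000 pixels are ink.
paragraph_page = white_page()
for y in range(20, 30):
    for x in range(10, 90):
        paragraph_page[y][x] = 0

# Only a tiny page number: 5 dark pixels out of 10000.
page_number_only = white_page()
for x in range(48, 53):
    page_number_only[95][x] = 0

print(looks_blank(blank_page))
print(looks_blank(paragraph_page))
print(looks_blank(page_number_only))  # real content, but under the ink limit
```

The last page carries real content yet falls under the ink limit, while scanner speckle on a truly blank page could push it over. A fixed threshold cannot separate those cases.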

Text layer not aligned with original document

The imgur link didn't work unfortunately. Please upload a PDF. It's very unlikely I will be able to investigate this issue without an example input PDF.

Text layer not aligned with original document

The alternative renderer `--pdf-renderer hocr` does a better job here. Note that it has issues with non-Latin text. I believe this is a regression in Tesseract 4.1.0. Related past issue:...

Text layer not aligned with original document

Inserting spaces between words is the job of the PDF viewer due to unfortunate design decisions in the early days of PDF. macOS Preview also does a particularly poor job...