OCRmyPDF issues

Azure OCR with OCRmyPDF

18

OCRmyPDF works great with PDFs containing scanned images. However, with handwritten letters, the Tesseract OCR engine often struggles. How do I use the Azure OCR API as...

sandipan1

enhancement

OCRmyPDF assumes really large DPI for native PDF when rasterizing as image

1

I use OCRmyPDF for processing lots of "native" PDFs (by that I mean PDFs generated by Word, etc.). Due to some constraints, a lot of these PDFs have to be...

fabiante

Refactor code that tramples previous variable assignment (was: possible bug)

1

I may be missing something but it seems that the value of `self._has_text` set in this section of code: https://github.com/ocrmypdf/OCRmyPDF/blob/5c6030960945fe299291fa134cff35c86a644b9f/src/ocrmypdf/pdfinfo/info.py#L779-L788 is always overwritten here: https://github.com/ocrmypdf/OCRmyPDF/blob/5c6030960945fe299291fa134cff35c86a644b9f/src/ocrmypdf/pdfinfo/info.py#L804-L822

kshpytsya

robustness

hocr import / export

33

**Describe the issue** If you want to create perfect OCR output with 100% correct text, you need some editing function. For example, "gImageReader" offers some basic editing functions (but has some...

aalmir

enhancement

Option to remove blank pages

19

**Issue by [drdownload](https://github.com/drdownload)** _Thu Oct 30 08:25:16 2014_ _Originally opened as https://github.com/fritz-hh/OCRmyPDF/issues/98_ --- it would be great to have an option to remove blank pages. I scan a lot of...

OCRmyPDF-issuebot

enhancement

Added GitHub Action

6

Added a GitHub action for this project. You can find the action here. [OCR PDF Action: A GitHub action for turning scanned PDF's into searchable documents](https://github.com/MarketingPipeline/OCR-PDF-Action) :+1:

MarketingPip

Introduce a way to radically reduce the output file size (sacrificing image quality)

75

**Is your feature request related to a problem? Please describe.** My use case is "scanning" documents with a smartphone camera, then archiving those "scans" as low-quality monochrome images. But OCR...

heinrich-ulbricht

enhancement

extra space in the result pdf when the input pdf is in Chinese

20

Hi. First, sorry for my poor English. **Description** Recently I upgraded my Tesseract engine from v4.0.0.20181030 to v5.0.0-alpha.20201127 and two things happened. One is that there is a space between every single...

youhonghui

third party issue

optimize.py doesn't process images with subtype Form

1

**Describe the bug** After rearranging pages with `pdfjam` in a scanned document, the resulting file with images cannot be optimized, because the image type is unexpected (`/Form`). **To Reproduce** A...

imz

Just saying thanks!!!

14

@jbarlow83, you're amazing for putting this out here. Just wanted to drop a note to say thanks! :smile:

ericmjl
