
Output PDF is getting distorted on each ocrmypdf command.

DEEPAK-KESWANI opened this issue 5 years ago · 15 comments

Hi,

Please see the attached image, which shows the output PDF getting more distorted with each ocrmypdf run.

[attached image: distorted_from_v1.0_to_v1.4]

FYI, we use the auto-rotate options (--rotate-pages --rotate-pages-threshold 1) only for the first version; for the subsequent versions we do not use the auto-rotate options.

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf --rotate-pages --rotate-pages-threshold 1 v_1.0.pdf v_1.1.pdf

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf v_1.1.pdf v_1.2.pdf

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf v_1.2.pdf v_1.3.pdf

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf v_1.3.pdf v_1.4.pdf

NOTE: OCRmyPDF version: 7.0.0

Could you please help me on this?

Also, if I add the --oversample 600 option to the command for each version, it works fine, but the output PDF size increases.

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf --oversample 600 --rotate-pages --rotate-pages-threshold 1 v_2.0.pdf v_2.1.pdf

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf --oversample 600 v_2.1.pdf v_2.2.pdf

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf --oversample 600 v_2.2.pdf v_2.3.pdf

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf --oversample 600 v_2.3.pdf v_2.4.pdf
 

Thanks.

DEEPAK-KESWANI avatar Nov 26 '18 05:11 DEEPAK-KESWANI

The --force-ocr option asks for the page to be rasterized, so the appearance will differ.

When using this mode, ocrmypdf tries to guess an appropriate resolution to rasterize at. In v7.0, I believe it does not take some factors into account that it should. --oversample overrides the guessed resolution with the one you specify. Rasterizing and oversampling will both increase the file size for a variety of reasons. --oversample 300 would be 25% of the size and probably acceptable in quality.
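For example (just a sketch; filenames are placeholders, and the other options should match whatever you already use):

ocrmypdf --force-ocr -l eng --output-type pdf --oversample 300 input.pdf output.pdf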

v7.3 includes significant improvements for PDFs that contain a mix of graphics and vector objects. The --redo-ocr option in the same version may also help.

I am guessing from the screenshots you provided that you work with files that need to be kept confidential, and this is quite understandable. If you are interested, we could sign a confidentiality and consulting agreement that would allow me to work directly with your files and give more precise advice.

jbarlow83 avatar Nov 27 '18 10:11 jbarlow83

Thanks a lot, the below commands worked for me.

sudo ocrmypdf --verbose 1 --skip-text -l eng --output-type pdf --rotate-pages --rotate-pages-threshold 1 v_1.0.pdf v_1.1.pdf

sudo ocrmypdf --verbose 1 --skip-text -l eng --output-type pdf v_1.1.pdf v_1.2.pdf

sudo ocrmypdf --verbose 1 --skip-text -l eng --output-type pdf v_1.2.pdf v_1.3.pdf

sudo ocrmypdf --verbose 1 --skip-text -l eng --output-type pdf v_1.3.pdf v_1.4.pdf

Now, one issue we came across is that some text in the PDFs is not searchable. Could you please guide us on how to resolve that?

Again, thanks a lot in advance.

DEEPAK-KESWANI avatar Dec 16 '18 18:12 DEEPAK-KESWANI

Hi @jbarlow83, I'm working with @DEEPAK-KESWANI on this issue. I'm following up with the attached PDF that shows the various issues where text is not being picked up as searchable. This document is a consolidation of different unrelated pages merged together to help show a variety of the forms and scan outputs. The red arrows indicate many (but not all) of the words that are not being correctly picked up and converted into searchable text after the OCRmyPDF process completes. As a reference, hard copies are being scanned in as B&W 300 DPI.

You'll note that some of the text not being picked up has little or no background noise, while other text does have noise or background shading. Still, the OCR results are inconsistent. For example, the word "workflow" with a black background is not picked up on page 1, but the same word with a black background is picked up on page 2. For reference, when using another OCR tool such as OmniPage Pro to convert the documents, the issues are much less frequent -- still some missed words here and there, but the majority of the items marked with red arrows are searchable.

Many thanks in advance for helping us to better understand and troubleshoot this big problem we're facing with searchable text.

searchable-text-issue1-1-1.pdf

ajab21 avatar Dec 17 '18 20:12 ajab21

I suspect you may not be using the most recent version of Tesseract. Using Tesseract v4.0, many of the "red arrow" areas are recognized. Note the improvement on the third page in particular: _.pdf. As far as I can tell, the issue is not a failure of ocrmypdf to mark up recognized text found by Tesseract (although if you find examples of that, I will certainly address it). That said, I think you're working in a context where high accuracy is important, and even with 4.0 it may not be enough.

Please understand that I did not write the Tesseract OCR engine used in OCRmyPDF. (The engine was written by Ray Smith and his team at Google.) OCRmyPDF rasterizes PDF pages to images using Ghostscript, uses Tesseract to perform OCR, and then merges the OCR results back into the original PDF. OCRmyPDF manages this process, taking care of many details that are difficult to get right in a format as complex as PDF. This means Tesseract is essentially a "black box": when Tesseract cannot find all of the text in an image, that is fundamentally a Tesseract problem.

What we can do is help Tesseract out by modifying the image we send to it, and I have some features in ocrmypdf that do this. For example, I can tell you that if a table border intersects a character, Tesseract will generally fail to recognize that character. So if we can filter the table border out of the image, that will improve accuracy. Another option is to train Tesseract for your data. This can help when Tesseract is finding blocks of text correctly, but recognition accuracy is poor.
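As a rough illustration of the "filter out the table borders" idea, here is a sketch using OpenCV and pytesseract. This is not what ocrmypdf does internally; the filename and kernel sizes are made up and would need tuning for real forms:

import cv2
import pytesseract

img = cv2.imread("form_page.png", cv2.IMREAD_GRAYSCALE)
_, bw = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Long horizontal and vertical strokes (table rules) found with morphological opening.
h_lines = cv2.morphologyEx(bw, cv2.MORPH_OPEN, cv2.getStructuringElement(cv2.MORPH_RECT, (60, 1)))
v_lines = cv2.morphologyEx(bw, cv2.MORPH_OPEN, cv2.getStructuringElement(cv2.MORPH_RECT, (1, 60)))

# Erase the rules from a working copy and OCR that copy; the image stored in the PDF is untouched.
cleaned = cv2.subtract(bw, cv2.bitwise_or(h_lines, v_lines))
print(pytesseract.image_to_string(255 - cleaned, lang="eng"))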

That is about as much as I can offer in the way of public advice. If you need further assistance, then we could consider a consulting agreement, because that would allow me to get into the unique specifics of the issues you're facing.

jbarlow83 avatar Dec 19 '18 23:12 jbarlow83

@jbarlow83 Many thanks! We'll double check the version of Tesseract and get back with findings/results. Yes, I see the large majority of the words noted with red arrows are being picked up in the version you posted back. That's a tremendous improvement and very promising to see!

Near perfection is not necessary in the business environment at hand (meaning some words not being picked up is not a big issue and can be tolerated). However, we were dealing with a situation where far too many words were not being picked up to the extent that search reliability was inadequate, and that was a major problem. The improvement shown in your version would bring things to an acceptable/tolerable/feasible level, although we'd like to see how to improve even more, if possible, for the few words still not being picked up.

That said, I note that the majority of words still not being picked up in the version you shared appear to have background shading. Is there a setting in Tesseract that you're aware of, or a known issue, that addresses this problem with background shading? I'm wondering if it's related at all to the comment you posted under #289. Or is there a config option in OCRmyPDF that would help here without rasterizing the original images?

Of note, we previously ran into major image degradation issues (i.e. text distortion) when we had OCRmyPDF rasterize new images for clean, deskew, etc. I believe we may have been using --force-ocr in addition to other options, and we've since taken out --force-ocr, too, per the above thread. So we made the decision to avoid modifying the original scanned images when OCR processes are run, given the resulting images were not of acceptable quality. The trade-off with this decision was less searchable text in some cases in exchange for retaining the original image quality. It was just that in some cases the rasterization was really poor, and searchability is not as important as maintaining accurate records (i.e. the images themselves).

I'm wondering if you believe there could have been another underlying issue here that might address/improve that problem (i.e. rasterization causing image degradation). Or perhaps there's an option other than Ghostscript that you'd recommend we consider. Of course, the best of both worlds is what's desired here -- that is, as many searchable terms as possible without modifying the original images. If only we could remove background pixels, filter out table borders, etc. for OCR text extraction purposes without actually changing the underlying images in the final output.

Many thanks for your excellent guidance and direction! Truly valuable. If there's a way to send you a donation/contribution, I'd be happy to do so.

ajab21 avatar Dec 20 '18 03:12 ajab21

@jbarlow83 Thanks a lot for your inputs on this. Really appreciated.

Tesseract version is 4.0 in our environment.

Thanks.

DEEPAK-KESWANI avatar Dec 20 '18 03:12 DEEPAK-KESWANI

@jbarlow83 I saw this _.pdf file and found that its size is 5 times that of the original file. Can you share the commands you executed to produce the _.pdf file?

Thanks.

DEEPAK-KESWANI avatar Dec 20 '18 16:12 DEEPAK-KESWANI

Hi @jbarlow83 just checking back to see if you have addt'l guidance to offer on my last reply and also Deepak's where he mentioned the file size in the example you sent back was 5 times larger than the original. Many thanks in advance!

ajab21 avatar Jan 07 '19 17:01 ajab21

@DEEPAK-KESWANI I just ran ocrmypdf -f ~searchable-text-issue1-1-1.pdf _.pdf. I was just replicating what you did, to see how Tesseract 4 changed the results.

ocrmypdf --redo-ocr ... will produce a smaller file. That is the version of the command you should use here, as outlined further up.
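For example (a sketch only; filenames are placeholders and the language option matches what you were already passing):

ocrmypdf --redo-ocr -l eng --output-type pdf v_1.0.pdf v_1.1.pdf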

(By the way, you shouldn't need to use sudo ocrmypdf; running it without root also protects you if you happen to process a malicious PDF.)

jbarlow83 avatar Jan 09 '19 00:01 jbarlow83

@ajab21 Tesseract does indeed have a known issue with background shading. The underlying issue is that it converts all inputs to 1-bit monochrome first using a thresholder, and currently this thresholder gets confused by high contrast backgrounds. I have a workaround in the form of the --threshold argument, which performs a different type of thresholding before the image is presented to Tesseract.

Page 4 of your document is still quite problematic because the background noise is high contrast, and there is gray text on a white background that is lighter than the background itself. Zoomed in:

[screenshot: zoomed-in view of the page 4 text]

There is no threshold (0..255) between black (0) and white (255) that will divide all of the pixels into foreground and background. It would need an algorithm fine-tuned to this particular problem.

If you are able to rescan at a higher resolution and in full color, you should see an improvement. Color adds another dimension to assist with feature separation.

I do very much appreciate your generous offer of a contribution. You may send it via PayPal to [email protected]. If you need a receipt or another transfer option, please contact me at the same email address.

jbarlow83 avatar Jan 09 '19 01:01 jbarlow83

@jbarlow83 just sent a contribution via PayPal. Hope it helps the cause. We really appreciate your guidance!

Regarding the dilemma with Page 4, we're not having the same issues when running OCR with a different program like OmniPage Professional (desktop app). Scan quality is the same: we're using B&W 300 DPI, and that is the max we can do to manage file size. It's never been a problem in the past for the several years we've been using OmniPage. Color, while better quality, adds an enormous amount of file size that isn't sustainable for performance reasons, plus a sharp increase in storage usage (and related costs).

Can you share an example of the full command that employs the --threshold argument to improve the results on Page 4? The only reference to this argument I could find on the documentation site was in the v7.3.0 release notes, but I didn't see how parameter values could be passed.

Also, please clarify whether --threshold will still allow the original images to remain intact in the final output, without rasterization. As mentioned earlier, we found image quality declining a great deal with each successive rasterization of a document as it gets versioned in Alfresco Share (ECM). So, ultimately, we decided not to rasterize or try to improve image quality at all using OCRmyPDF, as the legibility and preservation of the original scanned or uploaded images is most important, with optimum OCR search capabilities being second place. We're trying to figure out how to strike a balance here for improved OCR results without changing the original images in any way or increasing file size significantly.

On the file size topic, I'm confused why the previous command you shared, ocrmypdf -f ~searchable-text-issue1-1-1.pdf _.pdf, resulted in a file 5 times larger than the original. Could you help explain why this would happen? Also, is the -f argument a typo? If not, could you help clarify what it does or point me to the related documentation?

As reference, here is the current command we have been using for the very first version 1.0 of any new files uploaded into Alfresco: ocrmypdf --verbose 1 --skip-text -l eng --output-type pdf --rotate-pages --rotate-pages-threshold 1 v_1.0.pdf v_1.1.pdf

And, for each successive version after the first version (i.e. version 1.1 and beyond), we use the following command: ocrmypdf --verbose 1 --skip-text -l eng --output-type pdf v_1.1.pdf v_1.2.pdf

There's addt'l background why we auto-rotate the first version and NOT any others afterwards, but I'll spare the full reasoning. Basically, it is a workaround due to a different issue we came across related to the third-party annotation tool being used. And, we decided to employ the --skip-text argument in the commands so OCR would not be re-run unnecessarily where text already exists (to help with performance issues and overall processing time). Of course, we need to re-evaluate the command based on your guidance, but just wanted to explain why this is the current baseline.

Many thanks in advance for helping us get over the hump here!!

ajab21 avatar Mar 06 '19 22:03 ajab21

Thanks for your contribution; it is very much appreciated.

Regarding the dilemma with Page 4, we're not having the same issues when running OCR using a different program like OmniPage Professional (desktop app). [...]

For a problem like OCR, you need to convert the input to a monochrome image because text fundamentally has a distinct foreground and background. The color or shade of text has no special meaning. We wouldn't want an OCR engine to get confused about trying to find meaning in that.

The algorithm that separates foreground from background is the thresholder.

The problem we're having is that thresholders don't do well on noisy inputs. OmniPage still has to deal with this; it just does the job better.

Can you help share an example of the full command that employs the --threshold argument to improve the results on Page 4?

There are no arguments; it's just an on switch.

ocrmypdf --threshold -f searchable-text-issue1-1-1.pdf _out.pdf

(I realize now it's confusingly named since a threshold implies a parameter.) What it actually means is "enable OCRmyPDF's thresholder" instead of "use Tesseract's thresholder". OCRmyPDF's thresholder is just "Otsu's algorithm on a normalized background" as implemented in Leptonica. I believe it does better than Tesseract's for most inputs, but yours is a challenge for both thresholders.
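If it helps to picture what that means, here is a rough illustration of Otsu-style thresholding using scikit-image. It is not ocrmypdf's actual code path, which uses Leptonica and normalizes the background first, and the filename is a placeholder:

import numpy as np
from skimage import io, filters

gray = io.imread("page.png", as_gray=True)
t = filters.threshold_otsu(gray)                      # one global threshold chosen from the histogram
binary = gray > t                                     # True = background (white), False = text (black)
io.imsave("page_bw.png", (binary * 255).astype(np.uint8))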

please clarify if --threshold will still allow for the original images to remain intact in the final output without rasterization.

Absolutely, the original remains intact. That argument tells ocrmypdf to prepare a special image for tesseract that is easier to handle in OCR.

why the previous command you shared ocrmypdf -f ~searchable-text-issue1-1-1.pdf _.pdf resulted in 5 times larger file size compared to the original...

As I understand your use case, --skip-text is the option you should be using. -f is the --force-ocr argument, and you may not need it at all.

It tells ocrmypdf to rasterize all pages to an image and OCR them, discarding the original file. The default behavior is to graft an OCR layer onto the original. Since the file in question already had OCR, I was using this command to see how well ocrmypdf performed on the input images you have. (The command exists for exactly this kind of situation, when we just want to see what the OCR engine will do.)

The reason the file size increases in this case is that the annotations you added (the arrows pointing to problematic OCR, etc.) are vector objects. When you ask me to rasterize the entire file to one image per page, I need to pick the least common denominator of color space and DPI that can reasonably represent the original. For vector content, I render the whole page at 400 DPI to ensure it is accurately represented, even though the input image is 150 DPI.
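As a rough worked example, assuming a US Letter page (8.5 × 11 inches): at 150 DPI the page image is about 1275 × 1650 pixels, while at 400 DPI it is about 3400 × 4400 pixels -- roughly seven times as many pixels before compression, which is why the rasterized output grows so much.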


Where this all leaves us is: unfortunately, both ocrmypdf and Tesseract currently have no good option for dealing with "Page 4".

Starting from here, we have text on a very noisy background:

[image: original crop from page 4, text on a very noisy background]

But as discussed, Tesseract first uses its thresholder to convert the image to pure black and white. This is what Tesseract "sees" - completely useless.

[image: what Tesseract's own thresholder produces from that crop]

This is what Tesseract will see when OCRmyPDF uses its improved thresholder (--threshold):

[image: what OCRmyPDF's --threshold thresholder produces from that crop]

So for this image my thresholder is better, but still not good enough for Page 4.

In an image editor I tried using a median filter followed by thresholding. (Just a quick experiment, not coded.) The result is:

[image: result of a median filter followed by thresholding]

So I suspect that if I add a median filter to my thresholder, we may get usable results for you. You could also experiment with the function select_ocr_image in _pipeline.py, which is where we apply the ocrmypdf thresholder.
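If you want to try the median-filter idea yourself in the meantime, a minimal sketch with Pillow might look like the following; the filter size and threshold value are guesses you would need to tune per document, and this is not the Leptonica code ocrmypdf actually uses:

from PIL import Image, ImageFilter

img = Image.open("page4.png").convert("L")            # grayscale copy of the page image
smoothed = img.filter(ImageFilter.MedianFilter(5))    # knock down the salt-and-pepper noise
binarized = smoothed.point(lambda p: 255 if p > 160 else 0).convert("1")
binarized.save("page4_for_ocr.png")                   # feed this copy to Tesseract, not the PDF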

jbarlow83 avatar Mar 07 '19 08:03 jbarlow83

@jbarlow83 Thanks so much for the great explanation. This particular example we're looking at is an extreme one, representing close to the worst-case scenario. So I'm wondering whether using --threshold in general for less extreme examples ought to provide some meaningful level of improvement for other characters. We'll give it a shot and see.

About the example with arrows, sorry about that. I should have sent the example pre-OCR, without arrows and with just the image layer. We've played with --force-ocr in the past and found it problematic, so we decided to back that out. Now that you've clarified -f is the same as --force-ocr, it makes sense why the file size increased so much while actually improving the OCR output. Unfortunately, this trade-off isn't one we can accept, as image rasterization landed us in no-man's land, with our particular documents becoming totally illegible after a few version updates (plus the increased file size -- not 5 times, but still significantly more than the original).

So I suspect if I add a median filter to my thresholder we may get usable results for you. You could also experiment with the function select_ocr_image in _pipeline.py, which is where we apply the ocrmypdf thresholder.

As for playing around with that function in _pipeline.py, we'll hold off until we see whether --threshold makes a meaningful enough improvement for the majority of cases. If not, we may come back to this to experiment more.

Thanks again!!!

ajab21 avatar Mar 08 '19 01:03 ajab21

Good evening,

I'm experiencing a similar problem, but I have a conceptual question: why is OCRmyPDF changing the image output at all? I thought it would not, based on what I read in the README:

Keeps the exact resolution of the original embedded images

My case is the following: I have a long screenshot (webpage) that I cut into many pieces (via Pillow, lossless); after this operation the PNG looks like this:

[image: one of the PNG slices after cutting]

After that, I convert it to PDF and the output looks like this:

[image: the page after conversion to PDF]

And then I OCRmyPDF the file:

subprocess.run(["ocrmypdf", "-l", "eng+deu+fra", "--threshold", "../pdfs/yourfile.pdf", "../pdfs/mvp.pdf"])

and I get some noise around the letters (it does the same without --threshold):

[image: the OCRed output showing noise around the letters]

Also, the size of the PDF went from 2.3 MB to 812 KB, but I would have preferred no compression at all...

Am I missing something?

lolobosse avatar Apr 26 '20 20:04 lolobosse

Use --optimize 0 and --output-type pdf to disable optimization and avoid recompression.

Image resolution never changes by default, but recompression can occur.
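For example, your call might become something like this (paths are your own; only the two extra options are new):

subprocess.run(["ocrmypdf", "-l", "eng+deu+fra", "--threshold", "--optimize", "0", "--output-type", "pdf", "../pdfs/yourfile.pdf", "../pdfs/mvp.pdf"])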


jbarlow83 avatar Apr 26 '20 20:04 jbarlow83