OCRmyPDF

Option to remove blank pages

Open OCRmyPDF-issuebot opened this issue 9 years ago • 19 comments

Issue by drdownload Thu Oct 30 08:25:16 2014 Originally opened as https://github.com/fritz-hh/OCRmyPDF/issues/98

It would be great to have an option to remove blank pages. I scan a lot of images with my duplex scanner, and not all scanned documents have a printed back side.

Sep 14 '15 01:09 OCRmyPDF-issuebot

Comment by eloops Wed Dec 10 11:49:27 2014

I've been modifying this for my own use. There is a program (here) called 'empty-page' that returns 0 or 1 depending on whether the page is blank. It works on PNM files as well as TIFF, and it is fast.

I inserted the following code in ocrPage.sh just after the conversion to .pnm:

# check to see if image is a blank page ... if so, delete it
[ $VERBOSITY -ge $LOG_DEBUG ] && echo "Page $page: Detecting if blank page ..."
empty-page -i "$curImgPixmap" >/dev/null 2>&1
if [ $? -ne 1 ] && [ $KEEP_TMP -eq 0 ]; then
  [ $VERBOSITY -ge $LOG_DEBUG ] && echo "Page $page: Deleting blank page and moving on."
  rm -f "$curOrigImg"*
  rm -f "$curHocr"
  rm -f "$curImgPixmap"
  rm -f "$curImgPixmapDeskewed"
  rm -f "$curImgPixmapClean"
  rm -f "$curImgInfo"
  exit 0
fi

Sep 14 '15 01:09 OCRmyPDF-issuebot

Comment by zorglups Fri Mar 13 21:51:50 2015

My 2 cents. I scan most things with a duplex scanner too, and I will implement something to remove the last page if it is blank.

Because I want my PDFs to be replicas of the original documents, intermediate blank pages should remain for my archiving use.

Sep 14 '15 01:09 OCRmyPDF-issuebot

Comment by Wikinaut Tue Sep 8 13:47:03 2015

Just another idea:

Another (secondary) empty-page detection decision could be based on Tesseract's text output for that page (e.g. number of detected characters or the like).
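The idea above can be sketched as a small check on the text recognized for a page. This is only an illustration: `looks_blank_from_text` and its `min_chars` default are hypothetical names and values, not part of OCRmyPDF or Tesseract, and the OCR text is assumed to have been obtained already by whatever means.

```python
def looks_blank_from_text(ocr_text: str, min_chars: int = 5) -> bool:
    """Return True when the OCR text holds fewer than min_chars letters or digits.

    Hypothetical helper: whitespace, form feeds and punctuation are ignored,
    so stray specks recognized as '.' or '-' do not count as content.
    """
    meaningful = sum(1 for ch in ocr_text if ch.isalnum())
    return meaningful < min_chars


# Example: a page with only whitespace versus a page with real text
print(looks_blank_from_text(" \n\x0c"))
print(looks_blank_from_text("Invoice 2014"))
```

Note that with this default a page containing only a page number would also be treated as blank, so such a check would probably serve best as a secondary signal next to an image-based one.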

Sep 14 '15 01:09 OCRmyPDF-issuebot

Comment by Wikinaut Tue Sep 8 13:51:26 2015

and http://superuser.com/questions/343385/detecting-blank-image-files shows alternative solutions for blank page detection.

Sep 14 '15 01:09 OCRmyPDF-issuebot

@jbarlow83 Thanks for an excellent tool and for sharing it with the community. You :metal: Is this feature request still being worked on?

Feb 15 '17 14:02 modulexcite

I've discovered a script that uses ghostscript for blank page detection.

Maybe it will be of use to you and easily integrable into ocrmypdf.

Jan 09 '19 11:01 sojusnik

+1 for this one. Currently I simply handle it in a preprocessing step.

Feb 13 '19 21:02 WillemJansen

@jbarlow83 Thanks too for this great tool. I used my own script since I was not aware of this amazing piece of software.

I would like to push this one with my ideas as well. Since I am not too familiar with Python, the pipeline and leptonica, I am unfortunately unable to implement it myself. But I would like to share my ideas about this, since all the basics are available:

An empty page could be detected by calculating the ratio of black pixels to white pixels. If the ratio is below e.g. 0.005, the page is considered blank.

Calculating the ratio could be done with ghostscript's inkcov option as in the script above or even easier with leptonica: Use a 1bpp representation of the page and pixCountPixels to determine the black pixels, divided by the number of pixels of the page.

So the todo would be:

An option --remove-empty-pages to enable the page removal process
An option --empty-page-threshold 0.005 to be able to modify the threshold of the percentage of black pixels in the page
Create a task_remove_empty_page which calls the function to determine if the page is empty. Either the task removes the PNG or it stops the pipeline; that's where I don't know enough of the ruffus magic. Maybe something similar to the ocr_or_skip task.
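The first two items of the todo list above could look roughly like the following. This is a minimal sketch under stated assumptions: the page is taken to be already binarized into rows of 0 (white) and 1 (black), the ratio is computed as black pixels divided by all pixels of the page (the leptonica variant described above), and `ink_ratio`, `is_empty_page` and `build_parser` are hypothetical names. The two options are the ones proposed in this comment; they are not existing OCRmyPDF flags.

```python
import argparse


def ink_ratio(bitmap):
    """Fraction of black pixels in a 1-bit page given as rows of 0/1 values."""
    total = sum(len(row) for row in bitmap)
    if total == 0:
        return 0.0
    black = sum(sum(row) for row in bitmap)
    return black / total


def is_empty_page(bitmap, threshold=0.005):
    """Treat the page as empty when its ink ratio is below the threshold."""
    return ink_ratio(bitmap) < threshold


def build_parser():
    """Command-line options as proposed in the todo list (hypothetical)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--remove-empty-pages", action="store_true",
                        help="enable the page removal process")
    parser.add_argument("--empty-page-threshold", type=float, default=0.005,
                        help="maximum fraction of black pixels for an empty page")
    return parser
```

A caller would parse the options once and then call `is_empty_page(bitmap, args.empty_page_threshold)` for each page when `args.remove_empty_pages` is set.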

Jul 12 '19 08:07 svenihoney

I'm close to releasing a new version, most of which is in the api branch. It should make this sort of thing easier, since there will be a sort of plugin interface where various steps can be customized externally. I don't yet have an extension point that would change the page count, but that's at least the place to look for targeting this change.

A separate issue is actually getting a good blank page detector. My scanner has a threshold-based blank page detector, but I've turned it off because it's so unreliable. I can tell you from experience that a blank page detector based on counting against a single threshold is more trouble than it's worth on real documents due to false positives (i.e. discarding a useful page that wasn't blank).

Looking at black and white only, the detector ought to put more weight on the center of a page being blank and less on the margins or the typical locations of hole punches and staples. A single mark like a page number at the bottom of a blank page should not cause removal. Paper with grainy texture will tend to scan with a lot of "salt and pepper" noise, but should still be considered blank. Any thresholds need to be scaled reasonably for documents with large page sizes. And it should work consistently for grayscale and color pages, with the unique cases that brings: bleed-through from the previous page, very faded pages, and multiple colors being indicative of content.
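Two of the black-and-white criteria above (less weight on the margins, and ignoring salt-and-pepper noise) can be illustrated with a weighted score. This is a rough sketch, not a proven detector: `weighted_ink_score`, the 10% margin and the 0.2 margin weight are all hypothetical choices, the page is assumed to be rectangular rows of 0 (white) and 1 (black), and grayscale or color handling is left out entirely.

```python
def weighted_ink_score(bitmap, margin=0.1, margin_weight=0.2):
    """Weighted fraction of black pixels in a 1-bit page (rows of 0/1 values).

    Black pixels with no black 8-neighbour are treated as noise and ignored.
    Pixels within `margin` of any edge count with `margin_weight`; the rest
    count with weight 1.0. The result is normalized by the total weight, so
    it does not depend on the page size in pixels.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    if height == 0 or width == 0:
        return 0.0
    margin_rows, margin_cols = int(height * margin), int(width * margin)

    def has_black_neighbour(y, x):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if (dy or dx) and 0 <= ny < height and 0 <= nx < width:
                    if bitmap[ny][nx]:
                        return True
        return False

    ink = 0.0
    total = 0.0
    for y in range(height):
        for x in range(width):
            in_margin = (y < margin_rows or y >= height - margin_rows
                         or x < margin_cols or x >= width - margin_cols)
            weight = margin_weight if in_margin else 1.0
            total += weight
            if bitmap[y][x] and has_black_neighbour(y, x):
                ink += weight
    return ink / total
```

With these choices, isolated specks contribute nothing and a small mark near the bottom edge scores lower than the same mark in the middle of the page. Whether that is enough to avoid discarding real pages is exactly the open question raised above.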

It would be worth seeing if there's anything in the literature. I'd recommend starting there if you're interested in working on this and seeing if we can get a good algorithm. Using unpaper is one possibility.

Jul 12 '19 18:07 jbarlow83

I have the case that OCRmyPDF is stopping when it detects a blank page, page 6 here:

➜ OCRmyPDF-LOG:

INFO - reading file from standard input
INFO - Start processing 2 pages concurrently
INFO - 2: page is facing ⇧, confidence 13.48 - no change
INFO - 1: page is facing ⇧, confidence 12.45 - no change
INFO - 3: page is facing ⇧, confidence 14.77 - rotation appears correct
INFO - 4: page is facing ⇧, confidence 16.16 - rotation appears correct
WARNING - 6: [tesseract] Warning. Invalid resolution 0 dpi. Using 70 instead.
INFO - 6: [tesseract] Too few characters. Skipping this page
ERROR - 6: [tesseract] Error during processing.
INFO - 6: page is facing ⇧, confidence 0.00 - no change
INFO - 5: page is facing ⇧, confidence 11.54 - no change
INFO - Optimize ratio: 1.00 savings: -0.1%
INFO - Image optimization did not improve the file - discarded
INFO - Output sent to stdout

← OCRmyPDF-LOG-END

any hints?

Sebastian

Mar 22 '20 16:03 enterframe

@enterframe That message simply says that too few characters were recognized on a particular page, so Tesseract assumed that none of them were valid. It did not stop the process; it just did not find anything it was confident was text. It appears that the file was created successfully ("sent to stdout").

Mar 23 '20 06:03 jbarlow83

+1 on this

Feb 24 '21 07:02 disaster123

Does anybody have a workaround?

May 19 '21 07:05 disaster123

Just a thought:

Once there is a feature to detect blank pages, it might come in handy to optionally replace a "physical blank" page with a "digital blank" page instead of removing it.

As @zorglups stated, there might be reasons to keep blank pages, for example for archiving purposes or to keep the order of even and odd pages when displaying them side by side in a PDF viewer.

The benefit of a "digital blank" page would be a much smaller file size (almost zero) compared to a "physical blank" page that might have a slightly gray shadow for example.

May 19 '21 18:05 CWempe

I agree - digital blank has significant advantages in most cases.

If only there were a reliable algorithm for blank page detection.... I think it may be a machine learning problem.

May 21 '21 08:05 jbarlow83

You can try my "Noora PDF" software project. It has AI inside, and I trained it on some scanned pages with punch holes. Maybe this will work for you: https://www.softpedia.com/get/Office-tools/PDF/Zautin-Simple-PDF-Watermark.shtml It is free, but you can Donate :)

Dec 03 '21 21:12 vlad12244

+1

Jan 13 '22 22:01 lecramr

+1

Jul 16 '22 23:07 patric-r

+1

May 13 '23 16:05 kidexx