jekyll Lesson Maintenance Issue: Working with batches of PDF files

A reader has got in touch to report the following:

--

• The full title of the lesson: Working with batches of PDF files
• The system you are using (Mac, Linux, Windows): Linux (EndeavourOS/Arch)
• Version numbers of the relevant software you are using: ocrmypdf v13.4.0
• The exact steps you took that caused the problem:

The command taken from here: https://programminghistorian.org/en/lessons/working-with-batches-of-pdf-files#text-recognition-in-pdf-files

find . -name '*.pdf' -exec ocrmypdf --language eng --deskew --clean '{}' '{}' \;

used to work perfectly for me, but for some weeks now it has not worked and I get this error: error2501

The command still works if I leave the --clean part out:

find . -name '*.pdf' -exec ocrmypdf --language eng --deskew '{}' '{}' \;

I don't know if this is an issue on my end or if it was caused by an ocrmypdf update and the command needs updating.

--

Feb 24 '22 19:02 anisa-hawes

Notes: I have worked through these steps and encounter an error when running either command:

$ find . -name '*.pdf' -exec ocrmypdf --language eng --deskew --clean '{}' '{}' \;
$ find . -name '*.pdf' -exec ocrmypdf --language eng --deskew '{}' '{}' \;

From my side, in both cases the error reads:

PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr)

This is listed as a common error here https://ocrmypdf.readthedocs.io/en/latest/errors.html and is said to occur where the file(s) we are running ocrmypdf on already contain printable text or hidden OCR text layers. This is confusing to me, because the lesson says that "OCRmyPDF automatically skips PDFs that already contain embedded text".

I'm working on macOS Big Sur v.11.5.2 rather than Linux, but I am also using ocrmypdf v13.4.0.

Feb 25 '22 12:02 anisa-hawes

I was the person who emailed this bug report in a while ago. Now, in OCRmyPDF 13.4.2-1, everything works again for me. So I guess the issue was on OCRmyPDF's end. I think this issue can be closed.

I can once again use find . -name '*.pdf' -exec ocrmypdf --language eng --deskew --clean '{}' '{}' \; including the --clean part.

Apr 06 '22 15:04 moritz-john

Dear @moritz-john,

Welcome! Thank you for contributing.

Following on from my most recent email, I am noting here that I am actually continuing to encounter errors when I run the first two commands in this tutorial.

In my most recent experiments with ocrmypdf v.13.4.0, I received messages similar to those that can be expected and are explained in the lesson, e.g., "[tesseract] lots of diacritics - possibly poor OCR" and "Some input metadata could not be copied because it is not permitted in PDF/A".

Screenshot 2022-04-08 at 17 15 21

Since updating to v.13.4.2, I now receive "[tesseract] Empty page!!" and "PriorOcrFoundError: page already has text!"

Screenshot 2022-04-08 at 17 57 57

Apr 08 '22 17:04 anisa-hawes

Note to self that this thread https://stackoverflow.com/questions/55704218/how-to-check-if-pdf-is-scanned-image-or-contains-text/59098700#59098700 might offer me some clues.
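One way to check which files already carry a text layer before running OCR is sketched below. This is only an illustration under assumptions: it relies on pdftotext from Poppler being installed, and pdf_has_text is a hypothetical helper name, not something from the lesson or from ocrmypdf.

```shell
# Sketch: decide whether a PDF already contains extractable text.
# Assumes pdftotext (Poppler) is on the PATH; pdf_has_text is a made-up name.
pdf_has_text() {
  # Write the extracted text to stdout ("-"), then drop all whitespace
  # (including the form feeds pdftotext emits between pages).
  text="$(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]')"
  # Succeed only if something other than whitespace was extracted.
  [ -n "$text" ]
}

# Example loop over the current directory (commented out so that
# sourcing this sketch has no side effects):
# for f in ./*.pdf; do
#   if pdf_has_text "$f"; then echo "has text: $f"; else echo "no text:  $f"; fi
# done
```

A file reported as "has text" is one that would likely trigger PriorOcrFoundError unless one of the flags named in that message is passed.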

Apr 08 '22 17:04 anisa-hawes

Another note: I understand from the tutorial that ocrmypdf automatically skips PDFs that already contain embedded text.

And, I have been able to successfully move onwards and extract the embedded texts from all the sample PDF files provided in the lesson.

So, perhaps @moritz-john is right and the issue here is resolved by the latest software update... 🤔

Apr 08 '22 18:04 anisa-hawes

@anisa-hawes can I help resolve this issue?

Sep 08 '22 01:09 hawc2

Thank you for offering support @hawc2. I apologise for the long delay in following up here.

I've had in mind that, while reviewing a separate lesson, one of our outside collaborators, Ryan Cordell, alerted me to a further issue with this lesson.

Ryan explained that:

Essentially, newer versions of OS X seem to run into problems installing ImageMagick: the Homebrew installation instructions outlined in this lesson kept throwing errors, particularly an X11 error “delegate library support not built in (X11).” I found that error message all over Stack Overflow and similar sites. I’m looking back over these many threads in my history trying to remember which one specifically solved my issue. I should have taken better notes, but I hadn’t even made it to the lesson I was reviewing yet! As I recall, I had to install X11 and/or XQuartz separately because they no longer ship with OS X and then use a specific installation flag brew install imagemagick --with-x11 to get the right version of ImageMagick.

That’s all very confusing and I can’t replicate it anymore, since I’ve updated whatever needed updating at the time, so I guess my suggestion would be to find someone with a newer installation of OS X and see if they can install things the way the lesson asks?

So I need to take another look. I work on a Mac, although I do not have the latest OS.

Sep 09 '22 11:09 anisa-hawes

I just tested it on macOS Big Sur and the installation worked fine.

I got the same error message you received @anisa-hawes, but I think it's just a warning. As it says, you can also add the argument --force-ocr, which seems to work.
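For anyone hitting the same PriorOcrFoundError, the sketch below wraps the lesson's find command so that the handling of existing text is explicit. It is only an illustration: ocr_batch and OCR_BIN are names made up for this example, and --clean is left out because it was the flag involved in the original report. The flags --skip-text and --force-ocr are the ones named in the error message itself.

```shell
# Sketch: batch OCR with an explicit policy for PDFs that already have text.
# OCR_BIN and ocr_batch are hypothetical names introduced for this example.
OCR_BIN="${OCR_BIN:-ocrmypdf}"

ocr_batch() {
  # $1: directory to search (defaults to the current directory)
  # $2: flag for pages that already contain text
  #     (--skip-text or --force-ocr; defaults to --skip-text)
  dir="${1:-.}"
  mode="${2:-}"
  [ -n "$mode" ] || mode="--skip-text"

  # As in the lesson, input and output are the same path,
  # so each PDF is overwritten in place.
  find "$dir" -name '*.pdf' \
    -exec "$OCR_BIN" --language eng --deskew "$mode" '{}' '{}' \;
}

# Example calls (commented out so that sourcing this sketch runs nothing):
# ocr_batch .               # leave pages with existing text alone
# ocr_batch . --force-ocr   # re-run OCR on every page
```

As far as I understand, --force-ocr rasterises any existing text before running OCR again, so it is worth trying on copies of the files first, since the command overwrites each PDF in place.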

This lesson will likely face a bunch of sustainability issues. It may benefit from a revisit and upgrade, but this particular bug report can probably be closed.

Sep 09 '22 21:09 hawc2
