Wishlist - Ability to identify different methods of producing OCR'ed PDF from scanned document
The understanding at the National Archives and Records Administration is that there are three ways to scan a document into the PDF format. One is to simply put the scan into the PDF, all bits included. The other two methods involve adding OCR'ed data to the file. They are "Searchable Image - Exact" and "Formatted Text and Graphics and PDF Normal".
Is there a way to determine programmatically, on a page-by-page basis, which method was used for any given scan?
The "Formatted Text and Graphics and PDF Normal" version of the data is not acceptable as a permanent record because it discards image data and replaces it with OCR'ed data that is sometimes imperfect.
It would be very helpful in the archival world to be able to analyze a PDF file and determine which of the three methods the file was created with. Yes, we have seen federal agency data which appear to have all three methods in the same file!
Thanks.
> The understanding at the National Archives and Records Administration is that there are three ways to scan a document into the PDF format.
Actually, there are MANY ways to scan a document into a PDF. But they can be broken down into three categories.
- Image only. The scanned images are turned into PDF pages without modification (modulo format and/or compression transcoding) and no OCR is performed.
- Image + Hidden Text. Builds on "image only" by adding text (most likely from an OCR process) to the PDF page (either above or below the image in the Z-order of the graphics objects)
- Converted content (aka "Formatted Text and Graphics") - A completely different method where the scanned image is analyzed and individual portions are reconstructed into "native" PDF graphics objects - text, vectors, rasters.
However, each of these has "sub-categories" based on various techniques for improving compression ratios, searchability, etc. For example, there is the use of MRC (Mixed Raster Content) to segment an image into areas that can each be compressed with the technique best suited to it, or the use of "Searchable Vectors".
> Yes, we have seen federal agency data which appear to have all three methods in the same file!
Yes, that is perfectly reasonable and not uncommon! Remember that each page of a PDF is "separate", which is why even things like page sizes are per-page and not per-document. Combining pages/documents is one of the top three "edits" performed on PDFs.
> Is there a way to determine programmatically, on a page-by-page basis, which method was used for any given scan?

The content of a page is easily evaluated through various heuristics to determine its characteristics. However, since many operations can be done post-scan, possibly by later processes, the result doesn't necessarily reflect the original scanning process, just what you have at that moment.
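As a rough illustration of what such a heuristic might look like, here is a minimal Python sketch that sorts a single page into the three categories above by scanning its content stream. It is not a definitive implementation. It assumes the content stream has already been decoded to text by some PDF library, that the caller supplies the names of the page's image XObjects, and that hidden text is marked with text rendering mode 3 (`3 Tr`, invisible text). The names `classify_page`, `TOKEN` and `TEXT_SHOW` are made up for this example. It does not detect text hidden by being painted underneath the image, and it cannot tell converted content from born-digital content.

```python
import re

# Simplified content-stream tokenizer. Assumptions: the stream is already
# decoded to text, contains no inline images (BI/ID/EI), and literal strings
# contain no unescaped nested parentheses.
TOKEN = re.compile(r"""
    \((?:\\.|[^\\()])*\)    # literal string
  | <[0-9A-Fa-f\s]*>        # hex string
  | /[^\s/\[\]()<>]*        # name
  | [-+]?\d*\.?\d+          # number
  | [A-Za-z'"*]+            # operator
  | [\[\]]                  # array delimiters
""", re.VERBOSE)

# Operators that paint text: Tj, TJ, ' and "
TEXT_SHOW = {"Tj", "TJ", "'", '"'}


def classify_page(content, image_names):
    """Heuristically classify one page from its decoded content stream.

    content     -- the page's content stream as a str
    image_names -- names (without the leading slash) of the page's image XObjects
    """
    render_mode = 0      # text rendering mode; 3 means invisible text
    saved_modes = []     # render modes saved by q and restored by Q
    operands = []        # operands collected since the last operator
    images = visible_text = hidden_text = 0

    for match in TOKEN.finditer(content):
        tok = match.group(0)
        # Strings, names, numbers and array brackets are operands.
        if tok[0] in "(</[]+-." or tok[0].isdigit():
            operands.append(tok)
            continue
        # Anything else is treated as an operator.
        if tok == "q":
            saved_modes.append(render_mode)
        elif tok == "Q" and saved_modes:
            render_mode = saved_modes.pop()
        elif tok == "Tr" and operands:
            render_mode = int(float(operands[-1]))
        elif tok == "Do" and operands and operands[-1][1:] in image_names:
            images += 1
        elif tok in TEXT_SHOW:
            if render_mode == 3:
                hidden_text += 1
            else:
                visible_text += 1
        operands = []

    if visible_text:
        return "converted or native content"
    if images and hidden_text:
        return "image + hidden text"
    if images:
        return "image only"
    return "other"
```

For example, calling `classify_page("q 612 0 0 792 0 0 cm /Im0 Do Q", {"Im0"})` on a page that only paints one full-page image should report "image only", and appending `BT 3 Tr /F1 10 Tf 72 700 Td (Hello) Tj ET` to that stream should move it to "image + hidden text". As noted above, the answer only describes the page as it is now, not how it was originally scanned.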
Note: this is not a PDF file format specification issue related to ISO 32000-2:2020
I think the "Image + Hidden Text" category does need some help. The "Image Only" category is pretty self-descriptive, and for "Converted content" the current AFRelationship mechanism is enough to reveal its relationship with the source image data.
The status quo of the "Image + Hidden Text" approach is not good enough for accessibility, though. Imagine a screen reader with built-in OCR functionality processing an "Image + Hidden Text" PDF file: it will tell the user that there are two pieces of information on the page, one being the "Hidden Text" data and the other the text OCR'ed on the fly from the image. The two pieces of information will interfere with each other, but the AT cannot unify them with confidence since it has no knowledge at all of the earlier OCR process. It would be nice if there were a standard practice that identifies the "Hidden Text" so that AT could behave accordingly.
The PDF/UA TWG reviewed this request and decided to draft a formal PDF Association Application Note providing guidance on this subject. In addition to providing stand-alone guidance on the general subject of scanned documents in PDF, this work will be used to inform improvements to the "Tagged PDF Best Practice Guide: Syntax", also published by the PDF Association.
We will update this issue once the Application Note is developed.