
Add support for PDF/A-2u or PDF/A-2a

Open frederictobiasc opened this issue 4 years ago • 1 comments

Hi, I'm wondering if it would be beneficial to make the OCR text layer compatible with the PDF/A-2u standard. Since I couldn't find an issue covering this topic, I would like to ask whether somebody has already thought about this.

frederictobiasc, Apr 10 '20 11:04

PDF/A-2u is possible. Ghostscript cannot generate files at this conformance level, but it would be possible to test whether the output conforms and, if it does, promote it to PDF/A-2u. With --force-ocr the output would always conform. Otherwise the output would conform if and only if all fonts have a valid Unicode mapping, which is not an easy test to implement.
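To make the rule above concrete, here is a minimal sketch of the decision in Python. It is not OCRmyPDF code: the function name `may_promote_to_pdfa_2u` is hypothetical, and the fonts are represented as plain dictionaries rather than objects read from a real PDF. It also assumes, as a deliberate simplification, that a font has a valid Unicode mapping only when it carries a `/ToUnicode` entry. A real check would have to inspect the contents of that mapping and handle fonts that qualify without one, which is the hard part.

```python
from typing import Iterable, Mapping


def may_promote_to_pdfa_2u(
    fonts: Iterable[Mapping[str, object]], force_ocr: bool
) -> bool:
    """Rough heuristic for whether output could be declared PDF/A-2u.

    `fonts` is a hypothetical stand-in for the font dictionaries found in
    the output file. Assumption: the presence of a /ToUnicode key counts
    as "has a Unicode mapping"; this does not validate the mapping itself.
    """
    if force_ocr:
        # With --force-ocr the pages are rasterized, so only the OCR text
        # layer remains and the output would always conform.
        return True
    # Otherwise every font must have a Unicode mapping.
    return all("/ToUnicode" in font for font in fonts)
```

For example, `may_promote_to_pdfa_2u([{"/BaseFont": "/Helvetica"}], force_ocr=False)` returns `False` under this simplified rule, because that font dictionary has no `/ToUnicode` entry.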

2a is not possible, as it requires detailed, user-generated "tagging" of the meaning of text (this is a heading, this is a paragraph, this is an image and the description of the image is as follows) as well as a proper reading order. This requires a complex GUI. It is actually rather difficult to generate a 2a file even with appropriate tools. I do not believe I have ever seen one "in the wild" - the only ones I have ever seen are examples for use in test suites.

jbarlow83, Apr 10 '20 11:04