pypdf
Support for outline item external references
Explanation
I'm not sure if this is a request for a new feature or documentation to explain how this is already possible...
My knowledge of the PDF internal format is microscopic, but I know that PDF supports internal links (to images, pages, etc.) and external links (web pages, other files, email addresses, ...). I don't see how pypdf supports external links.
Here's my situation: I have a PDF file (from a CD I purchased) that has outline links to pages and external files. It's a scan of a book, almost 1200 pages, so the links to sections of the document are quite handy. Trouble is, the pages are all just images. It would be very useful to be able to search for text and copy text for use elsewhere. (Fair use, of course.)
Yes, I know there are tools that OCR PDF files, but everything I've tried balks at a file that large, at least without a charge.
So I:
- Split the big file into 100 page chunks.
- OCR scanned each chunk.
- Merged the scanned chunks back into a single file.
That worked perfectly, except that while the text in the result is all nicely scanned, the outline is gone. So I'm using pypdf to merge the original document's outline into the scanned document. That works fine for outline entries that are just headers and for links to pages within the document, but the external links are gone.
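For reference, the split step could be sketched with pypdf roughly as follows. This is a minimal sketch, not the exact script used: the helper names `chunk_ranges` and `split_pdf` and the `.partNNN.pdf` output naming are made up for illustration, and the OCR step itself is left to whatever external tool handles each chunk.

```python
from typing import Iterator, List, Tuple


def chunk_ranges(total_pages: int, chunk_size: int = 100) -> Iterator[Tuple[int, int]]:
    """Yield zero-based, half-open (start, stop) page ranges covering the document."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total_pages, chunk_size):
        yield start, min(start + chunk_size, total_pages)


def split_pdf(source_path: str, chunk_size: int = 100) -> List[str]:
    """Write each chunk to its own file and return the file names.

    The '<source>.partNNN.pdf' naming is a hypothetical choice for this sketch.
    """
    # Imported here so chunk_ranges() can be used and tested without pypdf.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(source_path)
    paths = []
    for index, (start, stop) in enumerate(chunk_ranges(len(reader.pages), chunk_size)):
        writer = PdfWriter()
        for page_number in range(start, stop):
            writer.add_page(reader.pages[page_number])
        path = f"{source_path}.part{index:03d}.pdf"
        with open(path, "wb") as handle:
            writer.write(handle)
        paths.append(path)
    return paths
```

Because the chunks are written in page order, appending the OCR'd chunks back in the same order keeps the page numbers aligned with the original, which is what the outline merge relies on.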
See the code example below. This is just the inner logic to deal with a single outline entry; obviously there's outer logic to deal with lists and embedded lists.
Code Example
Here's what I'm doing now:
from pypdf import PdfReader, PdfWriter
# Setup is basically this:
from_file = PdfReader(open(ORIGINAL_FILE, 'rb'))
scanned_file = PdfReader(open(SCANNED_FILE, 'rb'))
to_file = PdfWriter()
to_file.append_pages_from_reader(scanned_file)
# so at this point, from_file has the desired outlines, and
# to_file has all of the OCR scanned pages but no outlines.
# (Or much of anything else.)
# Then follows loops to apply Destinations from from_file.outline to to_file.
# Omitting the looping logic, each destination is handled as:
pgno = from_file.get_destination_page_number(outline)
if pgno is None:
    next_parent = to_file.add_outline_item_dict(outline, parent=parent_outline)
else:
    next_parent = to_file.add_outline_item(outline.title, page_number=pgno, parent=parent_outline)
# next_parent becomes parent_outline for embedded lists.
# This works fine for references to pages, but external references are lost.
# They just become an item in the outline, but they don't behave like
# in the original document.
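The omitted outer logic might look something like the sketch below. It assumes the outline is a nested list in which a sub-list holds the children of the entry just before it, which is how `PdfReader.outline` appears to be laid out. The names `copy_outline` and `add_item` are hypothetical; `add_item` stands in for the per-entry logic above and is expected to return the newly created outline item.

```python
from typing import Any, Callable, Optional, Sequence


def copy_outline(
    entries: Sequence[Any],
    add_item: Callable[[Any, Optional[Any]], Any],
    parent: Optional[Any] = None,
) -> None:
    """Walk a nested outline and call add_item(entry, parent) for every entry.

    A nested list is treated as the children of the entry just before it,
    so the value returned by add_item becomes the parent of that sub-list.
    """
    last_added = parent
    for entry in entries:
        if isinstance(entry, list):
            # Children of the most recently added entry.
            copy_outline(entry, add_item, parent=last_added)
        else:
            last_added = add_item(entry, parent)
```

With the setup above, `add_item` would wrap the `get_destination_page_number` check and the two `add_outline_item...` calls, and the top-level call would be `copy_outline(from_file.outline, add_item)`.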
So the question is: Is this something that can be done with the current release, but it's too obscure for me to figure out? Or would it be a useful addition in the future? Said feature probably would need a way to tell if an existing outline entry was an external reference, plus a way to specify such a reference in a new file.
Though now that I think of it, outlines can point to other internal things like images. Maybe those are IndirectObjects, so already supported?
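On the "way to tell if an existing outline entry was an external reference" part, here is a rough sketch of what such a check could look like. In the PDF format, an outline entry can carry an action dictionary under `/A`, and the action types `/URI`, `/GoToR` and `/Launch` point outside the document. Whether and where pypdf exposes that dictionary is an assumption here: the sketch guesses that the entry carries its raw dictionary in a `node` attribute and simply returns `None` if it does not. The function names are made up for illustration.

```python
from typing import Any, Mapping, Optional, Tuple

# Action types that point outside the document, mapped to the key that
# holds their target (a URI string or a file specification).
EXTERNAL_ACTION_TARGETS = {"/URI": "/URI", "/GoToR": "/F", "/Launch": "/F"}


def external_target(action: Optional[Mapping[str, Any]]) -> Optional[Tuple[str, Any]]:
    """Return (action type, target) if the action is external, else None."""
    if not action or "/S" not in action:
        return None
    action_type = str(action["/S"])
    target_key = EXTERNAL_ACTION_TARGETS.get(action_type)
    if target_key is None or target_key not in action:
        return None
    return action_type, action[target_key]


def outline_action(outline_item: Any) -> Optional[Mapping[str, Any]]:
    """Fetch the /A dictionary of an outline entry, if one can be found.

    Assumption: the entry exposes its raw PDF dictionary as `node`.
    If that attribute is missing, this returns None.
    """
    node = getattr(outline_item, "node", None)
    if node is None or "/A" not in node:
        return None
    return node["/A"]
```

In the per-entry logic, `external_target(outline_action(outline))` would then be non-`None` for exactly the entries that currently lose their behavior. Writing the matching `/A` dictionary into the new file is the other half of the question and is not covered by this sketch.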