pikepdf
[Question] How to convert named destinations into corresponding page numbers?
Consider the following snippet.
from pikepdf import Pdf
path = "example.pdf"
with Pdf.open(path) as pdf:
    outline = pdf.open_outline()
    for title in outline.root:
        print(title)
        for subtitle in title.children:
            print('\t', subtitle)
It gives the following output.
[ ] Preface -> FM_10002138957
[ ] Contents -> FM_20002138957
[ ] Contributors -> FM_30002138957
[-] 1: Foundations -> Chapterb978-3-319-06910-4_1
[-] 1.1 What Is Law? -> HeadingsSec10002138943
[-] 1.2 Roman Law -> HeadingsSec70002138943
[-] 1.3 Common Law -> HeadingsSec120002138943
[ ] 1.4 Ius Commune -> HeadingsSec160002138943
[-] 1.5 National States and Codification -> HeadingsSec170002138943
[ ] 1.6 Legal Families -> HeadingsSec200002138943
[-] 1.7 From National to Transnational Laws -> HeadingsSec210002138943
[ ] Recommended Literature -> HeadingsBib10002138943
example.pdf Entities such as "FM_10002138957" and "Chapterb978-3-319-06910-4_1" are called named destinations. So far I have not been able to find information on how to convert named destinations into page numbers.
Question: How can I convert the named destinations into the corresponding page numbers?
For a bit of background, PDF has two fundamental storage elements, dictionaries and arrays. On top of these, the PDF spec defines some tree-based data structures, a collection of dictionaries/arrays that are organized in a particular way. If you haven't obtained a copy of the PDF Reference Manual 1.7 (free download) you'll need it to make sense of what I'm describing.
The basic page number is inferred from the page's position in pdf.pages. (Technical detail: The actual data structure in your PDF is a page tree, which pikepdf transparently treats as an array.)
The named destinations contain a reference to the page itself, rather than its number. You could iterate through the pages to check for matching pointers.
Named destinations are normally stored in pdf.Root.Names.Dests, in a name tree, a fairly complex B-tree-like data structure. Name trees are overkill in most cases, so many PDF generators just create a single node in the tree and put all of the entries in that node, in which case it looks like an array of [key1 val1 key2 val2] in sorted order. (In the general case, the name tree could have multiple nodes, and you'd have to explore them to find your key.) I plan to add support for name trees to make them easier to use.
In the case of a flat name tree you would do:
named_dest = 'FM_20002138957'  # given
names = pdf.Root.Names.Dests.Names
for n in range(0, len(names), 2):
    k = names[n]
    v = names[n + 1]
    if k == named_dest:
        named_page = v[0]  # this is the page object being referred to
        break
for n, page in enumerate(pdf.pages):
    if page == named_page:
        page_index_number = n  # this is the page index number
        break
print(page_index_number)  # 7 with your file
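For the general case mentioned above, where the name tree has several nodes, a recursive walk is one way to collect the entries. This is only a sketch: the function names are made up for illustration, and plain dicts and lists stand in for the PDF dictionaries and arrays. It assumes intermediate nodes carry /Kids and leaf nodes carry a flat /Names array, and it does not use the /Limits entries to skip subtrees.

```python
def iter_name_tree(node):
    """Yield (key, value) pairs from a name tree node and all of its descendants."""
    if "/Names" in node:
        names = node["/Names"]
        # Leaf entries are stored flat: [key1, val1, key2, val2, ...]
        for i in range(0, len(names) - 1, 2):
            yield names[i], names[i + 1]
    if "/Kids" in node:
        for kid in node["/Kids"]:
            yield from iter_name_tree(kid)


def find_named_destination(root, wanted):
    """Return the value stored under `wanted`, or None if the key is absent."""
    for key, value in iter_name_tree(root):
        if key == wanted:
            return value
    return None


# Hypothetical tree: plain Python containers standing in for PDF objects.
tree = {
    "/Kids": [
        {"/Names": ["FM_1", ["page-a", "/Fit"], "FM_2", ["page-b", "/Fit"]]},
        {"/Kids": [{"/Names": ["Sec1", {"/D": ["page-c", "/XYZ", 0, 0, 0]}]}]},
    ]
}
print(find_named_destination(tree, "FM_2"))     # ['page-b', '/Fit']
print(find_named_destination(tree, "Missing"))  # None
```

Note that the value found may be either a destination array or a dictionary whose /D entry holds the array, so a caller would still need to handle both shapes before taking the first element as the page.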
Now to make matters more complicated still, your PDF defines custom page labels in pdf.Root.PageLabels. This data structure is a number tree that maps page_index_number to the page number that is shown to the user. The number tree does have multiple nodes.
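To illustrate how a page label is derived from such a mapping, here is a rough sketch. It assumes the number tree has already been flattened into a sorted list of (first_page_index, label_dict) pairs, uses plain dicts in place of PDF dictionaries, and only handles the decimal and Roman numbering styles; the ranges in the example are invented and are not taken from the attached file.

```python
from bisect import bisect_right


def to_roman(n):
    """Convert a positive integer to an upper-case Roman numeral."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
             (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
             (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def page_label(ranges, page_index):
    """Return the label for a 0-based page index.

    ranges: sorted list of (first_page_index, label_dict) pairs.
    """
    starts = [start for start, _ in ranges]
    pos = bisect_right(starts, page_index) - 1
    if pos < 0:
        return str(page_index + 1)  # no range covers this page: fall back
    start, spec = ranges[pos]
    prefix = spec.get("/P", "")
    style = spec.get("/S")
    # /St is the number given to the first page of the range (default 1).
    number = spec.get("/St", 1) + (page_index - start)
    if style is None:
        return prefix  # no numeric part, the label is just the prefix
    if style == "/D":
        return prefix + str(number)
    if style == "/R":
        return prefix + to_roman(number)
    if style == "/r":
        return prefix + to_roman(number).lower()
    raise NotImplementedError(f"style {style} is not handled in this sketch")


# Hypothetical document: 12 front-matter pages in lower-case Roman, then decimal.
ranges = [(0, {"/S": "/r"}), (12, {"/S": "/D"})]
print(page_label(ranges, 7))   # viii
print(page_label(ranges, 12))  # 1
```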
@jbarlow83 I got the point regarding the named destinations. Many thanks for the details!
I am wondering how to deal with documents like the following ones. example_2.pdf example_3.pdf example_4.pdf
Those documents do not seem to have pdf.Root.Names.Dests.Names, but they still have bookmarks.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-6-2eb48b625e5a> in <module>
----> 1 names = pdf.Root.Names.Dests.Names
2 names
AttributeError: /Names
example_2.pdf does not have pdf.Root.Names at all. Rather than using named destinations, it uses explicit destinations, which are more common. These directly refer to the page object in question.
I didn't look at the others but presumably it's the same situation.
@jbarlow83 Could you share a code snippet regarding explicit destinations, please? I mean converting explicit destinations into corresponding page numbers.
The explicit destination is an array, and the first item in the array is an indirect reference to the page. Then you have named_page as in the code above, and can use the same method to obtain the page number.
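A minimal sketch of that lookup, with the function name chosen for illustration and plain Python values standing in for the page objects and the destination array. With real pikepdf objects the equality test may need adjusting depending on the version.

```python
def page_index_from_explicit_destination(dest, pages):
    """Return the 0-based index of the page an explicit destination points to."""
    target = dest[0]  # first array item is the page; the rest describe the view
    for index, page in enumerate(pages):
        if page == target:
            return index
    return None  # the destination does not refer to any page in this list


# Hypothetical stand-ins for pdf.pages and an explicit destination array.
pages = ["page-a", "page-b", "page-c"]
dest = ["page-c", "/XYZ", 0, 792, None]
print(page_index_from_explicit_destination(dest, pages))  # 2
```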
Sorry for hijacking the thread, but I think my case is closely related even though I have actions instead of named destinations.
with pdf.open_outline() as outline:
    for i in list(outline.root):
        print(i)
        for j in i.children:
            print(" ", j)
[ ] Címlap -> <Action>
[ ] Köszöntő -> <Action>
[ ] Tartalom -> <Action>
[-] Anime -> <Action>
[ ] Made in Abyss -> <Action>
[ ] Great Pretender -> <Action>
[ ] Scumbag System -> <Action>
I plan to add support for name trees to make them easier to use.
Is it possible to add an API to the outline with "automatic" page number resolution? I think I'm not the only one who just wants to get something like the output of mutool show some.pdf outline...
As of v2.2.0 you can use pikepdf.Page(the_page).label to look up the label associated with a page, which is a step towards exposing this information in other areas.
@pintergreg the actions often (e.g. when generated by pdflatex?) have an action["/D"] value that is the destination name. But I haven't found a way to resolve this name yet...
A code snippet that works for me, but is likely not at all robust (built using hints from #173):
# Note: I want page numbers starting at 1 instead of 0
pagemap = dict([(page.to_json(), n + 1) for n, page in enumerate(pdf.pages)])
destmap = dict()
for k in pdf.Root.Names.Dests.Kids:
    for i in range(0, len(k.Names), 2):
        pno = pagemap.get(k.Names[i+1].D[0].to_json(), -1)
        destmap[k.Names[i]] = pno
for i in outline.root[0].children:
    print(i.title, destmap.get(i.action.D, -1), sep="\t")
    for c in i.children:
        print(c.title, destmap.get(c.action.D, -1), sep="\t")
Probably to_json isn't the "proper" way to do this, but it works for me. Also the outline may have more levels.
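One possible alternative to keying the map on to_json() output is to key it on each object's (object id, generation) pair. The sketch below assumes every page and every destination target exposes that pair as an objgen attribute; pikepdf objects have such an attribute, but whether it can be read directly from the items of pdf.pages may depend on the pikepdf version, so treat that as an assumption. SimpleNamespace objects stand in for the PDF objects, and the function names are hypothetical.

```python
from types import SimpleNamespace


def build_page_index(pages):
    """Map each page's (object id, generation) pair to its 1-based page number."""
    return {page.objgen: number for number, page in enumerate(pages, start=1)}


def resolve(target, page_index):
    """Return the page number for a destination target, or -1 if unknown."""
    return page_index.get(target.objgen, -1)


# Hypothetical stand-ins for three pages and one destination target.
pages = [
    SimpleNamespace(objgen=(10, 0)),
    SimpleNamespace(objgen=(14, 0)),
    SimpleNamespace(objgen=(21, 0)),
]
index = build_page_index(pages)
print(resolve(SimpleNamespace(objgen=(14, 0)), index))  # 2
print(resolve(SimpleNamespace(objgen=(99, 0)), index))  # -1
```

Building the dictionary once makes each lookup constant time, instead of scanning all pages for every bookmark.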
Is it possible to add an API to the outline with "automatic" page number resolution? I think I'm not the only one who just wants to get something like the output of mutool show some.pdf outline...
Surely you're not the only one ;) I've been struggling with resolving bookmarks to page numbers, too, and I tried to take everything into account that was mentioned in this thread, but my code kept not working for some test documents.
With PyMuPDF, in contrast, it's super easy to read bookmarks: You can simply use the getToC() method to obtain a list of (level, title, page) tuples.
I tried resolving bookmarks again and wanted to share what I'd come up with so far.
PyMuPDF getToC() is considerably faster than my get_pikepdf_toc() method, however.
# SPDX-FileCopyrightText: 2021 mara004
# SPDX-License-Identifier: MPL-2.0
import pikepdf
def analyse_item(item, names, pdf):
    bookmark = None
    resolved = None
    dest = item.destination
    act = item.action
    #print(type(dest), type(act))
    if dest is not None:
        ref = dest
    elif act is not None:
        actkeys = act.keys()
        if ('/D' in actkeys) and ('/S' in actkeys) and (act['/S'] == "/GoTo"):
            ref = act.D
        else:
            assert False
    # direct destination
    if ref._type_name == 'array':
        #print("array")
        resolved = ref[0]
    # named destination
    elif ref._type_name == 'string':
        #print("string")
        #print(ref)
        if names is not None:
            #print(len(names))
            for n in range(0, len(names)-1, 2):
                #print(names[n]._type_name)
                if names[n] == ref:
                    #print(names[n+1]._type_name)
                    if names[n+1]._type_name == 'array':
                        named_page = names[n+1][0]
                    elif names[n+1]._type_name == 'dictionary':
                        named_page = names[n+1].D[0]
                    resolved = named_page
                    break
    pagenum = None
    if resolved is not None:
        for i, p in enumerate(pdf.pages):
            if resolved == p:
                pagenum = i
                break
        #print(f"-> ** resolved: '{item.title}' -> {pagenum} **")
    if pagenum is not None:
        bookmark = (item.title, pagenum)
    #else:
        #print(f"unresolvable: '{item.title}'")
    #print()
    return bookmark
def analyse_children(children, names, pdf):
    bookmarks = []
    for item in children:
        analysed = analyse_item(item, names, pdf)
        if analysed is not None:
            bookmarks.append(analysed)
        if item.children:
            bookmarks.extend(analyse_children(item.children, names, pdf))
    return bookmarks
def get_names(obj):
    names = []
    ks = obj.keys()
    if '/Names' in ks:
        names.extend(obj.Names)
    elif '/Kids' in ks:
        for k in obj.Kids:
            names.extend(get_names(k))
    else:
        assert False
    return names
def has_nested_key(obj, keys):
    ok = True
    to_check = obj
    for key in keys:
        if key in to_check.keys():
            to_check = to_check[key]
        else:
            ok = False
            break
    return ok
def get_pikepdf_toc(pdf):
    #print(pdf.Root.keys())
    if has_nested_key(pdf.Root, ['/Names', '/Dests']):
        names = get_names(pdf.Root.Names.Dests)
    else:
        names = None
    with pdf.open_outline() as outline:
        bookmarks = analyse_children(outline.root, names, pdf)
    return bookmarks
What's still missing in my above code is the level, but this could be added easily using the recursion depth of analyse_children().
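As a rough illustration of tracking the level through the recursion depth, the sketch below walks a tree of items and emits (level, title, page) tuples. The Item class and the resolve callback are hypothetical stand-ins: Item mimics only the title and children attributes of an outline item, and resolve represents whatever function maps an item to a page number (or None when it cannot be resolved).

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """Hypothetical stand-in for an outline item (title and children only)."""
    title: str
    children: list = field(default_factory=list)


def walk_outline(items, resolve, level=1):
    """Return (level, title, page) tuples in document order.

    `resolve` maps an item to a page number, or None if it is unresolvable.
    """
    toc = []
    for item in items:
        page = resolve(item)
        if page is not None:
            toc.append((level, item.title, page))
        # Children sit one level deeper than their parent.
        toc.extend(walk_outline(item.children, resolve, level + 1))
    return toc


# Hypothetical outline and page lookup.
outline = [
    Item("Preface"),
    Item("1: Foundations", [Item("1.1 What Is Law?"), Item("1.2 Roman Law")]),
]
page_by_title = {"Preface": 5, "1: Foundations": 11, "1.1 What Is Law?": 11}
toc = walk_outline(outline, lambda item: page_by_title.get(item.title))
print(toc)
```

Unresolvable items are skipped here, but their children are still visited, which matches how analyse_children() treats them.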
@mara004 When I try running your code I get an address boundary error (segmentation fault) while enumerating pdf.pages, when p.obj is accessed. There was an error on checking the value of p telling me to use p.obj instead. It worked for some of my books, but my electronic copy of The C++ Programming Language, 4th Edition fails. Guess I'll look into PyMuPDF.
@BioBox My above code is a bit outdated. Now that qpdf provides a function to resolve destinations to page numbers, I think it would be a lot easier if someone could just add bindings for that to pikepdf.
@mara004 I already achieved what I was trying to do with MuPDF right before you made that post. But thanks anyway; I wouldn't have been able to do it without your help!
Here's what I was trying to do: https://gist.github.com/BioBox/bf7aa9279a16bd5b8c8d9335989e1324