[Question] How to convert named destinations into corresponding page numbers?

Open andrei-volkau opened this issue 3 years ago • 15 comments

Consider the following snippet.

from pikepdf import Pdf

path = "example.pdf"

with Pdf.open(path) as pdf:
    outline = pdf.open_outline()
    for title in outline.root:
        print(title)
        for subtitle in title.children:
            print('\t', subtitle)

It gives the following output.

[ ] Preface -> FM_10002138957
[ ] Contents -> FM_20002138957
[ ] Contributors -> FM_30002138957
[-] 1: Foundations -> Chapterb978-3-319-06910-4_1
	 [-] 1.1 What Is Law? -> HeadingsSec10002138943
	 [-] 1.2 Roman Law -> HeadingsSec70002138943
	 [-] 1.3 Common Law -> HeadingsSec120002138943
	 [ ] 1.4 Ius Commune -> HeadingsSec160002138943
	 [-] 1.5 National States and Codification -> HeadingsSec170002138943
	 [ ] 1.6 Legal Families -> HeadingsSec200002138943
	 [-] 1.7 From National to Transnational Laws -> HeadingsSec210002138943
	 [ ] Recommended Literature -> HeadingsBib10002138943

example.pdf

Entities such as "FM_10002138957" and "Chapterb978-3-319-06910-4_1" are called named destinations. So far I have not been able to find any information on how to convert named destinations into page numbers.

Question: How can I convert the named destinations into the corresponding page numbers?

andrei-volkau avatar Nov 19 '20 11:11 andrei-volkau

For a bit of background, PDF has two fundamental storage elements, dictionaries and arrays. On top of these, the PDF spec defines some tree-based data structures, a collection of dictionaries/arrays that are organized in a particular way. If you haven't obtained a copy of the PDF Reference Manual 1.7 (free download) you'll need it to make sense of what I'm describing.

A page's basic number is inferred from its position in pdf.pages. (Technical detail: The actual data structure in your PDF is a page tree, which pikepdf transparently treats as an array.)

The named destinations contain a reference to the page itself, rather than its number. You could iterate through the pages to check for matching pointers.

Named destinations are normally stored in pdf.Root.Names.Dests, in a name tree, a fairly complex B-tree-like data structure. Name trees are overkill in most cases, and many PDF generators agree: they just create a single node in the tree and put all of the entries in that node, in which case it looks like an array of [key1 val1 key2 val2] in sorted order. (In the general case, the name tree could have multiple nodes, and you would have to explore them to find your key.) I plan to add support for name trees to make them easier to use.

In the case of a flat name tree you would do:

named_dest = 'FM_20002138957'  # given
named_page = None  # stays None if the name is not found
names = pdf.Root.Names.Dests.Names
for n in range(0, len(names), 2):
    k = names[n]
    v = names[n + 1]
    if k == named_dest:
        named_page = v[0]  # this is the page object being referred to
        break

page_index_number = None  # stays None if no page matches
for n, page in enumerate(pdf.pages):
    if page == named_page:
        page_index_number = n  # this is the page index number
        break

print(page_index_number)  # 7 with your file
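
For the general case mentioned above, where the name tree has more than one node, the lookup could be sketched roughly as follows. This is only an illustration: plain Python dicts and lists stand in for the PDF dictionaries and arrays, the names iter_name_tree and lookup are made up for this example, and the sample keys and values are invented. It also ignores the Limits entries that a real name tree carries, which a proper implementation could use to skip subtrees.

```python
def iter_name_tree(node):
    """Yield (key, value) pairs from a name tree node, depth first.

    `node` is a plain dict standing in for a PDF dictionary: a leaf
    holds "Names" (a flat [key1, val1, key2, val2, ...] list) and an
    intermediate node holds "Kids" (a list of child nodes).
    """
    names = node.get("Names", [])
    for i in range(0, len(names) - 1, 2):
        yield names[i], names[i + 1]
    for kid in node.get("Kids", []):
        yield from iter_name_tree(kid)


def lookup(node, wanted):
    """Return the value stored under `wanted`, or None if it is absent."""
    for key, value in iter_name_tree(node):
        if key == wanted:
            return value
    return None


# Invented sample tree: one leaf directly under the root, one nested leaf.
tree = {"Kids": [
    {"Names": ["FM_1", ["page-a"], "FM_2", ["page-b"]]},
    {"Kids": [{"Names": ["Sec1", ["page-c"]]}]},
]}
print(lookup(tree, "Sec1"))
```

With a flat, single-node tree the same function degenerates to the loop shown above, since the root then has "Names" and no "Kids".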

Now to make matters more complicated still, your PDF defines custom page labels in pdf.Root.PageLabels. This data structure is a number tree that maps page_index_number to the page number that is shown to the user. The number tree does have multiple nodes.
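
To illustrate how such a mapping turns a page index into a displayed label, here is a minimal sketch. It assumes the number tree has already been flattened into a plain dict that maps the first page index of each range to its labelling rule, with the keys S (style), P (prefix) and St (first number, default 1) as defined in the PDF reference. The function names and the sample ranges are invented, and only the decimal and roman styles are handled; the letter styles are left out.

```python
from bisect import bisect_right


def to_roman(n):
    """Convert a positive integer to an uppercase roman numeral."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
             (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
             (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def page_label(page_index, label_ranges):
    """Return the label shown to the user for a zero-based page index.

    `label_ranges` maps the first page index of each range to a dict
    with optional keys "S" (style), "P" (prefix) and "St" (start value).
    """
    starts = sorted(label_ranges)
    pos = bisect_right(starts, page_index) - 1
    if pos < 0:
        return str(page_index + 1)  # no range covers this page: fall back
    start = starts[pos]
    rule = label_ranges[start]
    number = rule.get("St", 1) + (page_index - start)
    style = rule.get("S")
    if style is None:
        text = ""  # no numeric part, the label is just the prefix
    elif style == "D":
        text = str(number)
    elif style == "R":
        text = to_roman(number)
    elif style == "r":
        text = to_roman(number).lower()
    else:
        raise ValueError(f"style {style!r} is not handled in this sketch")
    return rule.get("P", "") + text


# Invented example: roman front matter, then decimal numbering from page index 8.
ranges = {0: {"S": "r"}, 8: {"S": "D"}}
print(page_label(7, ranges), page_label(8, ranges))
```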

jbarlow83 avatar Nov 19 '20 19:11 jbarlow83

@jbarlow83 I got the point regarding the named destinations. Many thanks for the details!

I am wondering how to deal with documents like the following ones. example_2.pdf example_3.pdf example_4.pdf

Those documents do not seem to have pdf.Root.Names.Dests.Names, but they still have bookmarks.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-6-2eb48b625e5a> in <module>
----> 1 names = pdf.Root.Names.Dests.Names
      2 names

AttributeError: /Names

andrei-volkau avatar Nov 20 '20 09:11 andrei-volkau

example_2.pdf does not have pdf.Root.Names at all. Rather than using named destinations, it uses explicit destinations, which are more common. These directly refer to the page object in question.

I didn't look at the others but presumably it's the same situation.

jbarlow83 avatar Nov 20 '20 21:11 jbarlow83

@jbarlow83 Could you share a code snippet regarding explicit destinations, please? I mean converting explicit destinations into corresponding page numbers.

andrei-volkau avatar Nov 21 '20 13:11 andrei-volkau

The explicit destination is an array, and the first item in the array is an indirect reference to the page. Then you have named_page as in the code above, and can use the same method to obtain the page number.
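
As a rough sketch of handling both kinds of destination in one place: the code below uses plain Python values as stand-ins (a list for an explicit destination array, a str for a destination name, a dict with a D entry for a destination wrapped in a dictionary). The helper names target_page and page_index and the sample data are invented for illustration; with real pikepdf objects the type checks and the comparison would look different.

```python
def target_page(dest, named_dests):
    """Return the page object a destination points at, or None.

    `dest` is either an explicit destination (a list whose first item
    is the page) or a name to look up in `named_dests`, a dict that
    stands in for the flattened name tree.
    """
    if isinstance(dest, str):
        dest = named_dests.get(dest)  # named destination: resolve the name
    if isinstance(dest, dict):
        dest = dest.get("D")          # value may be wrapped in a dictionary
    if isinstance(dest, list) and dest:
        return dest[0]                # first array item refers to the page
    return None


def page_index(page, pages):
    """Return the zero-based position of `page` in `pages`, or None."""
    for n, candidate in enumerate(pages):
        if candidate is page:  # identity check, fine for these stand-ins
            return n
    return None


# Invented stand-ins for three page objects and two named destinations.
pages = [object(), object(), object()]
named = {"FM_2": [pages[1], "Fit"], "Sec1": {"D": [pages[2], "XYZ"]}}
print(page_index(target_page("Sec1", named), pages))
```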

jbarlow83 avatar Nov 21 '20 23:11 jbarlow83

Sorry for hijacking the thread, but I think my case is closely related even though I have actions instead of named destinations.

with pdf.open_outline() as outline:
    for i in list(outline.root):
        print(i)
        for j in i.children:
            print("  ", j)

[ ] Címlap -> <Action>
[ ] Köszöntő -> <Action>
[ ] Tartalom -> <Action>
[-] Anime -> <Action>
   [ ] Made in Abyss -> <Action>
   [ ] Great Pretender -> <Action>
   [ ] Scumbag System -> <Action>

PDF

I plan to add support for name trees to make them easier to use.

Is it possible to add an API to the outline with "automatic" page number resolution? I think I'm not the only one who just wants to get something like the output of mutool show some.pdf outline...

pintergreg avatar Nov 23 '20 15:11 pintergreg

As of v2.2.0 you can use pikepdf.Page(the_page).label to look up the label associated with a page, which is a step towards exposing this information in other areas.

jbarlow83 avatar Nov 30 '20 21:11 jbarlow83

@pintergreg the actions often (e.g. when generated by pdflatex?) have an action["/D"] value that is the destination name. But I haven't found a way to resolve this name yet...

kno10 avatar Apr 14 '21 10:04 kno10

A code snippet that works for me, but is likely not at all robust (built using hints from #173):

# Note: I want page numbers starting at 1 instead of 0
pagemap = dict([(page.to_json(), n + 1) for n, page in enumerate(pdf.pages)])
destmap = dict()
for k in pdf.Root.Names.Dests.Kids:
  for i in range(0,len(k.Names),2):
    pno = pagemap.get(k.Names[i+1].D[0].to_json(), -1)
    destmap[k.Names[i]] = pno

for i in outline.root[0].children:
  print(i.title, destmap.get(i.action.D, -1), sep="\t")
  for c in i.children:
    print(c.title, destmap.get(c.action.D, -1), sep="\t")

Probably to_json isn't the "proper" way to do this, but it works for me. Also, the outline may have more levels than the two handled here.
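
One possible alternative to serialising every page with to_json is to key the lookup table on the pair (object id, generation). The sketch below assumes that each page and each page reference exposes such a pair through an objgen attribute, as pikepdf does for indirect objects; whether that holds for every object in a given file is an assumption. SimpleNamespace objects stand in for real pages here, and the helper names build_page_map and page_number are made up for this example.

```python
from types import SimpleNamespace


def build_page_map(pages, first_number=1):
    """Map each page's (object id, generation) pair to its page number."""
    return {page.objgen: n for n, page in enumerate(pages, start=first_number)}


def page_number(page_ref, page_map):
    """Look up a page reference in the map; -1 means it was not found."""
    return page_map.get(page_ref.objgen, -1)


# Invented stand-ins: three pages with made-up object ids.
fake_pages = [SimpleNamespace(objgen=(12, 0)),
              SimpleNamespace(objgen=(15, 0)),
              SimpleNamespace(objgen=(21, 0))]
page_map = build_page_map(fake_pages)
print(page_number(SimpleNamespace(objgen=(15, 0)), page_map))
```

The table is built once, so each bookmark is resolved with a single dict lookup instead of a scan over all pages.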

kno10 avatar Apr 14 '21 11:04 kno10

Is it possible to add an API to the outline with "automatic" page number resolution? I think I'm not the only one who just wants to get something like the output of mutool show some.pdf outline...

Surely you're not the only one ;) I've been struggling with resolving bookmarks to page numbers, too, and I tried to take everything into account that was mentioned in this thread, but my code kept not working for some test documents.

With PyMuPDF, in contrast, it's super easy to read bookmarks: You can simply use the getToC() method to obtain a list of (level, title, page) tuples.

mara004 avatar Jun 09 '21 10:06 mara004

I tried resolving bookmarks again and wanted to share what I'd come up with so far. PyMuPDF getToC() is considerably faster than my get_pikepdf_toc() method, however.

# SPDX-FileCopyrightText: 2021 mara004
# SPDX-License-Identifier: MPL-2.0

import pikepdf


def analyse_item(item, names, pdf):
    
    bookmark = None
    resolved = None
    
    dest = item.destination
    act = item.action
    #print(type(dest), type(act))
    
    ref = None
    if dest is not None:
        ref = dest
    elif act is not None:
        actkeys = act.keys()
        if ('/D' in actkeys) and ('/S' in actkeys) and (act['/S'] == "/GoTo"):
            ref = act.D
    if ref is None:
        # neither a destination nor a GoTo action: nothing to resolve
        return None
    
    # direct destination
    if ref._type_name == 'array':
        #print("array")
        resolved = ref[0]
    # named destination
    elif ref._type_name == 'string':
        #print("string")
        #print(ref)
        if names is not None:
            #print(len(names))
            for n in range(0, len(names)-1, 2):
                #print(names[n]._type_name)
                if names[n] == ref:
                    #print(names[n+1]._type_name)
                    if names[n+1]._type_name == 'array':
                        named_page = names[n+1][0]
                    elif names[n+1]._type_name == 'dictionary':
                        named_page = names[n+1].D[0]
                    resolved = named_page
                    break
    
    pagenum = None
    if resolved is not None:
        for i, p in enumerate(pdf.pages):
            if resolved == p:
                pagenum = i
                break
        #print(f"-> ** resolved: '{item.title}' -> {pagenum} **")
        if pagenum is not None:
            bookmark = (item.title, pagenum)
    #else:
        #print(f"unresolvable: '{item.title}'")
    #print()
    return bookmark


def analyse_children(children, names, pdf):
    bookmarks = []
    for item in children:
        analysed = analyse_item(item, names, pdf)
        if analysed is not None:
            bookmarks.append(analysed)
        if item.children:
            bookmarks.extend(analyse_children(item.children, names, pdf))
    return bookmarks


def get_names(obj):
    names = []
    ks = obj.keys()
    if '/Names' in ks:
        names.extend(obj.Names)
    elif '/Kids' in ks:
        for k in obj.Kids:
            names.extend(get_names(k))
    else:
        assert False
    return names

def has_nested_key(obj, keys):
    ok = True
    to_check = obj
    for key in keys:
        if key in to_check.keys():
            to_check = to_check[key]
        else:
            ok = False
            break
    return ok

def get_pikepdf_toc(pdf):
    #print(pdf.Root.keys())
    if has_nested_key(pdf.Root, ['/Names', '/Dests']):
        names = get_names(pdf.Root.Names.Dests)
    else:
        names = None
    
    with pdf.open_outline() as outline:
        bookmarks = analyse_children(outline.root, names, pdf)
    
    return bookmarks

mara004 avatar Jun 12 '21 16:06 mara004

What's still missing in my above code is the level, but this could be added easily with the recursion depth of analyse_children()
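
A minimal sketch of that idea, using a small stand-in class instead of real outline items (the Item class, the flatten function and the sample titles are only for illustration): the current depth is passed down as an extra argument and increased by one on each recursive call.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """Stand-in for an outline item: a title plus nested children."""
    title: str
    children: list = field(default_factory=list)


def flatten(items, level=1):
    """Return (level, title) tuples in document order, depth first."""
    rows = []
    for item in items:
        rows.append((level, item.title))
        # children sit one level deeper than their parent
        rows.extend(flatten(item.children, level + 1))
    return rows


outline = [
    Item("Preface"),
    Item("1: Foundations", [Item("1.1 What Is Law?"), Item("1.2 Roman Law")]),
]
for level, title in flatten(outline):
    print("  " * (level - 1) + title)
```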

mara004 avatar Jul 25 '21 11:07 mara004

@mara004 When I try running your code I get an address boundary error (segmentation fault) while enumerating pdf.pages, when p.obj is accessed. There was an error on checking the value of p telling me to use `p.obj` instead. It worked for a few of my books, but my electronic copy of The C++ Programming Language, 4th Edition fails. Guess I'll look into PyMuPDF.

BioBox avatar Jun 22 '22 02:06 BioBox

@BioBox My above code is a bit outdated. Now that qpdf provides a function to resolve destinations to page numbers, I think it would be a lot easier if someone could just add bindings for that to pikepdf.

mara004 avatar Jun 22 '22 09:06 mara004

@mara004 I had already achieved what I was trying to do with MuPDF right before you made that post. But thanks anyway; I wouldn't have been able to do it without your help!

Here's what I was trying to do: https://gist.github.com/BioBox/bf7aa9279a16bd5b8c8d9335989e1324

BioBox avatar Jun 27 '22 23:06 BioBox