Read page numbers from outline "action"

Open oicerid opened this issue 3 years ago • 7 comments

I'm trying to figure out how to extract the page number of an OutlineItem when <Action> is returned, since this feature doesn't seem to be implemented yet (?).

Is there a workaround until it's implemented? Any idea of when that might be?

Is it possible to access the "raw" Pdf outline somehow to look for the /Page entry?

Take outlines.pdf as an example:

from pikepdf import Pdf
reader = Pdf.open("outlines.pdf")

with reader.open_outline() as outlines:
    for outline in outlines.root:
        print(outline)

returns:

[+] One -> <Action>
[ ] Two -> <Action>
[+] Three -> <Action>

oicerid avatar Feb 21 '22 20:02 oicerid

Well, each object that you get when iterating over the outline root is an OutlineItem, and you can directly access its action dictionary (item.action), which usually has a /D key containing a destination (direct or indirect). Assuming it is direct, you'll get an array consisting of a page object, a page location type, and between 0 and 4 coordinates. The page index can then be determined using pikepdf.Page(direct_dest[0]).index. (If it's an indirect destination, things become more complicated.)
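
To make the array layout described above concrete, here is a rough sketch in plain Python. It is not part of pikepdf: split_destination is a name made up for this illustration, and the example uses an ordinary list as a stand-in for a direct destination. It should also work on a pikepdf.Array, since those are iterable, but I have not checked every edge case.

```python
def split_destination(dest):
    """Split a direct destination into (page, location_type, coordinates)."""
    # First element: the page object; second: the location type such as /XYZ or /Fit;
    # everything after that: the coordinates (possibly none at all).
    page, location_type, *coords = dest
    if len(coords) > 4:
        raise ValueError("a direct destination carries at most 4 coordinates")
    return page, str(location_type), coords


# Plain list standing in for [page /XYZ left top zoom]; None models a PDF null.
page, location_type, coords = split_destination(["<page object>", "/XYZ", 0, 792, None])
print(location_type, coords)
```

Only the first element matters for finding the page number; the rest describes where on the page the viewer should land.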

I'm trying to figure out how to extract the page number of an OutlineItem when <Action> is returned, since this feature doesn't seem to be implemented yet (?).

I believe that libqpdf recently added QPDFOutlineObjectHelper::getDestPage() and some other useful methods related to parsing the PDF table of contents, but (as far as I can see) pikepdf doesn't have bindings for it yet. Technically, you could of course implement a bookmark page resolver manually using the means pikepdf currently provides, but depending on your needs this may be rather cumbersome.

(In the meantime, I can also suggest using pymupdf.Document.get_toc() if you don't mind the AGPL3.)

mara004 avatar Feb 22 '22 13:02 mara004

Thanks for the answer.

In the above example item.action returns NotImplementedError: don't know how to __str__ this object - so it doesn't seem to be possible to access anything in that case. Is that because it's an indirect destination?

I'm currently using PdfFileReader.getDestinationPageNumber() from PyPDF2 to get this info, but since it's not maintained I felt it was time to try and convert to something else.

Will have a look at qpdf and pymupdf to see if either may be an option.

oicerid avatar Feb 22 '22 23:02 oicerid

Yes, the outline code unfortunately doesn't handle actions at this time, only outline entries explicitly defined with page destinations. Actions can be a lot of things other than going to a page.
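
As a minimal sketch of that last point: before reading /D, it is worth checking that the action's /S entry really is /GoTo. The helper name goto_destination is hypothetical, and plain dicts stand in for pikepdf.Dictionary here. The code relies only on "in", [] with "/Key" strings, and str(), which I'd expect to behave the same on a pikepdf.Dictionary, but treat that as an assumption.

```python
def goto_destination(action):
    """Return the /D entry of a /GoTo action, or None for anything else."""
    if action is None or "/S" not in action:
        return None
    if str(action["/S"]) != "/GoTo":
        # Other action types (e.g. /URI, /Launch, /GoToR) do not point at a page
        # in this document.
        return None
    return action["/D"] if "/D" in action else None


# Plain dicts standing in for action dictionaries.
print(goto_destination({"/S": "/GoTo", "/D": "0"}))
print(goto_destination({"/S": "/URI", "/URI": "https://example.com"}))
```

Whatever comes back may still be a named destination rather than an array, so it needs a further lookup before a page can be read from it.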

jbarlow83 avatar Feb 23 '22 05:02 jbarlow83

In the above example item.action returns NotImplementedError: don't know how to __str__ this object - so it doesn't seem to be possible to access anything in that case. Is that because it's an indirect destination?

That the object doesn't implement __str__ does not mean it can't be accessed. If you wish to print the action, I think you need to use print(repr(item.action)). That said, it should be possible to work with the action as with any other PDF dictionary. For example, you could do something like this:

if '/D' in item.action:
    dest = item.action.D
    # assuming a direct destination
    assert isinstance(dest, pikepdf.Array)
    page_obj = dest[0]
    page_index = pikepdf.Page(page_obj).index
    print(page_index)

mara004 avatar Feb 23 '22 13:02 mara004

That said, it should be possible to work with the action as with any other PDF dictionary. For example, you could do something like this:

if '/D' in item.action:
    dest = item.action.D
    # assuming a direct destination
    assert isinstance(dest, pikepdf.Array)
    page_obj = dest[0]
    page_index = pikepdf.Page(page_obj).index
    print(page_index)

Thanks, I haven't really understood everything about how to work with pikepdf yet, but this is at least one step closer :)

After trying your code snippet, I can conclude that it's not a direct destination but an indirect one. When printing repr(item.action) I get:

pikepdf.Dictionary({
  "/D": "0",
  "/S": "/GoTo"
})

Is it possible to look up the "/D" value somewhere within pikepdf in this case? Cause I'm guessing it can be resolved to a "/Page" entry somewhere.

oicerid avatar Feb 23 '22 18:02 oicerid

Is it possible to look up the "/D" value somewhere within pikepdf in this case? Cause I'm guessing it can be resolved to a "/Page" entry somewhere.

Yes, it should be possible to resolve the indirect/named destination to a direct one. I suppose the document has a name tree at pdf.Root.Names.Dests which can basically be used like a dictionary to map from indirect to direct destinations, thanks to the NameTree support model of pikepdf/qpdf:

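named_dest = item.action.D
# a named destination is stored as a PDF string (here "0"), not a dictionary
assert isinstance(named_dest, pikepdf.String)
name_mapping = pikepdf.NameTree(pdf.Root.Names.Dests)
# look the name up by its plain string value
direct_dest = name_mapping[str(named_dest)]
# assuming the name tree stores the destination array itself
page = pikepdf.Page(direct_dest[0])
print(page.index)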

mara004 avatar Feb 24 '22 11:02 mara004

Thank you so much for your help! Will test your code when I get a chance; it seems to support indirect destinations? If it doesn't work, I at least now know where to look :)

oicerid avatar Feb 26 '22 15:02 oicerid