convenience methods - eg. extract named destinations

Open bwl21 opened this issue 5 years ago • 7 comments

I would like to extract a list of all named destinations in a PDF file, so that I can navigate directly to a given position when opening the file in a PDF viewer (e.g. HREF="http://www.example.com/myfile.pdf#glossary").

pypdf has a convenient method for this (https://unix.stackexchange.com/questions/246622/list-named-destinations-in-a-pdf).

Could this be added to pdf-reader as well?

bwl21 avatar Jan 29 '20 13:01 bwl21

I'd be more than happy to see a convenience method for named destinations added.

I probably don't have time to add it myself, but I'm happy to review a PR.

yob avatar Jan 30 '20 11:01 yob

If you could give me a hint on how to access the named destinations, I will propose a convenience method.

So far I have not found out how to get at them.

bwl21 avatar Jan 30 '20 11:01 bwl21

Thanks for offering to contribute!

The implementation in pypdf shows some helpful clues: https://github.com/mstamy2/PyPDF2/blob/18a2627adac13124d4122c8b92aaa863ccfb8c29/PyPDF2/pdf.py#L1350-L1389

By coincidence, this spec file in the pdf-reader repo has some named destinations: spec/data/pdflatex.pdf.

This code fragment demonstrates the general approach to replicating the pypdf code in pdf-reader:

diff --git a/lib/pdf/reader.rb b/lib/pdf/reader.rb
index 0ac514b..6d7e830 100644
--- a/lib/pdf/reader.rb
+++ b/lib/pdf/reader.rb
@@ -206,6 +206,17 @@ module PDF
       PDF::Reader::Page.new(@objects, num, :cache => @cache)
     end
 
+    def named_destinations
+      names = root[:Names]
+      return {} if names.nil?
+
+      dests = @objects.deref(names)[:Dests]
+      return {} if dests.nil?
+
+      @objects.deref(dests)
+    end
+
+
     private
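One caveat about the fragment above: in PDF 1.2 and later, the /Dests entry under /Names is a name tree rather than a flat dictionary, so the dereferenced value may hold its entries in /Names arrays (alternating key and value) spread across nested /Kids nodes. The following is a minimal sketch of flattening such a tree into a Hash. It works on plain Ruby hashes so it can run on its own; `flatten_name_tree` and its `deref` parameter are hypothetical names, not part of pdf-reader. Inside pdf-reader, passing `@objects.method(:deref)` as the resolver is an assumption about how indirect references would be followed, and is untested here.

```ruby
# Flatten a PDF name tree into a single Hash of name => value.
#
# node  - a Hash with an optional :Names array ([key, value, key, value, ...])
#         and an optional :Kids array of child nodes.
# deref - a callable that resolves indirect references. The default is the
#         identity function, which suits already-resolved plain hashes.
#
# Note: this sketch does not guard against cyclic :Kids in malformed files.
def flatten_name_tree(node, deref = ->(obj) { obj }, result = {})
  node = deref.call(node)
  return result unless node.is_a?(Hash)

  # Leaf entries: keys and values alternate in one flat array.
  names = deref.call(node[:Names])
  if names.is_a?(Array)
    names.each_slice(2) do |key, value|
      result[key] = deref.call(value)
    end
  end

  # Intermediate nodes: recurse into each child.
  kids = deref.call(node[:Kids])
  if kids.is_a?(Array)
    kids.each { |kid| flatten_name_tree(kid, deref, result) }
  end

  result
end

# Usage with a hand-built tree standing in for a real document.
tree = {
  Kids: [
    { Names: ["chapter.1", [1, :Fit], "glossary", [9, :XYZ, 72, 720, nil]] },
    { Kids: [{ Names: ["section.2.1", [4, :FitH, 500]] }] }
  ]
}
destinations = flatten_name_tree(tree)
puts destinations.keys.sort.inspect
```

Older documents may instead store destinations in a plain /Dests dictionary directly on the catalog, so a complete method would probably need to check both places.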

In terms of specs, I'd love to see a single new spec in spec/integration_spec.rb that confirms the full output of the method for spec/data/pdflatex.pdf. Maybe something roughly like this:

$ git diff spec/integration_spec.rb
diff --git a/spec/integration_spec.rb b/spec/integration_spec.rb
index 446373e..8ee51f1 100644
--- a/spec/integration_spec.rb
+++ b/spec/integration_spec.rb
@@ -1168,4 +1168,16 @@ describe PDF::Reader, "integration specs" do
       end
     end
   end
+
+  context "extracts named destinations" do
+    let(:filename) { pdf_spec_file("pdflatex") }
+
+    it "extracts text correctly" do
+      PDF::Reader.open(filename) do |reader|
+        expect(page.named_destinations).to eq({
+          :Foo => "Bar"
+        })
+      end
+    end
+  end
 end

yob avatar Jan 30 '20 11:01 yob

fine, I will do my best ...

bwl21 avatar Jan 30 '20 11:01 bwl21

Hi, I have started to implement this, even if I don't exactly know what I am doing :-) I more or less ported the pypdf method. I have three questions:

  1. the pypdf method retrieves all named destinations. So shouldn't named_destinations be a method of Reader?

  2. I could not find out how I can get the text representing the destination. In pdf-reader there is no equivalent to the class Destination available in pypdf. So I do not really know what to return.

  3. pdflatex.pdf holds about 90 destinations. Wouldn't it be sufficient to expect the count of destinations and the details of one particular entry?

bwl21 avatar Jan 31 '20 08:01 bwl21

> I started to implement this

great!

> the pypdf method retrieves all named destinations. So shouldn't named_destinations be a method of Reader?

Yes. I'm not fully across named destinations, but my understanding is they're a document-level concept and not page level, so I think the method should go on the PDF::Reader class.

> I could not find out how I can get the text representing the destination. In pdf-reader there is no equivalent to the class Destination available in pypdf. So I do not really know what to return.

Hmm. I'm not familiar enough with destinations to know the answer off the top of my head. I'd suggest opening a PR with as much as you can get, and then hopefully I can help fill in the gaps.
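For reference, the PDF specification describes the value stored under a destination name as either an explicit destination array (a page, a fit type such as /XYZ or /Fit, then type-specific numbers) or a dictionary whose /D entry holds such an array. A minimal sketch of normalising both shapes into one value object follows. The `Destination` struct and `parse_destination` are hypothetical names for illustration and do not exist in pdf-reader; in a real document the page element would presumably be an indirect reference to a page object rather than the plain integer used here.

```ruby
# Hypothetical value object for an explicit destination.
# page - the target page (an indirect reference in a real PDF)
# fit  - the fit type, e.g. :XYZ, :Fit, :FitH
# args - any remaining numbers (left/top/zoom and so on), possibly empty
Destination = Struct.new(:page, :fit, :args)

# Normalise the raw value found under a destination name.
# Returns nil when the value has neither of the two expected shapes.
def parse_destination(value)
  # Dictionary form: the explicit destination lives under :D.
  value = value[:D] if value.is_a?(Hash)
  return nil unless value.is_a?(Array) && value.size >= 2

  page, fit, *args = value
  Destination.new(page, fit, args)
end

# Usage with hand-written values in place of ones read from a file.
array_form = parse_destination([7, :XYZ, 72, 720, nil])
dict_form  = parse_destination({ D: [3, :Fit] })
puts array_form.fit
puts dict_form.page
```

Returning a small struct like this is only one option; handing back the raw array would keep the method closer to the rest of pdf-reader's low-level API.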

> pdflatex.pdf holds about 90 destinations. Wouldn't it be sufficient to expect the count of destinations and the details of one particular entry?

Yes, your suggestion sounds fine.

yob avatar Jan 31 '20 11:01 yob

> Yes. I'm not fully across named destinations, but my understanding is they're a document-level concept and not page level, so I think the method should go on the PDF::Reader class.

I have added the method to both classes. I have also opened the PR.

bwl21 avatar Jan 31 '20 11:01 bwl21