Getting Hyperlink Text
Hi! I'm having trouble getting the text associated with a hyperlink in my PDF. By that, I mean that I have some text in my PDF, say SampleHyperlinkHere, that opens another PDF when clicked. I'm able to get the PDF attached to the hyperlink using this script https://gist.github.com/danlucraft/5277732#gistcomment-2675302, but I want to be able to match each attachment to the text it came from.
For example I have this page with 16 Annots:
irb(main):475:0> annots = page.attributes[:Annots]
=> #<PDF::Reader::Reference:0x00007fed3d4f1780 @id=59, @gen=0>
irb(main):476:0> objects = page.objects[annots]
=> [#<PDF::Reader::Reference:0x00007fed3d421350 @id=60, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d420e78 @id=61, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d420cc0 @id=62, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d420b08 @id=63, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d420950 @id=64, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d420478 @id=65, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d4202c0 @id=66, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d420108 @id=67, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d42bbe8 @id=68, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d42ba30 @id=69, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d42b878 @id=70, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d42b3a0 @id=71, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d42b1e8 @id=72, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d42b030 @id=73, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d42ae78 @id=74, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d42a978 @id=75, @gen=0>]
irb(main):477:0> objects.count
=> 16
I notice that 8 of those are links (as expected), and I'm able to grab the attachment from the link for those 8 just fine.
irb(main):478:0> objects.each do |o|
irb(main):479:1* puts page.objects[o]
irb(main):480:1> end
{:A=>#<PDF::Reader::Reference:0x00007fed3d532c08 @id=391, @gen=0>, :Border=>[0, 0, 0], :Type=>true, :Subtype=>:Link, :Rect=>[296.7999, 702.28, 357.6099, 713.28]}
{:Type=>false, :Subtype=>:Line, :BS=>#<PDF::Reader::Reference:0x00007fed3d53b560 @id=384, @gen=0>, :Rect=>[295.2999, 700.78, 359.1099, 703.78], :AP=>#<PDF::Reader::Reference:0x00007fed3d53a570 @id=385, @gen=0>, :"AAPL:AKExtras"=>#<PDF::Reader::Reference:0x00007fed3d53a228 @id=386, @gen=0>, :C=>[0, 0, 1], :IC=>[0, 0, 1], :F=>4, :L=>[296.7999, 702.28, 357.6099, 702.28]}
{:Type=>false, :Subtype=>:Line, :BS=>#<PDF::Reader::Reference:0x00007fed3d542c98 @id=377, @gen=0>, :Rect=>[238.0749, 688.65, 301.845, 691.65], :AP=>#<PDF::Reader::Reference:0x00007fed3d541d98 @id=378, @gen=0>, :"AAPL:AKExtras"=>#<PDF::Reader::Reference:0x00007fed3d541b40 @id=379, @gen=0>, :C=>[0, 0, 1], :IC=>[0, 0, 1], :F=>4, :L=>[239.5749, 690.15, 300.345, 690.15]}
{:Type=>false, :Subtype=>:Line, :BS=>#<PDF::Reader::Reference:0x00007fed3d54aa88 @id=370, @gen=0>, :Rect=>[88.525, 584.9, 152.3049, 587.9], :AP=>#<PDF::Reader::Reference:0x00007fed3d549c28 @id=371, @gen=0>, :"AAPL:AKExtras"=>#<PDF::Reader::Reference:0x00007fed3d5499d0 @id=372, @gen=0>, :C=>[0, 0, 1], :IC=>[0, 0, 1], :F=>4, :L=>[90.025, 586.4, 150.8049, 586.4]}
{:Type=>false, :Subtype=>:Line, :BS=>#<PDF::Reader::Reference:0x00007fed3d552828 @id=363, @gen=0>, :Rect=>[135.3999, 597.03, 199.21, 600.03], :AP=>#<PDF::Reader::Reference:0x00007fed3d551978 @id=364, @gen=0>, :"AAPL:AKExtras"=>#<PDF::Reader::Reference:0x00007fed3d5516f8 @id=365, @gen=0>, :C=>[0, 0, 1], :IC=>[0, 0, 1], :F=>4, :L=>[136.8999, 598.53, 197.71, 598.53]}
{:Type=>false, :Subtype=>:Line, :BS=>#<PDF::Reader::Reference:0x00007fed3d55a5a0 @id=356, @gen=0>, :Rect=>[325.8299, 517.38, 389.98, 520.38], :AP=>#<PDF::Reader::Reference:0x00007fed3d5596f0 @id=357, @gen=0>, :"AAPL:AKExtras"=>#<PDF::Reader::Reference:0x00007fed3d559498 @id=358, @gen=0>, :C=>[0, 0, 1], :IC=>[0, 0, 1], :F=>4, :L=>[327.3299, 518.88, 388.48, 518.88]}
{:Type=>false, :Subtype=>:Line, :BS=>#<PDF::Reader::Reference:0x00007fed3d5623e0 @id=349, @gen=0>, :Rect=>[416.07, 461.76, 479.85, 464.76], :AP=>#<PDF::Reader::Reference:0x00007fed3d561558 @id=350, @gen=0>, :"AAPL:AKExtras"=>#<PDF::Reader::Reference:0x00007fed3d561300 @id=351, @gen=0>, :C=>[0, 0, 1], :IC=>[0, 0, 1], :F=>4, :L=>[417.57, 463.26, 478.35, 463.26]}
{:Type=>false, :Subtype=>:Line, :BS=>#<PDF::Reader::Reference:0x00007fed3d56a220 @id=342, @gen=0>, :Rect=>[153.287, 386.3599, 217.0671, 389.3599], :AP=>#<PDF::Reader::Reference:0x00007fed3d569398 @id=343, @gen=0>, :"AAPL:AKExtras"=>#<PDF::Reader::Reference:0x00007fed3d569140 @id=344, @gen=0>, :C=>[0, 0, 1], :IC=>[0, 0, 1], :F=>4, :L=>[154.787, 387.8599, 215.5671, 387.8599]}
{:Type=>false, :Subtype=>:Line, :BS=>#<PDF::Reader::Reference:0x00007fed3d572010 @id=335, @gen=0>, :Rect=>[313.82, 330.85, 377.6, 333.85], :AP=>#<PDF::Reader::Reference:0x00007fed3d571160 @id=336, @gen=0>, :"AAPL:AKExtras"=>#<PDF::Reader::Reference:0x00007fed3d570f08 @id=337, @gen=0>, :C=>[0, 0, 1], :IC=>[0, 0, 1], :F=>4, :L=>[315.32, 332.35, 376.1, 332.35]}
{:A=>#<PDF::Reader::Reference:0x00007fed3d57a648 @id=334, @gen=0>, :Border=>[0, 0, 0], :Type=>true, :Subtype=>:Link, :Rect=>[239.5749, 690.15, 300.345, 701.15]}
{:A=>#<PDF::Reader::Reference:0x00007fed3d583d38 @id=333, @gen=0>, :Border=>[0, 0, 0], :Type=>true, :Subtype=>:Link, :Rect=>[90.025, 586.4, 150.8049, 597.4]}
{:A=>#<PDF::Reader::Reference:0x00007fed3d5814c0 @id=332, @gen=0>, :Border=>[0, 0, 0], :Type=>true, :Subtype=>:Link, :Rect=>[136.8999, 598.53, 197.71, 609.53]}
{:A=>#<PDF::Reader::Reference:0x00007fed3d58ab88 @id=331, @gen=0>, :Border=>[0, 0, 0], :Type=>true, :Subtype=>:Link, :Rect=>[327.3299, 518.88, 388.48, 529.88]}
{:A=>#<PDF::Reader::Reference:0x00007fed3d588310 @id=330, @gen=0>, :Border=>[0, 0, 0], :Type=>true, :Subtype=>:Link, :Rect=>[417.57, 463.26, 478.35, 474.26]}
{:A=>#<PDF::Reader::Reference:0x00007fed3d591a00 @id=329, @gen=0>, :Border=>[0, 0, 0], :Type=>true, :Subtype=>:Link, :Rect=>[191.9299, 387.8599, 252.71, 398.8599]}
{:A=>#<PDF::Reader::Reference:0x00007fed3d59b140 @id=328, @gen=0>, :Border=>[0, 0, 0], :Type=>true, :Subtype=>:Link, :Rect=>[315.32, 332.35, 376.1, 343.35]}
Is there a way to use the other 8 annotations to get the text associated with the hyperlinks, or another way that I'm missing? Appreciate the help!
I'm not super familiar with the annotation options. However, my guess is the 8 Line annotations won't have any text associated with them. I also suspect that the text for the hyperlink is just part of the standard content stream of the page, and the Link annotations define an invisible annotation that sits on top of the text to handle clicks.
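One way to sanity-check that guess against the output in the question: the :L endpoints of the first Line annotation appear to trace the bottom edge of the first Link's :Rect, which is what an underline would do. Here's a minimal sketch using plain hashes with values copied from that output. The helper name underline_of? and the tolerance are made up for illustration, and it assumes :Rect is ordered [left, bottom, right, top], as it is in the output above.

```ruby
# Values copied from the first Link and first Line annotation in the question.
link = { Subtype: :Link, Rect: [296.7999, 702.28, 357.6099, 713.28] }
line = { Subtype: :Line, L: [296.7999, 702.28, 357.6099, 702.28] }

# Hypothetical helper: true when the line segment runs along the bottom edge
# of the link rectangle, within a small tolerance (in PDF points).
# Assumes Rect is ordered [left, bottom, right, top].
def underline_of?(line, link, tolerance = 0.5)
  x1, y1, x2, y2 = line[:L]
  left, bottom, right, _top = link[:Rect]
  [[x1, left], [x2, right], [y1, bottom], [y2, bottom]].all? do |a, b|
    (a - b).abs <= tolerance
  end
end

puts underline_of?(line, link)
```

If that holds for the other pairs too, the Line annotations are probably just the visible underlines and carry no text of their own.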
In theory it'd be possible to grab the Rect attribute from the 8 Link annotations, and then fetch only the text from the page that sits within those boundaries. Unfortunately pdf-reader doesn't offer a nice API to do that. You'd have to create a customised version of PDF::Reader::PageTextReceiver.
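The filtering step could look something like the sketch below. It assumes a customised receiver has already collected each piece of text together with its position. The TextRun struct and the text_in_rect helper are hypothetical stand-ins, not part of pdf-reader's API, and the sample runs are invented for illustration. Only the rectangle comes from the first Link annotation in the question.

```ruby
# Hypothetical stand-in for whatever a customised receiver would collect:
# a piece of text plus the x/y position (in PDF points) where it starts.
TextRun = Struct.new(:x, :y, :text)

# Returns the text of every run whose origin falls inside rect,
# where rect is [left, bottom, right, top] as in a Link annotation's :Rect.
def text_in_rect(runs, rect, tolerance = 1.0)
  left, bottom, right, top = rect
  inside = runs.select do |run|
    run.x.between?(left - tolerance, right + tolerance) &&
      run.y.between?(bottom - tolerance, top + tolerance)
  end
  # Read top to bottom (PDF y grows upwards), then left to right.
  inside.sort_by { |run| [-run.y, run.x] }.map(&:text).join
end

# Made-up runs; only the rectangle is taken from the question.
runs = [
  TextRun.new(250.0, 704.0, "See "),
  TextRun.new(297.0, 704.0, "SampleHyperlinkHere"),
  TextRun.new(297.0, 650.0, "unrelated text")
]
link_rect = [296.7999, 702.28, 357.6099, 713.28]

puts text_in_rect(runs, link_rect)
```

The tolerance is there because a run's origin can sit slightly outside the annotation rectangle in a real PDF; the right value would need tuning against actual documents.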