
extracted text does not match text of pdf

Open pblesi opened this issue 11 years ago • 2 comments

reader.pages.at(3).text produces this output:

• FAX/Scanner/Copiers • 2 Digital Cameras • 1 Cisco Router • Hub

However, the text shown when the PDF is rendered is:

4 FAX/Scanner/Copiers 2 Digital Cameras 1 Cisco Router 1 Hub

As you can see, the numbers for two of the list items are missing from the extracted text: the 4 before FAX/Scanner/Copiers and the 1 before Hub.

It appears I cannot include the pdf file, but the raw content for this page is:

/C2_0 1 Tf 0 Tc 0 Tw 12 0 0 12 97.2 186.9 Tm <0078>Tj /TT2 1 Tf -0.0004 Tc 0.0026 Tw 0.46 0 Td [( )-760(2 Poly Com systems )]TJ ET EMC
/P <</MCID 29 >>BDC BT /C2_0 1 Tf 0 Tc 0 Tw 12 0 0 12 97.2 172.26 Tm <0078>Tj /TT2 1 Tf -0.0002 Tc 0.7624 Tw 0.46 0 Td [( 4 )760(FAX/Scanner/Copiers )]TJ ET EMC
/P <</MCID 30 >>BDC BT /C2_0 1 Tf 0 Tc 0 Tw 12 0 0 12 97.2 157.68 Tm <0078>Tj /TT2 1 Tf -0.0002 Tc 0.0024 Tw 0.46 0 Td [( )-760(2 Digita)-4(l)2( Cameras )]TJ ET EMC
/P <</MCID 31 >>BDC BT /C2_0 1 Tf 0 Tc 0 Tw 12 0 0 12 97.2 143.04 Tm <0078>Tj /TT2 1 Tf -0.0002 Tc 0.0024 Tw 0.46 0 Td [( )-760(1 Cisco Router )]TJ ET EMC
/P <</MCID 32 >>BDC BT /C2_0 1 Tf 0 Tc 0 Tw 12 0 0 12 97.2 128.46 Tm <0078>Tj /TT2 1 Tf -0.0014 Tc 0.7636 Tw 0.46 0 Td [( 1 )760(Hub )]TJ ET EMC
/P <</MCID 33 >>BDC BT /C2_0 1 Tf 0 Tc 0 Tw 12 0 0 12 97.2 113.82 Tm <0078>Tj /TT2 1 Tf -0.0004 Tc 0.0026 Tw 0.46 0 Td [( )-760(6 NEC projectors mounted on portable carts )]TJ ET EMC
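To confirm that the missing numbers really are in the content stream, here is a rough sketch that pulls the string operands out of each TJ array and ignores the kerning numbers between them. The helper name `tj_strings` is made up for illustration and is not part of pdf-reader; it uses only the Ruby standard library and does not try to handle escaped or nested parentheses or hex strings.

```ruby
# Collect the string operands of every TJ array in a raw content stream,
# dropping the numeric kerning adjustments between them.
# Simplification: no support for escaped/nested parentheses or hex strings.
def tj_strings(raw)
  raw.scan(/\[(.*?)\]\s*TJ/m).flatten.map do |array_body|
    array_body.scan(/\((.*?)\)/).flatten.join
  end
end

# Two TJ operators copied from the raw content above.
SAMPLE = <<~'STREAM'
  0.46 0 Td [( 4 )760(FAX/Scanner/Copiers )]TJ ET EMC
  0.46 0 Td [( )-760(2 Digita)-4(l)2( Cameras )]TJ ET EMC
STREAM

tj_strings(SAMPLE).each { |s| puts s.strip }
# prints:
#   4 FAX/Scanner/Copiers
#   2 Digital Cameras
```

So the "4" is present in the stream for the FAX line. The visible difference between the two lines is how the gap around the number is encoded: `( 4 )760(...)` with a large `Tw` for the items that lose their number, versus `( )-760(2 ...)` with a small `Tw` for the items that keep it.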

Jan 07 '14 18:01 pblesi

Did you find a solution for this? I believe I'm facing a similar issue.

Jul 21 '15 13:07 aarmora

I suspect this is an issue with our text layout algorithms in the PageLayout class.

Unfortunately I'm short on time at the moment, but I'll happily accept patches if you want to investigate further.
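As a possible starting point for anyone investigating, the sketch below applies the PDF text-positioning rule (glyph advance = width + Tc, plus Tw for a single-byte space, minus any TJ adjustment divided by 1000) to the `Tc`, `Tw` and TJ values from the raw content in the original report. It assumes TT2 is a simple font so that `Tw` applies to the space character, and the space width of 0.278 is a placeholder, since the real glyph width is not given in this issue. `number_offset` is a hypothetical helper, not pdf-reader code.

```ruby
# Offset (in unscaled text space, font size 1) at which the list number
# starts, measured from the start of the TJ array.
# The number is preceded by one space glyph, optionally followed by a
# TJ adjustment; a TJ number n moves the next glyph by -n/1000.
def number_offset(space_width:, tc:, tw:, adjust:)
  space_width + tc + tw - adjust / 1000.0
end

SPACE_WIDTH = 0.278 # assumed width; the result below holds for any value

# "[( 4 )760(FAX/...": number follows the space directly, spacing comes from Tw.
dropped = number_offset(space_width: SPACE_WIDTH, tc: -0.0002, tw: 0.7624, adjust: 0)
# "[( )-760(2 Digita...": number follows a -760 adjustment, Tw is small.
kept = number_offset(space_width: SPACE_WIDTH, tc: -0.0002, tw: 0.0024, adjust: -760)

puts format("dropped: %.4f  kept: %.4f", dropped, kept)
# prints "dropped: 1.0402  kept: 1.0402"
```

Under these assumptions both encodings put the number at the same horizontal position, so the rendered page looks identical. The lines that lose their number are the ones that rely on word spacing and then step back with a positive 760 adjustment, which means the following text starts inside the advance of the trailing space. That does not show where the layout code goes wrong, but word-spacing handling and overlapping runs look like reasonable places to check first.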

Feb 14 '17 14:02 yob