Extracted text does not match the text of the PDF
reader.pages.at(3).text produces this output:
• FAX/Scanner/Copiers • 2 Digital Cameras • 1 Cisco Router • Hub
However, the text shown when the PDF is rendered is:
4 FAX/Scanner/Copiers 2 Digital Cameras 1 Cisco Router 1 Hub
As you can see, the numbers for two of the list items are missing: the 4 before FAX/Scanner/Copiers and the 1 before Hub.
It appears I cannot attach the PDF file, but the raw content for this page is:
/C2_0 1 Tf 0 Tc 0 Tw 12 0 0 12 97.2 186.9 Tm <0078>Tj /TT2 1 Tf -0.0004 Tc 0.0026 Tw 0.46 0 Td [( )-760(2 Poly Com systems )]TJ ET EMC
/P <</MCID 29 >>BDC BT /C2_0 1 Tf 0 Tc 0 Tw 12 0 0 12 97.2 172.26 Tm <0078>Tj /TT2 1 Tf -0.0002 Tc 0.7624 Tw 0.46 0 Td [( 4 )760(FAX/Scanner/Copiers )]TJ ET EMC
/P <</MCID 30 >>BDC BT /C2_0 1 Tf 0 Tc 0 Tw 12 0 0 12 97.2 157.68 Tm <0078>Tj /TT2 1 Tf -0.0002 Tc 0.0024 Tw 0.46 0 Td [( )-760(2 Digita)-4(l)2( Cameras )]TJ ET EMC
/P <</MCID 31 >>BDC BT /C2_0 1 Tf 0 Tc 0 Tw 12 0 0 12 97.2 143.04 Tm <0078>Tj /TT2 1 Tf -0.0002 Tc 0.0024 Tw 0.46 0 Td [( )-760(1 Cisco Router )]TJ ET EMC
/P <</MCID 32 >>BDC BT /C2_0 1 Tf 0 Tc 0 Tw 12 0 0 12 97.2 128.46 Tm <0078>Tj /TT2 1 Tf -0.0014 Tc 0.7636 Tw 0.46 0 Td [( 1 )760(Hub )]TJ ET EMC
/P <</MCID 33 >>BDC BT /C2_0 1 Tf 0 Tc 0 Tw 12 0 0 12 97.2 113.82 Tm <0078>Tj /TT2 1 Tf -0.0004 Tc 0.0026 Tw 0.46 0 Td [( )-760(6 NEC projectors mounted on portable carts )]TJ ET EMC
Did you find a solution for this? I believe I'm facing a similar issue.
I suspect this is an issue with our text layout algorithms in the PageLayout class.
Unfortunately I'm short on time at the moment, but I'll happily accept patches if you want to investigate further.
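For anyone picking this up, one detail in the raw content above may be a useful starting point. The two items that lose their number are the only ones whose TJ array contains a positive adjustment (760). In a TJ array the number is subtracted from the current horizontal position, so a positive value moves the next string back toward the left. The other items use -760, which moves it forward. Whether this is what actually trips up the layout code is only a guess and has not been confirmed. The sketch below is a standalone script that does not call pdf-reader at all. `tj_adjustments` and `BACKWARD_THRESHOLD` are hypothetical names made up for illustration, the threshold of 500 is arbitrary, and the regular expressions only cover simple content like the sample here.

```ruby
# Hypothetical helper, not part of pdf-reader: lists each TJ array in a
# content-stream fragment together with its numeric adjustments.
# Simplified parsing: assumes no nested parentheses and no "[" or "]"
# inside the strings, which holds for the sample in this issue.
STRING_RE = /\(((?:\\.|[^\\()])*)\)/

# Arbitrary cut-off (an assumption) for "large enough to matter", in
# thousandths of a text-space unit.
BACKWARD_THRESHOLD = 500

def tj_adjustments(content)
  content.scan(/\[(.*?)\]\s*TJ/m).flatten.map do |body|
    strings = body.scan(STRING_RE).flatten
    # Blank out the strings so only the adjustment numbers remain.
    numbers = body.gsub(STRING_RE, " ").scan(/-?\d*\.?\d+/).map(&:to_f)
    {
      text: strings.join,
      adjustments: numbers,
      # Positive numbers move the following string backward (to the left).
      max_backward: numbers.select(&:positive?).max || 0
    }
  end
end

# TJ arrays copied from the raw content posted in this issue.
sample = <<~'PDF'
  [( 4 )760(FAX/Scanner/Copiers )]TJ
  [( )-760(2 Digita)-4(l)2( Cameras )]TJ
  [( )-760(1 Cisco Router )]TJ
  [( 1 )760(Hub )]TJ
PDF

tj_adjustments(sample).each do |entry|
  flag = entry[:max_backward] >= BACKWARD_THRESHOLD ? "large backward shift" : "ok"
  puts "#{entry[:text].inspect} adjustments=#{entry[:adjustments].inspect} #{flag}"
end
```

If the flagged entries line up with the items that lose their number in other affected PDFs too, that would point toward how backward-positioned runs are merged during layout, and a reduced test case could be built from a single such TJ array.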