
Letters are parsed separately

Open timlac opened this issue 4 years ago • 1 comment

When parsing the attached PDF file, each letter is parsed as a separate entity: every character ends up in its own TextLine.

<TextBlock ID="p1_b3" HPOS="39.3769" VPOS="39.7526" HEIGHT="468.203" WIDTH="16.0604">
  <TextLine WIDTH="12.3016" HEIGHT="67.7031" ID="p1_t3" HPOS="39.3769" VPOS="39.7526">
    <String ID="p1_w5" CONTENT="K" HPOS="39.3769" VPOS="39.7526" WIDTH="12.3016" HEIGHT="67.7031" STYLEREFS="font1"/>
  </TextLine>
  <TextLine WIDTH="10.6126" HEIGHT="67.7031" ID="p1_t4" HPOS="51.3563" VPOS="39.7526">
    <String ID="p1_w6" CONTENT="o" HPOS="51.3563" VPOS="39.7526" WIDTH="10.6126" HEIGHT="67.7031" STYLEREFS="font1"/>
  </TextLine>
  <TextLine WIDTH="16.0604" HEIGHT="67.7031" ID="p1_t5" HPOS="62.2325" VPOS="39.7526">
    <String ID="p1_w7" CONTENT="m" HPOS="62.2325" VPOS="39.7526" WIDTH="16.0604" HEIGHT="67.7031" STYLEREFS="font1"/>
  </TextLine>
  <TextLine WIDTH="16.0604" HEIGHT="67.7031" ID="p1_t6" HPOS="78.5565" VPOS="39.7526">
    <String ID="p1_w8" CONTENT="m" HPOS="78.5565" VPOS="39.7526" WIDTH="16.0604" HEIGHT="67.7031" STYLEREFS="font1"/>
  </TextLine>
  <TextLine WIDTH="10.5345" HEIGHT="67.7031" ID="p1_t7" HPOS="94.8806" VPOS="39.7526">
  </TextLine>
  <TextLine WIDTH="10.5345" HEIGHT="67.7031" ID="p1_t8" HPOS="105.678" VPOS="39.7526">
    <String ID="p1_w10" CONTENT="n" HPOS="105.678" VPOS="39.7526" WIDTH="10.5345" HEIGHT="67.7031" STYLEREFS="font1"/>
  </TextLine>
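
For anyone who wants to check their own output for the same symptom, here is a minimal sketch, not part of pdfalto, that flags every TextLine holding exactly one single-character String. It uses only Python's standard `xml.etree.ElementTree`. The function names `local_name` and `single_letter_lines` and the `SAMPLE` fragment are made up for illustration; the sample mirrors the element names and IDs in the excerpt above, with one hypothetical "normal" line added for contrast.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix so the check works with or without xmlns."""
    return tag.rsplit("}", 1)[-1]


def single_letter_lines(alto_xml: str) -> list:
    """Return the IDs of TextLine elements whose only String is one character."""
    root = ET.fromstring(alto_xml)
    flagged = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        strings = [c for c in elem if local_name(c.tag) == "String"]
        if len(strings) == 1 and len(strings[0].get("CONTENT", "")) == 1:
            flagged.append(elem.get("ID"))
    return flagged


# Trimmed-down fragment modelled on the excerpt above (geometry attributes omitted).
# The last TextLine ("x_t1") is hypothetical and only shows a line that is NOT flagged.
SAMPLE = """<TextBlock ID="p1_b3">
  <TextLine ID="p1_t3"><String ID="p1_w5" CONTENT="K"/></TextLine>
  <TextLine ID="p1_t4"><String ID="p1_w6" CONTENT="o"/></TextLine>
  <TextLine ID="p1_t7"></TextLine>
  <TextLine ID="x_t1"><String ID="x_w1" CONTENT="example"/></TextLine>
</TextBlock>"""

if __name__ == "__main__":
    print(single_letter_lines(SAMPLE))  # ['p1_t3', 'p1_t4']
```

A long run of flagged IDs sharing the same VPOS, as in the excerpt, suggests that characters on one visual line were not merged into words. Empty TextLine elements such as p1_t7 are deliberately not counted here.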

Note that the file has watermark-style text plastered across it. Removing this text with Apache PDFBox does not change the outcome.

kommers_annons_elite_original.pdf noBigText-kommers_annons_elite.pdf

timlac • Apr 21 '20 10:04

Hello @timlac! Thank you for the issue. This should be fixed by PR #116: I no longer see any separated letters in your error case.

kermitt2 • Apr 04 '21 20:04