
pdf2txt.py creates duplicate of characters

Open yashodhan19 opened this issue 7 years ago • 3 comments

pdf2txt.py -t xml "path/to/input.pdf" > "/path/to/output.xml"

The XML output contains duplicated values. The duplicates are usually in the same text box, though not always, and they have exactly the same coordinates. I tested the same file with other parsers, and their output did not contain any duplicates.

Here is an XML snippet from the output that contains the duplicate:

<textbox id="1" bbox="248.401,733.724,278.558,743.773">
<textline bbox="248.401,733.724,278.558,743.773">
<text font="PYBHHG+Helvetica-Bold" bbox="248.401,733.724,252.661,743.773" size="10.048">C</text>
<text font="PYBHHG+Helvetica-Bold" bbox="252.661,733.724,257.251,743.773" size="10.048">O</text>
<text font="PYBHHG+Helvetica-Bold" bbox="257.251,733.724,262.166,743.773" size="10.048">M</text>
<text font="PYBHHG+Helvetica-Bold" bbox="262.166,733.724,266.102,743.773" size="10.048">P</text>
<text font="PYBHHG+Helvetica-Bold" bbox="266.102,733.724,270.362,743.773" size="10.048">A</text>
<text font="PYBHHG+Helvetica-Bold" bbox="270.362,733.724,274.622,743.773" size="10.048">N</text>
<text font="PYBHHG+Helvetica-Bold" bbox="274.622,733.724,278.558,743.773" size="10.048">Y</text>
<text> </text>
</textline>
<textline bbox="248.401,733.724,278.558,743.773">
<text font="PYBHHG+Helvetica-Bold" bbox="248.401,733.724,252.661,743.773" size="10.048">C</text>
<text font="PYBHHG+Helvetica-Bold" bbox="252.661,733.724,257.251,743.773" size="10.048">O</text>
<text font="PYBHHG+Helvetica-Bold" bbox="257.251,733.724,262.166,743.773" size="10.048">M</text>
<text font="PYBHHG+Helvetica-Bold" bbox="262.166,733.724,266.102,743.773" size="10.048">P</text>
<text font="PYBHHG+Helvetica-Bold" bbox="266.102,733.724,270.362,743.773" size="10.048">A</text>
<text font="PYBHHG+Helvetica-Bold" bbox="270.362,733.724,274.622,743.773" size="10.048">N</text>
<text font="PYBHHG+Helvetica-Bold" bbox="274.622,733.724,278.558,743.773" size="10.048">Y</text>
<text> </text>
</textline>
</textbox>

yashodhan19 • Jun 26 '18 20:06
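
When the repeated lines carry exactly the same coordinates, as in the report above, one possible workaround is to post-process the XML and drop any `<textline>` whose `bbox` and text repeat an earlier line. The following is a minimal sketch using only the standard library, not a fix in pdfminer itself; the helper names `line_text` and `drop_exact_duplicates` are hypothetical, and the sketch assumes the output has the `<textbox>`/`<textline>`/`<text>` layout shown in this thread.

```python
import xml.etree.ElementTree as ET


def line_text(textline):
    """Concatenate the characters of all <text> children of a <textline>."""
    return "".join(t.text or "" for t in textline.iter("text"))


def drop_exact_duplicates(container):
    """Remove <textline> elements whose bbox and text repeat an earlier line.

    `container` is any element that holds text lines, for example one <page>.
    Returns the number of lines removed.
    """
    seen = set()
    removed = 0
    # ElementTree has no parent pointers, so walk every element and
    # inspect its direct <textline> children.
    for parent in list(container.iter()):
        for line in list(parent.findall("textline")):
            key = (line.get("bbox"), line_text(line))
            if key in seen:
                parent.remove(line)
                removed += 1
            else:
                seen.add(key)
    return removed
```

Calling the function once per `<page>` element, rather than once on the whole document, keeps legitimate repeats such as a header that appears at the same position on every page.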

I have the same problem and I don't know how to solve it.

984958198 • Jun 29 '18 04:06

Is there a solution for this? I have been facing this issue constantly.

<textbox id="0" bbox="28.200,755.676,221.331,776.787">
<textline bbox="28.500,755.676,221.331,776.787">
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="28.500,755.676,36.883,776.787" size="21.110">F</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="36.902,755.676,41.288,776.787" size="21.110">i</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="41.104,755.676,51.107,776.787" size="21.110">n</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="51.306,755.676,60.424,776.787" size="21.110">a</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="60.309,755.676,70.312,776.787" size="21.110">n</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="70.512,755.676,78.311,776.787" size="21.110">c</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="78.314,755.676,82.700,776.787" size="21.110">i</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="82.515,755.676,91.633,776.787" size="21.110">a</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="91.518,755.676,95.904,776.787" size="21.110">l</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="95.719,755.676,99.802,776.787" size="21.110"> </text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="99.920,755.676,108.584,776.787" size="21.110">S</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="108.322,755.676,114.502,776.787" size="21.110">t</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="114.324,755.676,123.442,776.787" size="21.110">a</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="123.326,755.676,133.330,776.787" size="21.110">n</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="133.529,755.676,143.338,776.787" size="21.110">d</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="143.132,755.676,147.518,776.787" size="21.110">i</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="147.333,755.676,157.337,776.787" size="21.110">n</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="157.536,755.676,166.286,776.787" size="21.110">g</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="165.938,755.676,170.022,776.787" size="21.110"> </text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="170.140,755.676,180.186,776.787" size="21.110">R</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="180.342,755.676,189.395,776.787" size="21.110">e</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="189.345,755.676,199.154,776.787" size="21.110">p</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="198.948,755.676,208.562,776.787" size="21.110">o</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="208.550,755.676,215.335,776.787" size="21.110">r</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="215.152,755.676,221.331,776.787" size="21.110">t</text>
<text>
</text>
</textline>
<textline bbox="28.200,755.676,221.031,776.787">
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="28.200,755.676,36.583,776.787" size="21.110">F</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="36.602,755.676,40.988,776.787" size="21.110">i</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="40.803,755.676,50.807,776.787" size="21.110">n</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="51.006,755.676,60.124,776.787" size="21.110">a</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="60.009,755.676,70.012,776.787" size="21.110">n</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="70.212,755.676,78.011,776.787" size="21.110">c</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="78.014,755.676,82.400,776.787" size="21.110">i</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="82.215,755.676,91.333,776.787" size="21.110">a</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="91.217,755.676,95.604,776.787" size="21.110">l</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="95.419,755.676,99.502,776.787" size="21.110"> </text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="99.620,755.676,108.284,776.787" size="21.110">S</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="108.022,755.676,114.202,776.787" size="21.110">t</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="114.024,755.676,123.142,776.787" size="21.110">a</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="123.026,755.676,133.030,776.787" size="21.110">n</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="133.229,755.676,143.038,776.787" size="21.110">d</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="142.832,755.676,147.218,776.787" size="21.110">i</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="147.033,755.676,157.037,776.787" size="21.110">n</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="157.236,755.676,165.986,776.787" size="21.110">g</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="165.638,755.676,169.722,776.787" size="21.110"> </text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="169.839,755.676,179.886,776.787" size="21.110">R</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="180.042,755.676,189.095,776.787" size="21.110">e</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="189.045,755.676,198.854,776.787" size="21.110">p</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="198.647,755.676,208.262,776.787" size="21.110">o</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="208.250,755.676,215.034,776.787" size="21.110">r</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="214.852,755.676,221.031,776.787" size="21.110">t</text>
<text>
</text>
</textline>
</textbox>

seedhisadak • Apr 17 '20 08:04

Any updates on this, please? I have many examples in which the text is repeated, and we cannot remove the duplicates because their coordinates are slightly off, so there is no reliable way to identify them. I have tested other parsers (such as TETML), and they do not generate duplicates. I am happy to share the original PDF document if required. Sample output:

<textline bbox="372.120,713.920,531.699,731.800">
<text font="DWLKVV+Roboto-Bold" bbox="372.120,713.920,379.860,731.800" size="17.881">0</text>
<text font="DWLKVV+Roboto-Bold" bbox="379.860,713.920,387.601,731.800" size="17.881">1</text>
<text font="DWLKVV+Roboto-Bold" bbox="387.601,713.920,393.547,731.800" size="17.881">/</text>
<text font="DWLKVV+Roboto-Bold" bbox="393.547,713.920,401.288,731.800" size="17.881">0</text>
<text font="DWLKVV+Roboto-Bold" bbox="401.288,713.920,409.028,731.800" size="17.881">3</text>
<text font="DWLKVV+Roboto-Bold" bbox="409.028,713.920,414.975,731.800" size="17.881">/</text>
<text font="DWLKVV+Roboto-Bold" bbox="414.975,713.920,422.715,731.800" size="17.881">2</text>
<text font="DWLKVV+Roboto-Bold" bbox="422.715,713.920,430.455,731.800" size="17.881">0</text>
<text font="DWLKVV+Roboto-Bold" bbox="430.455,713.920,438.196,731.800" size="17.881">2</text>
<text font="DWLKVV+Roboto-Bold" bbox="438.196,713.920,445.936,731.800" size="17.881">1</text>
<text font="DWLKVV+Roboto-Bold" bbox="445.936,713.920,449.294,731.800" size="17.881"> </text>
<text font="DWLKVV+Roboto-Bold" bbox="449.294,713.920,454.526,731.800" size="17.881">-</text>
<text font="DWLKVV+Roboto-Bold" bbox="454.526,713.920,457.883,731.800" size="17.881"> </text>
<text font="DWLKVV+Roboto-Bold" bbox="457.883,713.920,465.624,731.800" size="17.881">3</text>
<text font="DWLKVV+Roboto-Bold" bbox="465.624,713.920,473.364,731.800" size="17.881">1</text>
<text font="DWLKVV+Roboto-Bold" bbox="473.364,713.920,479.311,731.800" size="17.881">/</text>
<text font="DWLKVV+Roboto-Bold" bbox="479.311,713.920,487.051,731.800" size="17.881">0</text>
<text font="DWLKVV+Roboto-Bold" bbox="487.051,713.920,494.791,731.800" size="17.881">3</text>
<text font="DWLKVV+Roboto-Bold" bbox="494.791,713.920,500.738,731.800" size="17.881">/</text>
<text font="DWLKVV+Roboto-Bold" bbox="500.738,713.920,508.478,731.800" size="17.881">2</text>
<text font="DWLKVV+Roboto-Bold" bbox="508.478,713.920,516.219,731.800" size="17.881">0</text>
<text font="DWLKVV+Roboto-Bold" bbox="516.219,713.920,523.959,731.800" size="17.881">2</text>
<text font="DWLKVV+Roboto-Bold" bbox="523.959,713.920,531.699,731.800" size="17.881">1</text>
<text>
</text>
</textline>
<textline bbox="371.660,713.388,530.173,731.593">
<text font="OTSVAU+Roboto-Bold" bbox="371.660,713.388,379.527,731.593" size="18.206">0</text>
<text font="OTSVAU+Roboto-Bold" bbox="379.527,713.388,387.395,731.593" size="18.206">1</text>
<text font="OTSVAU+Roboto-Bold" bbox="387.395,713.388,392.516,731.593" size="18.206">/</text>
<text font="OTSVAU+Roboto-Bold" bbox="392.516,713.388,400.383,731.593" size="18.206">0</text>
<text font="OTSVAU+Roboto-Bold" bbox="400.383,713.388,408.250,731.593" size="18.206">3</text>
<text font="OTSVAU+Roboto-Bold" bbox="408.250,713.388,413.372,731.593" size="18.206">/</text>
<text font="OTSVAU+Roboto-Bold" bbox="413.372,713.388,421.239,731.593" size="18.206">2</text>
<text font="OTSVAU+Roboto-Bold" bbox="421.239,713.388,429.106,731.593" size="18.206">0</text>
<text font="OTSVAU+Roboto-Bold" bbox="429.106,713.388,436.974,731.593" size="18.206">2</text>
<text font="OTSVAU+Roboto-Bold" bbox="436.974,713.388,444.841,731.593" size="18.206">1</text>
<text font="OTSVAU+Roboto-Bold" bbox="444.841,713.388,448.260,731.593" size="18.206"> </text>
<text font="OTSVAU+Roboto-Bold" bbox="448.260,713.388,453.573,731.593" size="18.206">-</text>
<text font="OTSVAU+Roboto-Bold" bbox="453.573,713.388,456.992,731.593" size="18.206"> </text>
<text font="OTSVAU+Roboto-Bold" bbox="456.992,713.388,464.859,731.593" size="18.206">3</text>
<text font="OTSVAU+Roboto-Bold" bbox="464.859,713.388,472.727,731.593" size="18.206">1</text>
<text font="OTSVAU+Roboto-Bold" bbox="472.727,713.388,477.848,731.593" size="18.206">/</text>
<text font="OTSVAU+Roboto-Bold" bbox="477.848,713.388,485.715,731.593" size="18.206">0</text>
<text font="OTSVAU+Roboto-Bold" bbox="485.715,713.388,493.582,731.593" size="18.206">3</text>
<text font="OTSVAU+Roboto-Bold" bbox="493.582,713.388,498.704,731.593" size="18.206">/</text>
<text font="OTSVAU+Roboto-Bold" bbox="498.704,713.388,506.571,731.593" size="18.206">2</text>
<text font="OTSVAU+Roboto-Bold" bbox="506.571,713.388,514.438,731.593" size="18.206">0</text>
<text font="OTSVAU+Roboto-Bold" bbox="514.438,713.388,522.306,731.593" size="18.206">2</text>
<text font="OTSVAU+Roboto-Bold" bbox="522.306,713.388,530.173,731.593" size="18.206">1</text>
<text>
</text>
</textline>

vestronge • May 23 '21 14:05
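
For the cases in this thread where the repeated line is shifted by a fraction of a point or more, exact matching on `bbox` will not catch the duplicate. A possible workaround is to treat two lines as duplicates when their text is identical and every `bbox` coordinate differs by no more than a tolerance. The sketch below is an assumption-laden post-processing step, not pdfminer behaviour: the helper names and the default tolerance of 2.0 points are hypothetical, and a tolerance that is too large could remove text that is genuinely repeated close by.

```python
import xml.etree.ElementTree as ET


def parse_bbox(element):
    """Turn a bbox attribute such as "1.0,2.0,3.0,4.0" into a tuple of floats."""
    return tuple(float(v) for v in element.get("bbox").split(","))


def line_text(textline):
    """Concatenate the characters of all <text> children of a <textline>."""
    return "".join(t.text or "" for t in textline.iter("text"))


def is_near_duplicate(a, b, tol=2.0):
    """True if two <textline> elements have equal text and bboxes within tol."""
    if line_text(a) != line_text(b):
        return False
    return all(abs(x - y) <= tol for x, y in zip(parse_bbox(a), parse_bbox(b)))


def drop_near_duplicates(container, tol=2.0):
    """Remove <textline> elements that nearly coincide with an earlier line.

    `container` is any element that holds text lines, for example one <page>.
    Returns the number of lines removed.
    """
    kept = []
    removed = 0
    for parent in list(container.iter()):
        for line in list(parent.findall("textline")):
            if any(is_near_duplicate(line, k, tol) for k in kept):
                parent.remove(line)
                removed += 1
            else:
                kept.append(line)
    return removed
```

The font name is deliberately ignored in the comparison, because the two copies in the sample output above use different subset prefixes (`DWLKVV+` and `OTSVAU+`) for the same font. The comparison is quadratic in the number of lines per container, which should be acceptable for a single page.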