pdf2txt.py creates duplicate characters
pdf2txt.py -t xml "path/to/input.pdf" > "/path/to/output.xml"
The XML output contains duplicated text. The duplicates are usually in the same text box, but not always, and they have exactly the same coordinates as the original. I tested the same file with other parsers, and their output did not contain any duplicates.
Here is an XML snippet from the output that contains the duplicate:
<textbox id="1" bbox="248.401,733.724,278.558,743.773">
<textline bbox="248.401,733.724,278.558,743.773">
<text font="PYBHHG+Helvetica-Bold" bbox="248.401,733.724,252.661,743.773" size="10.048">C</text>
<text font="PYBHHG+Helvetica-Bold" bbox="252.661,733.724,257.251,743.773" size="10.048">O</text>
<text font="PYBHHG+Helvetica-Bold" bbox="257.251,733.724,262.166,743.773" size="10.048">M</text>
<text font="PYBHHG+Helvetica-Bold" bbox="262.166,733.724,266.102,743.773" size="10.048">P</text>
<text font="PYBHHG+Helvetica-Bold" bbox="266.102,733.724,270.362,743.773" size="10.048">A</text>
<text font="PYBHHG+Helvetica-Bold" bbox="270.362,733.724,274.622,743.773" size="10.048">N</text>
<text font="PYBHHG+Helvetica-Bold" bbox="274.622,733.724,278.558,743.773" size="10.048">Y</text>
<text> </text>
</textline>
<textline bbox="248.401,733.724,278.558,743.773">
<text font="PYBHHG+Helvetica-Bold" bbox="248.401,733.724,252.661,743.773" size="10.048">C</text>
<text font="PYBHHG+Helvetica-Bold" bbox="252.661,733.724,257.251,743.773" size="10.048">O</text>
<text font="PYBHHG+Helvetica-Bold" bbox="257.251,733.724,262.166,743.773" size="10.048">M</text>
<text font="PYBHHG+Helvetica-Bold" bbox="262.166,733.724,266.102,743.773" size="10.048">P</text>
<text font="PYBHHG+Helvetica-Bold" bbox="266.102,733.724,270.362,743.773" size="10.048">A</text>
<text font="PYBHHG+Helvetica-Bold" bbox="270.362,733.724,274.622,743.773" size="10.048">N</text>
<text font="PYBHHG+Helvetica-Bold" bbox="274.622,733.724,278.558,743.773" size="10.048">Y</text>
<text> </text>
</textline>
</textbox>
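When the duplicates have exactly the same coordinates, as in this snippet, they can probably be filtered out of the XML after the fact. The following is only a sketch of such a post-processing step, not a fix in pdf2txt.py itself. The function name and the file names in the usage comment are made up for illustration, and it assumes the lines sit inside <page> elements (it falls back to the whole document if there are none) and that two lines on the same page with identical bbox and identical text are never both legitimate.

```python
import xml.etree.ElementTree as ET


def drop_exact_duplicate_lines(root):
    """Remove <textline> elements that repeat an earlier line's bbox and text.

    Duplicates are detected per <page>, so identical lines on different
    pages (for example running headers) are left alone.
    Returns the number of removed lines.
    """
    removed = 0
    # Fall back to the root element if the snippet has no <page> wrapper.
    scopes = list(root.iter("page")) or [root]
    for scope in scopes:
        seen = set()
        # Collect (parent, line) pairs first so the tree is not mutated
        # while it is being traversed.
        pairs = [
            (parent, child)
            for parent in scope.iter()
            for child in parent
            if child.tag == "textline"
        ]
        for parent, line in pairs:
            text = "".join(t.text or "" for t in line.iter("text"))
            key = (line.get("bbox"), text)
            if key in seen:
                parent.remove(line)
                removed += 1
            else:
                seen.add(key)
    return removed


# Hypothetical usage on a file written by pdf2txt.py -t xml:
# tree = ET.parse("output.xml")
# drop_exact_duplicate_lines(tree.getroot())
# tree.write("output_deduped.xml", encoding="utf-8")
```

This only hides the symptom in the output; it does not explain why the text is emitted twice in the first place.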
I have the same problem and I don't know how to solve it.
Is there a solution for this? I have been facing this issue constantly.
<textbox id="0" bbox="28.200,755.676,221.331,776.787">
<textline bbox="28.500,755.676,221.331,776.787">
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="28.500,755.676,36.883,776.787" size="21.110">F</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="36.902,755.676,41.288,776.787" size="21.110">i</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="41.104,755.676,51.107,776.787" size="21.110">n</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="51.306,755.676,60.424,776.787" size="21.110">a</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="60.309,755.676,70.312,776.787" size="21.110">n</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="70.512,755.676,78.311,776.787" size="21.110">c</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="78.314,755.676,82.700,776.787" size="21.110">i</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="82.515,755.676,91.633,776.787" size="21.110">a</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="91.518,755.676,95.904,776.787" size="21.110">l</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="95.719,755.676,99.802,776.787" size="21.110"> </text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="99.920,755.676,108.584,776.787" size="21.110">S</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="108.322,755.676,114.502,776.787" size="21.110">t</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="114.324,755.676,123.442,776.787" size="21.110">a</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="123.326,755.676,133.330,776.787" size="21.110">n</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="133.529,755.676,143.338,776.787" size="21.110">d</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="143.132,755.676,147.518,776.787" size="21.110">i</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="147.333,755.676,157.337,776.787" size="21.110">n</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="157.536,755.676,166.286,776.787" size="21.110">g</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="165.938,755.676,170.022,776.787" size="21.110"> </text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="170.140,755.676,180.186,776.787" size="21.110">R</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="180.342,755.676,189.395,776.787" size="21.110">e</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="189.345,755.676,199.154,776.787" size="21.110">p</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="198.948,755.676,208.562,776.787" size="21.110">o</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="208.550,755.676,215.335,776.787" size="21.110">r</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="215.152,755.676,221.331,776.787" size="21.110">t</text>
<text>
</text>
</textline>
<textline bbox="28.200,755.676,221.031,776.787">
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="28.200,755.676,36.583,776.787" size="21.110">F</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="36.602,755.676,40.988,776.787" size="21.110">i</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="40.803,755.676,50.807,776.787" size="21.110">n</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="51.006,755.676,60.124,776.787" size="21.110">a</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="60.009,755.676,70.012,776.787" size="21.110">n</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="70.212,755.676,78.011,776.787" size="21.110">c</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="78.014,755.676,82.400,776.787" size="21.110">i</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="82.215,755.676,91.333,776.787" size="21.110">a</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="91.217,755.676,95.604,776.787" size="21.110">l</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="95.419,755.676,99.502,776.787" size="21.110"> </text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="99.620,755.676,108.284,776.787" size="21.110">S</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="108.022,755.676,114.202,776.787" size="21.110">t</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="114.024,755.676,123.142,776.787" size="21.110">a</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="123.026,755.676,133.030,776.787" size="21.110">n</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="133.229,755.676,143.038,776.787" size="21.110">d</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="142.832,755.676,147.218,776.787" size="21.110">i</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="147.033,755.676,157.037,776.787" size="21.110">n</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="157.236,755.676,165.986,776.787" size="21.110">g</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="165.638,755.676,169.722,776.787" size="21.110"> </text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="169.839,755.676,179.886,776.787" size="21.110">R</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="180.042,755.676,189.095,776.787" size="21.110">e</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="189.045,755.676,198.854,776.787" size="21.110">p</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="198.647,755.676,208.262,776.787" size="21.110">o</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="208.250,755.676,215.034,776.787" size="21.110">r</text>
<text font="QELAAA+OpenSansSemiBoldRegular" bbox="214.852,755.676,221.031,776.787" size="21.110">t</text>
<text>
</text>
</textline>
</textbox>
Any updates on this, please? I have many examples in which the text is repeated, and we cannot simply remove the duplicates because their coordinates are slightly off, so an exact match on coordinates does not identify them. I have tested with other parsers (like TETML), and they don't generate duplicates. I am happy to share the original PDF document if required. Sample output:
<textline bbox="372.120,713.920,531.699,731.800">
<text font="DWLKVV+Roboto-Bold" bbox="372.120,713.920,379.860,731.800" size="17.881">0</text>
<text font="DWLKVV+Roboto-Bold" bbox="379.860,713.920,387.601,731.800" size="17.881">1</text>
<text font="DWLKVV+Roboto-Bold" bbox="387.601,713.920,393.547,731.800" size="17.881">/</text>
<text font="DWLKVV+Roboto-Bold" bbox="393.547,713.920,401.288,731.800" size="17.881">0</text>
<text font="DWLKVV+Roboto-Bold" bbox="401.288,713.920,409.028,731.800" size="17.881">3</text>
<text font="DWLKVV+Roboto-Bold" bbox="409.028,713.920,414.975,731.800" size="17.881">/</text>
<text font="DWLKVV+Roboto-Bold" bbox="414.975,713.920,422.715,731.800" size="17.881">2</text>
<text font="DWLKVV+Roboto-Bold" bbox="422.715,713.920,430.455,731.800" size="17.881">0</text>
<text font="DWLKVV+Roboto-Bold" bbox="430.455,713.920,438.196,731.800" size="17.881">2</text>
<text font="DWLKVV+Roboto-Bold" bbox="438.196,713.920,445.936,731.800" size="17.881">1</text>
<text font="DWLKVV+Roboto-Bold" bbox="445.936,713.920,449.294,731.800" size="17.881"> </text>
<text font="DWLKVV+Roboto-Bold" bbox="449.294,713.920,454.526,731.800" size="17.881">-</text>
<text font="DWLKVV+Roboto-Bold" bbox="454.526,713.920,457.883,731.800" size="17.881"> </text>
<text font="DWLKVV+Roboto-Bold" bbox="457.883,713.920,465.624,731.800" size="17.881">3</text>
<text font="DWLKVV+Roboto-Bold" bbox="465.624,713.920,473.364,731.800" size="17.881">1</text>
<text font="DWLKVV+Roboto-Bold" bbox="473.364,713.920,479.311,731.800" size="17.881">/</text>
<text font="DWLKVV+Roboto-Bold" bbox="479.311,713.920,487.051,731.800" size="17.881">0</text>
<text font="DWLKVV+Roboto-Bold" bbox="487.051,713.920,494.791,731.800" size="17.881">3</text>
<text font="DWLKVV+Roboto-Bold" bbox="494.791,713.920,500.738,731.800" size="17.881">/</text>
<text font="DWLKVV+Roboto-Bold" bbox="500.738,713.920,508.478,731.800" size="17.881">2</text>
<text font="DWLKVV+Roboto-Bold" bbox="508.478,713.920,516.219,731.800" size="17.881">0</text>
<text font="DWLKVV+Roboto-Bold" bbox="516.219,713.920,523.959,731.800" size="17.881">2</text>
<text font="DWLKVV+Roboto-Bold" bbox="523.959,713.920,531.699,731.800" size="17.881">1</text>
<text>
</text>
</textline>
<textline bbox="371.660,713.388,530.173,731.593">
<text font="OTSVAU+Roboto-Bold" bbox="371.660,713.388,379.527,731.593" size="18.206">0</text>
<text font="OTSVAU+Roboto-Bold" bbox="379.527,713.388,387.395,731.593" size="18.206">1</text>
<text font="OTSVAU+Roboto-Bold" bbox="387.395,713.388,392.516,731.593" size="18.206">/</text>
<text font="OTSVAU+Roboto-Bold" bbox="392.516,713.388,400.383,731.593" size="18.206">0</text>
<text font="OTSVAU+Roboto-Bold" bbox="400.383,713.388,408.250,731.593" size="18.206">3</text>
<text font="OTSVAU+Roboto-Bold" bbox="408.250,713.388,413.372,731.593" size="18.206">/</text>
<text font="OTSVAU+Roboto-Bold" bbox="413.372,713.388,421.239,731.593" size="18.206">2</text>
<text font="OTSVAU+Roboto-Bold" bbox="421.239,713.388,429.106,731.593" size="18.206">0</text>
<text font="OTSVAU+Roboto-Bold" bbox="429.106,713.388,436.974,731.593" size="18.206">2</text>
<text font="OTSVAU+Roboto-Bold" bbox="436.974,713.388,444.841,731.593" size="18.206">1</text>
<text font="OTSVAU+Roboto-Bold" bbox="444.841,713.388,448.260,731.593" size="18.206"> </text>
<text font="OTSVAU+Roboto-Bold" bbox="448.260,713.388,453.573,731.593" size="18.206">-</text>
<text font="OTSVAU+Roboto-Bold" bbox="453.573,713.388,456.992,731.593" size="18.206"> </text>
<text font="OTSVAU+Roboto-Bold" bbox="456.992,713.388,464.859,731.593" size="18.206">3</text>
<text font="OTSVAU+Roboto-Bold" bbox="464.859,713.388,472.727,731.593" size="18.206">1</text>
<text font="OTSVAU+Roboto-Bold" bbox="472.727,713.388,477.848,731.593" size="18.206">/</text>
<text font="OTSVAU+Roboto-Bold" bbox="477.848,713.388,485.715,731.593" size="18.206">0</text>
<text font="OTSVAU+Roboto-Bold" bbox="485.715,713.388,493.582,731.593" size="18.206">3</text>
<text font="OTSVAU+Roboto-Bold" bbox="493.582,713.388,498.704,731.593" size="18.206">/</text>
<text font="OTSVAU+Roboto-Bold" bbox="498.704,713.388,506.571,731.593" size="18.206">2</text>
<text font="OTSVAU+Roboto-Bold" bbox="506.571,713.388,514.438,731.593" size="18.206">0</text>
<text font="OTSVAU+Roboto-Bold" bbox="514.438,713.388,522.306,731.593" size="18.206">2</text>
<text font="OTSVAU+Roboto-Bold" bbox="522.306,713.388,530.173,731.593" size="18.206">1</text>
<text>
</text>
</textline>
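For the case where the coordinates are slightly off, one possible workaround is to compare the line text and allow a small tolerance on each bbox value. This is a sketch under assumptions, not a confirmed fix: the function names are made up, the 2.0 point tolerance is a guess based on the offsets in the samples in this thread (all below 2 points), and the font attribute is deliberately ignored because the two copies above carry different subset prefixes (DWLKVV vs. OTSVAU).

```python
import xml.etree.ElementTree as ET


def _line_text(line):
    # Join the characters of a <textline>; strip the trailing newline entry.
    return "".join(t.text or "" for t in line.iter("text")).strip()


def _bbox(element):
    # "x0,y0,x1,y1" -> tuple of floats
    return tuple(float(v) for v in element.get("bbox").split(","))


def drop_near_duplicate_lines(root, tol=2.0):
    """Remove <textline> elements whose text equals an earlier line on the
    same page and whose bbox values all differ by at most `tol` points.
    Returns the number of removed lines.
    """
    removed = 0
    # Fall back to the root element if there is no <page> wrapper.
    scopes = list(root.iter("page")) or [root]
    for scope in scopes:
        kept = []  # (text, bbox) of the lines kept so far on this page
        pairs = [
            (parent, child)
            for parent in scope.iter()
            for child in parent
            if child.tag == "textline"
        ]
        for parent, line in pairs:
            text, box = _line_text(line), _bbox(line)
            is_duplicate = any(
                text == kept_text
                and all(abs(a - b) <= tol for a, b in zip(box, kept_box))
                for kept_text, kept_box in kept
            )
            if is_duplicate:
                parent.remove(line)
                removed += 1
            else:
                kept.append((text, box))
    return removed
```

A tolerance that is too large could also remove genuinely repeated text that happens to sit close together, so the value would need checking against real documents.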