Error case with invalid character mapping

Open lfoppiano opened this issue 1 year ago • 0 comments

Document DOI: dx.doi.org/10.1063/1.3068408 (probably behind a paywall)

image

pdfalto from grobid-0.7.2-SNAPSHOT extracts it as follows (the interesting token is the one with ID=p1_w193):

            <String ID="p1_w186" CONTENT="i" HPOS="170.001" VPOS="229.501" WIDTH="1.9417" HEIGHT="5.9928" STYLEREFS="font10"/>
            <String ID="p1_w187" CONTENT="H" HPOS="172.082" VPOS="224.848" WIDTH="7.2041" HEIGHT="8.5611" STYLEREFS="font9"/>
            <String ID="p1_w188" CONTENT="c" HPOS="179.484" VPOS="228.886" WIDTH="3.1012" HEIGHT="5.9928" STYLEREFS="font10"/>
            <SP WIDTH="1.0350" VPOS="228.886" HPOS="182.585"/>
            <String ID="p1_w189" CONTENT="=" HPOS="183.620" VPOS="224.758" WIDTH="5.6276" HEIGHT="8.7707" STYLEREFS="font7"/>
            <SP WIDTH="1.0936" VPOS="224.758" HPOS="189.248"/>
            <String ID="p1_w190" CONTENT="6.2" HPOS="190.341" VPOS="224.758" WIDTH="12.4725" HEIGHT="8.7707" STYLEREFS="font7"/>
            <SP WIDTH="5.1876" VPOS="224.758" HPOS="202.814"/>
            <String ID="p1_w191" CONTENT="kOe" HPOS="208.001" VPOS="224.758" WIDTH="16.6233" HEIGHT="8.7707" STYLEREFS="font7"/>
            <SP WIDTH="9.4422" VPOS="224.758" HPOS="224.625"/>
            <String ID="p1_w192" CONTENT="and" HPOS="234.067" VPOS="224.758" WIDTH="14.4082" HEIGHT="8.7707" STYLEREFS="font7"/>
            <SP WIDTH="9.4402" VPOS="224.758" HPOS="248.475"/>
            <String ID="p1_w193" CONTENT="͑BH͒" HPOS="257.915" VPOS="223.880" WIDTH="20.5407" HEIGHT="9.9780" STYLEREFS="font6"/>
            <String ID="p1_w194" CONTENT="max" HPOS="278.653" VPOS="228.823" WIDTH="12.0289" HEIGHT="6.1395" STYLEREFS="font8"/>
            <SP WIDTH="0.8959" VPOS="228.823" HPOS="290.682"/>
            <String ID="p1_w195" CONTENT="=" HPOS="291.578" VPOS="224.758" WIDTH="5.6276" HEIGHT="8.7707" STYLEREFS="font7"/>
            <SP WIDTH="1.0936" VPOS="224.758" HPOS="297.206"/>
            <String ID="p1_w196" CONTENT="5.6" HPOS="298.299" VPOS="224.758" WIDTH="12.4725" HEIGHT="8.7707" STYLEREFS="font7"/>
            <SP WIDTH="5.1876" VPOS="224.758" HPOS="310.772"/>
            <String ID="p1_w197" CONTENT="MG" HPOS="315.959" VPOS="224.758" WIDTH="16.0746" HEIGHT="8.7707" STYLEREFS="font7"/>
            <SP WIDTH="2.6921" VPOS="224.758" HPOS="332.034"/>
            <String ID="p1_w198" CONTENT="Oe" HPOS="334.726" VPOS="224.758" WIDTH="11.6343" HEIGHT="8.7707" STYLEREFS="font7"/>
            <SP WIDTH="9.4422" VPOS="224.758" HPOS="346.360"/>

Are the characters mapped incorrectly? I forgot what the cause was.

            <String ID="p1_w193" CONTENT="͑BH͒" HPOS="257.915" VPOS="223.880" WIDTH="20.5407" HEIGHT="9.9780" STYLEREFS="font6"/>

The parentheses seem to be merged into the two characters B and H.
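To make cases like this easier to spot, here is a minimal sketch (not part of pdfalto) that scans ALTO `String` elements and reports any token whose `CONTENT` contains a Unicode combining mark. The fragment, the function name `find_combining_marks`, and the assumption that the two stray characters are U+0351 and U+0352 are mine; check them against the real output before relying on this.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the ALTO output above. The two stray
# characters are written as escapes and ASSUMED to be U+0351 and U+0352.
ALTO_FRAGMENT = (
    '<TextLine>'
    '<String ID="p1_w192" CONTENT="and"/>'
    '<String ID="p1_w193" CONTENT="\u0351BH\u0352"/>'
    '<String ID="p1_w194" CONTENT="max"/>'
    '</TextLine>'
)


def find_combining_marks(alto_xml):
    """Return (string_id, [(code_point, name), ...]) for each String
    element whose CONTENT contains at least one combining mark."""
    suspicious = []
    for node in ET.fromstring(alto_xml).iter():
        # Compare the local tag name so namespaced ALTO files also match.
        if node.tag.rsplit("}", 1)[-1] != "String":
            continue
        marks = [
            (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
            for ch in node.get("CONTENT", "")
            # General categories Mn, Mc and Me are the combining marks.
            if unicodedata.category(ch).startswith("M")
        ]
        if marks:
            suspicious.append((node.get("ID"), marks))
    return suspicious


for string_id, marks in find_combining_marks(ALTO_FRAGMENT):
    print(string_id, marks)
```

Only `p1_w193` should be flagged here, since the neighbouring tokens contain plain ASCII. Note that this check would also flag legitimate combining accents in decomposed text, so it is a filter for manual review rather than a fix.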

lfoppiano, Jul 22 '22 08:07