pdfalto

Line number error cases - first line number not removed

Open de-code opened this issue 3 years ago • 2 comments

Hi @kermitt2

I have now merged with upstream master and during evaluation I found some error cases where the line numbers are not filtered out.

I can confirm that the line numbers are removed for the example that @lfoppiano was using: https://doi.org/10.1101/2020.04.21.054221 (i.e. it looks like I am doing at least something right).

Here are some examples where it doesn't seem to work. It appears that the first line number (1) is not removed, but subsequent line numbers are (I currently don't have a way to visualise the lxml to confirm that more easily). As a result, the title is usually affected the most.
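Without a visual tool, one way to check this is to list every digit-only token with its coordinates, together with the TextLine elements that were left empty. The following is a minimal Python sketch of that idea. The function name `summarise_numeric_tokens` and the trimmed `SAMPLE` (cut down from Example 2 below) are illustrative assumptions; tag matching strips any XML namespace, since full ALTO files may declare one.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def summarise_numeric_tokens(alto_xml: str):
    """Return (numeric_tokens, empty_lines) for an ALTO fragment.

    numeric_tokens: (content, hpos, vpos) for each digit-only String element.
    empty_lines: (hpos, vpos) for each TextLine without any String child.
    """
    root = ET.fromstring(alto_xml.strip())
    numeric_tokens = []
    empty_lines = []
    for elem in root.iter():
        name = local_name(elem.tag)
        if name == "String" and elem.get("CONTENT", "").isdigit():
            numeric_tokens.append(
                (elem.get("CONTENT"), float(elem.get("HPOS")), float(elem.get("VPOS")))
            )
        elif name == "TextLine":
            has_string = any(local_name(child.tag) == "String" for child in elem)
            if not has_string:
                empty_lines.append((float(elem.get("HPOS")), float(elem.get("VPOS"))))
    return numeric_tokens, empty_lines


# Trimmed fragment based on Example 2 (attributes not needed here are omitted).
SAMPLE = """
<Page ID="Page1" PHYSICAL_IMG_NR="1" WIDTH="612.000" HEIGHT="792.000">
  <PrintSpace>
    <TextBlock ID="p1_b1" HPOS="303.212" VPOS="733.266">
      <TextLine ID="p1_t1" HPOS="303.212" VPOS="733.266">
        <String ID="p1_w1" CONTENT="1" HPOS="303.212" VPOS="733.266"/>
      </TextLine>
    </TextBlock>
    <TextBlock ID="p1_b2" HPOS="48.4250" VPOS="78.1190">
      <TextLine ID="p1_t2" HPOS="48.4250" VPOS="78.1190"/>
    </TextBlock>
    <TextBlock ID="p1_b3" HPOS="96.4080" VPOS="75.8030">
      <TextLine ID="p1_t3" HPOS="96.4080" VPOS="75.8030">
        <String ID="p1_w3" CONTENT="The" HPOS="96.4080" VPOS="75.8030"/>
      </TextLine>
    </TextBlock>
  </PrintSpace>
</Page>
"""

if __name__ == "__main__":
    tokens, empty = summarise_numeric_tokens(SAMPLE)
    print("numeric tokens:", tokens)
    print("empty text lines:", empty)
```

An empty TextLine in the left margin would be consistent with a line number having been removed there, while any digit-only token that remains can be checked by its position on the page.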

Example 1

https://www.biorxiv.org/content/10.1101/210401v1?versioned=true

    <Page ID="Page1" PHYSICAL_IMG_NR="1" WIDTH="612.000" HEIGHT="792.000">
      <PrintSpace>
        <TextBlock ID="p1_b1" HPOS="302.930" VPOS="732.329" HEIGHT="10.2120" WIDTH="6.1382">
          <TextLine WIDTH="6.1382" HEIGHT="10.2120" ID="p1_t1" HPOS="302.930" VPOS="732.329">
            <String ID="p1_w1" CONTENT="1" HPOS="302.930" VPOS="732.329" WIDTH="6.1382" HEIGHT="10.2120" STYLEREFS="font0"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="p1_b2" HPOS="72.0240" VPOS="74.8640" HEIGHT="10.8000" WIDTH="468.196">
          <TextLine WIDTH="468.196" HEIGHT="10.8000" ID="p1_t2" HPOS="72.0240" VPOS="74.8640">
            <String ID="p1_w2" CONTENT="Combinatorial" HPOS="72.0240" VPOS="74.8640" WIDTH="75.2760" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="11.0640" VPOS="74.8640" HPOS="147.300"/>
            <String ID="p1_w3" CONTENT="effect" HPOS="158.364" VPOS="74.8640" WIDTH="27.9720" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="10.9920" VPOS="74.8640" HPOS="186.336"/>
            <String ID="p1_w4" CONTENT="of" HPOS="197.328" VPOS="74.8640" WIDTH="9.9960" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="11.1000" VPOS="74.8640" HPOS="207.324"/>
            <String ID="p1_w5" CONTENT="promoter" HPOS="218.424" VPOS="74.8640" WIDTH="48.5040" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="10.9800" VPOS="74.8640" HPOS="266.928"/>
            <String ID="p1_w6" CONTENT="activity," HPOS="277.908" VPOS="74.8640" WIDTH="41.0520" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="11.3500" VPOS="74.8640" HPOS="318.960"/>
            <String ID="p1_w7" CONTENT="mRNA" HPOS="330.310" VPOS="74.8640" WIDTH="35.7840" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="11.1120" VPOS="74.8640" HPOS="366.094"/>
            <String ID="p1_w8" CONTENT="degradation" HPOS="377.206" VPOS="74.8640" WIDTH="62.0760" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="11.0640" VPOS="74.8640" HPOS="439.282"/>
            <String ID="p1_w9" CONTENT="and" HPOS="450.346" VPOS="74.8640" WIDTH="19.3800" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="11.2140" VPOS="74.8640" HPOS="469.726"/>
            <String ID="p1_w10" CONTENT="site-specific" HPOS="480.940" VPOS="74.8640" WIDTH="59.2800" HEIGHT="10.8000" STYLEREFS="font1"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="p1_b3" HPOS="48.3600" VPOS="74.7800" HEIGHT="11.0400" WIDTH="5.5973">
          <TextLine WIDTH="5.5973" HEIGHT="11.0400" ID="p1_t3" HPOS="48.3600" VPOS="74.7800"/>
        </TextBlock>
        <TextBlock ID="p1_b4" HPOS="72.0240" VPOS="102.464" HEIGHT="10.8000" WIDTH="320.806">
          <TextLine WIDTH="320.806" HEIGHT="10.8000" ID="p1_t4" HPOS="72.0240" VPOS="102.464">
            <String ID="p1_w12" CONTENT="transcriptional" HPOS="72.0240" VPOS="102.464" WIDTH="76.6200" HEIGHT="10.8000" STYLEREFS="font1"/>
            <SP WIDTH="3.0560" VPOS="102.464" HPOS="148.644"/>

Example 2

https://doi.org/10.1101/440115

    <Page ID="Page1" PHYSICAL_IMG_NR="1" WIDTH="612.000" HEIGHT="792.000">
      <PrintSpace>
        <TextBlock ID="p1_b1" HPOS="303.212" VPOS="733.266" HEIGHT="9.9110" WIDTH="5.5770">
          <TextLine WIDTH="5.5770" HEIGHT="9.9110" ID="p1_t1" HPOS="303.212" VPOS="733.266">
            <String ID="p1_w1" CONTENT="1" HPOS="303.212" VPOS="733.266" WIDTH="5.5770" HEIGHT="9.9110" STYLEREFS="font0"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="p1_b2" HPOS="48.4250" VPOS="78.1190" HEIGHT="9.9110" WIDTH="5.5770">
          <TextLine WIDTH="5.5770" HEIGHT="9.9110" ID="p1_t2" HPOS="48.4250" VPOS="78.1190"/>
        </TextBlock>
        <TextBlock ID="p1_b3" HPOS="96.4080" VPOS="75.8030" HEIGHT="12.2920" WIDTH="419.174">
          <TextLine WIDTH="419.174" HEIGHT="12.2920" ID="p1_t3" HPOS="96.4080" VPOS="75.8030">
            <String ID="p1_w3" CONTENT="The" HPOS="96.4080" VPOS="75.8030" WIDTH="23.3520" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.4860" VPOS="75.8030" HPOS="119.760"/>
            <String ID="p1_w4" CONTENT="River" HPOS="123.246" VPOS="75.8030" WIDTH="33.4320" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="75.8030" HPOS="156.678"/>
            <String ID="p1_w5" CONTENT="Runs" HPOS="160.178" VPOS="75.8030" WIDTH="31.1360" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="75.8030" HPOS="191.314"/>
            <String ID="p1_w6" CONTENT="Through" HPOS="194.814" VPOS="75.8030" WIDTH="52.9060" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="75.8030" HPOS="247.720"/>
            <String ID="p1_w7" CONTENT="It:" HPOS="251.220" VPOS="75.8030" WIDTH="14.7700" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="75.8030" HPOS="265.990"/>
            <String ID="p1_w8" CONTENT="the" HPOS="269.490" VPOS="75.8030" WIDTH="18.6620" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="75.8030" HPOS="288.152"/>
            <String ID="p1_w9" CONTENT="Athabasca" HPOS="291.652" VPOS="75.8030" WIDTH="63.0140" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="75.8030" HPOS="354.666"/>
            <String ID="p1_w10" CONTENT="River" HPOS="358.166" VPOS="75.8030" WIDTH="33.4320" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.4860" VPOS="75.8030" HPOS="391.598"/>
            <String ID="p1_w11" CONTENT="Delivers" HPOS="395.084" VPOS="75.8030" WIDTH="48.9860" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="75.8030" HPOS="444.070"/>
            <String ID="p1_w12" CONTENT="Mercury" HPOS="447.570" VPOS="75.8030" WIDTH="52.8500" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="75.8030" HPOS="500.420"/>
            <String ID="p1_w13" CONTENT="to" HPOS="503.920" VPOS="75.8030" WIDTH="11.6620" HEIGHT="12.2920" STYLEREFS="font1"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="p1_b4" HPOS="48.4250" VPOS="96.6320" HEIGHT="9.9110" WIDTH="5.5770">
          <TextLine WIDTH="5.5770" HEIGHT="9.9110" ID="p1_t4" HPOS="48.4250" VPOS="96.6320"/>
        </TextBlock>
        <TextBlock ID="p1_b5" HPOS="182.732" VPOS="94.3160" HEIGHT="12.2920" WIDTH="246.540">
          <TextLine WIDTH="246.540" HEIGHT="12.2920" ID="p1_t5" HPOS="182.732" VPOS="94.3160">
            <String ID="p1_w15" CONTENT="Aquatic" HPOS="182.732" VPOS="94.3160" WIDTH="47.4460" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="94.3160" HPOS="230.178"/>
            <String ID="p1_w16" CONTENT="Birds" HPOS="233.678" VPOS="94.3160" WIDTH="32.6760" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="94.3160" HPOS="266.354"/>
            <String ID="p1_w17" CONTENT="Breeding" HPOS="269.854" VPOS="94.3160" WIDTH="54.4460" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.5000" VPOS="94.3160" HPOS="324.300"/>
            <String ID="p1_w18" CONTENT="Far" HPOS="327.800" VPOS="94.3160" WIDTH="21.7700" HEIGHT="12.2920" STYLEREFS="font1"/>
            <SP WIDTH="3.4860" VPOS="94.3160" HPOS="349.570"/>
            <String ID="p1_w19" CONTENT="Downstream" HPOS="353.056" VPOS="94.3160" WIDTH="76.2160" HEIGHT="12.2920" STYLEREFS="font1"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="p1_b6" HPOS="48.4250" VPOS="123.277" HEIGHT="9.9110" WIDTH="5.5770">
          <TextLine WIDTH="5.5770" HEIGHT="9.9110" ID="p1_t6" HPOS="48.4250" VPOS="123.277"/>
        </TextBlock>
        <TextBlock ID="p1_b7" HPOS="48.4250" VPOS="149.145" HEIGHT="9.9110" WIDTH="5.5770">
          <TextLine WIDTH="5.5770" HEIGHT="9.9110" ID="p1_t7" HPOS="48.4250" VPOS="149.145"/>
        </TextBlock>

Example 3

https://doi.org/10.1101/434563

    <Page ID="Page1" PHYSICAL_IMG_NR="1" WIDTH="612.000" HEIGHT="792.000">
      <PrintSpace>
        <TextBlock ID="p1_b1" HPOS="516.000" VPOS="745.510" HEIGHT="10.5360" WIDTH="6.0000">
          <TextLine WIDTH="6.0000" HEIGHT="10.5360" ID="p1_t1" HPOS="516.000" VPOS="745.510">
            <String ID="p1_w1" CONTENT="1" HPOS="516.000" VPOS="745.510" WIDTH="6.0000" HEIGHT="10.5360" STYLEREFS="font0"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="p1_b2" HPOS="67.0000" VPOS="81.6980" HEIGHT="8.7800" WIDTH="5.0000">
          <TextLine WIDTH="5.0000" HEIGHT="8.7800" ID="p1_t2" HPOS="67.0000" VPOS="81.6980"/>
        </TextBlock>
        <TextBlock ID="p1_b3" HPOS="89.9960" VPOS="80.4420" HEIGHT="10.2080" WIDTH="231.088">
          <TextLine WIDTH="231.088" HEIGHT="10.2080" ID="p1_t3" HPOS="89.9960" VPOS="80.4420">
            <String ID="p1_w3" CONTENT="Schlafen" HPOS="89.9960" VPOS="80.4420" WIDTH="45.8480" HEIGHT="10.2080" STYLEREFS="font2"/>
            <SP WIDTH="3.0580" VPOS="80.4420" HPOS="135.844"/>
            <String ID="p1_w4" CONTENT="11" HPOS="138.902" VPOS="80.4420" WIDTH="12.2320" HEIGHT="10.2080" STYLEREFS="font2"/>
            <SP WIDTH="3.0580" VPOS="80.4420" HPOS="151.134"/>
            <String ID="p1_w5" CONTENT="Restricts" HPOS="154.192" VPOS="80.4420" WIDTH="47.0800" HEIGHT="10.2080" STYLEREFS="font2"/>
            <SP WIDTH="3.0580" VPOS="80.4420" HPOS="201.272"/>
            <String ID="p1_w6" CONTENT="Flavivirus" HPOS="204.330" VPOS="80.4420" WIDTH="51.3590" HEIGHT="10.2080" STYLEREFS="font2"/>
            <SP WIDTH="3.0580" VPOS="80.4420" HPOS="255.689"/>
            <String ID="p1_w7" CONTENT="Replication." HPOS="258.747" VPOS="80.4420" WIDTH="62.3370" HEIGHT="10.2080" STYLEREFS="font2"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="p1_b4" HPOS="67.0000" VPOS="112.996" HEIGHT="8.7800" WIDTH="5.0000">
          <TextLine WIDTH="5.0000" HEIGHT="8.7800" ID="p1_t4" HPOS="67.0000" VPOS="112.996"/>
        </TextBlock>
        <TextBlock ID="p1_b5" HPOS="67.0000" VPOS="138.294" HEIGHT="8.7800" WIDTH="5.0000">
          <TextLine WIDTH="5.0000" HEIGHT="8.7800" ID="p1_t5" HPOS="67.0000" VPOS="138.294"/>
        </TextBlock>
        <TextBlock ID="p1_b6" HPOS="89.9960" VPOS="137.038" HEIGHT="10.2080" WIDTH="419.244">
          <TextLine WIDTH="419.244" HEIGHT="10.2080" ID="p1_t6" HPOS="89.9960" VPOS="136.569">
            <String ID="p1_w10" CONTENT="Federico" HPOS="89.9960" VPOS="137.038" WIDTH="42.8010" HEIGHT="10.2080" STYLEREFS="font3"/>
            <SP WIDTH="3.0580" VPOS="137.038" HPOS="132.797"/>
            <String ID="p1_w11" CONTENT="Valdez" HPOS="135.855" VPOS="137.038" WIDTH="33.6270" HEIGHT="10.2080" STYLEREFS="font3"/>
            <String ID="p1_w12" CONTENT="a" HPOS="169.487" VPOS="136.569" WIDTH="3.8920" HEIGHT="6.4960" STYLEREFS="font4"/>
            <String ID="p1_w13" CONTENT="," HPOS="173.380" VPOS="137.038" WIDTH="3.0580" HEIGHT="10.2080" STYLEREFS="font3"/>
            <SP WIDTH="3.0580" VPOS="137.038" HPOS="176.438"/>
            <String ID="p1_w14" CONTENT="Julienne" HPOS="179.496" VPOS="137.038" WIDTH="40.9750" HEIGHT="10.2080" STYLEREFS="font3"/>
            <SP WIDTH="3.0580" VPOS="137.038" HPOS="220.471"/>
            <String ID="p1_w15" CONTENT="Salvador" HPOS="223.529" VPOS="137.038" WIDTH="43.4060" HEIGHT="10.2080" STYLEREFS="font3"/>
            <SP WIDTH="1.9500" VPOS="137.038" HPOS="266.935"/>

(see also https://github.com/kermitt2/grobid/issues/638#issuecomment-690310642)

de-code, Sep 10 '20 17:09

BTW, I found those by looking at regressions when using my models with the new version of GROBID compared to the previous version. This affected the title extraction with my models. It could be that the models are no longer tolerant of line numbers. I will re-generate the training data etc., and that problem might go away.

de-code, Sep 10 '20 19:09

Hi Daniel,

It's actually working in these examples with respect to these leading numbers. The remaining number 1 is a page number. As the block order follows the PDF stream order by default, the page number appears at the beginning of the page in the ALTO output, although visually it is located at the bottom of the page.

For the first PDF, for instance, the text content stream is:

  • first page:

        1
        Combinatorial effect of promoter activity, mRNA degradation and site-specific
        transcriptional pausing in modulating protein expression noise
        Sangjin Kim 1,2,3 , Christine Jacobs-Wagner 1,2,3,4*

  • second page:

        2
        ABSTRACT
        Genetically identical cells exhibit diverse phenotypes, even when experiencing the same
        environment. This phenomenon, in part, originates from cell-to-cell variability (noise) in protein

and so on, where the first token of each page is the page number.

The same applies to the two other examples: the page number comes at the beginning of the page token stream.
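The difference between stream order and visual position can be checked from the coordinates alone. The Python sketch below uses VPOS values copied from Example 1 and compares the stream order of the blocks with their order when sorted by VPOS. The `Block` tuple, the helper names and the 0.9 threshold for "near the bottom of the page" are illustrative assumptions, not something pdfalto or GROBID is known to use.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    block_id: str
    text: str
    vpos: float


PAGE_HEIGHT = 792.0  # HEIGHT attribute of the Page element in Example 1

# Blocks in ALTO (PDF stream) order, with VPOS values copied from Example 1.
STREAM_ORDER = [
    Block("p1_b1", "1", 732.329),
    Block("p1_b2", "Combinatorial effect of promoter activity, ...", 74.864),
    Block("p1_b4", "transcriptional", 102.464),
]


def visual_order(blocks: List[Block]) -> List[Block]:
    """Sort blocks top to bottom by their vertical position."""
    return sorted(blocks, key=lambda block: block.vpos)


def looks_like_page_number(block: Block, page_height: float,
                           bottom_fraction: float = 0.9) -> bool:
    """Heuristic: a digit-only block located in the bottom part of the page."""
    return block.text.isdigit() and block.vpos / page_height >= bottom_fraction


if __name__ == "__main__":
    print("stream order:", [b.block_id for b in STREAM_ORDER])
    print("visual order:", [b.block_id for b in visual_order(STREAM_ORDER)])
    for b in STREAM_ORDER:
        print(b.block_id, "page number candidate:",
              looks_like_page_number(b, PAGE_HEIGHT))
```

With these values, block `p1_b1` is first in the stream but last visually, which matches the explanation that the leading 1 is a page number rather than a line number.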

However, in the second PDF there are a few problems, for instance on the second and fifth pages, where line numbers 34 to 44 and 66 to 69 still appear in the ALTO output. In these cases there is a slight change of width and alignment on both sides starting from 34 (not easy to see), and my clustering method requires an exact alignment on at least the left or the right side. To cover that, I slightly relaxed the alignment check to within a 1.0 unit margin.

This case is working too now, following aaac4cd9379395f7e108b20a7a4c13e384545475.
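The relaxed alignment check described above might look roughly like the following. This is a Python sketch of the idea only, not the pdfalto implementation, which is not shown in this thread. The function name, the `(hpos, width)` representation, the reading of "1.0 unit margin" as a maximum spread of the edges, and the sample coordinates are all assumptions made for illustration.

```python
from typing import List, Tuple


def is_aligned(candidates: List[Tuple[float, float]], tolerance: float = 1.0) -> bool:
    """Return True if all tokens share a left or a right edge within `tolerance`.

    candidates: (hpos, width) pairs for the numeric tokens found in a margin.
    """
    if not candidates:
        return False
    lefts = [hpos for hpos, _ in candidates]
    rights = [hpos + width for hpos, width in candidates]
    left_aligned = max(lefts) - min(lefts) <= tolerance
    right_aligned = max(rights) - min(rights) <= tolerance
    return left_aligned or right_aligned


# Hypothetical coordinates: two narrow tokens followed by two wider ones
# whose left edge has shifted slightly.
MARGIN_TOKENS = [(48.425, 5.577), (48.425, 5.577), (47.9, 11.6), (47.9, 11.6)]

if __name__ == "__main__":
    print("exact alignment:", is_aligned(MARGIN_TOKENS, tolerance=0.0))
    print("relaxed alignment:", is_aligned(MARGIN_TOKENS, tolerance=1.0))
```

With an exact match, the small shift in the left edge splits the column of numbers, while a 1.0 unit tolerance keeps them in one cluster.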

kermitt2, Apr 05 '21 19:04