Header model: relative font size calculation includes spaces with a zero font size
The header model (and possibly other models?) calculates relative font sizes. To do that, it first determines the smallest, largest, and average font size of the tokens within the header block.
For example, for 449918v1 the following tokens and font sizes are used for the calculation:
Title (font_size=18.0), \n (font_size=0.0), \n (font_size=0.0), space (font_size=0.0), Catch (font_size=12.0), space (font_size=0.0), me (font_size=12.0)
Because whitespace and newline tokens are counted with a font size of 0.0, it ends up with a smallest font size of 0.0 (it should be 8.0) and an average of 6.2 (it should be ~12.0).
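The effect can be reproduced with a minimal sketch (illustrative Java, not GROBID's actual code; since the token list above is truncated, the average comes out at 6.0 here rather than the 6.2 computed over the full header):

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;

public class FontSizeStats {

    // Summarize token font sizes; when skipZero is true, whitespace and
    // newline tokens (font size 0.0) are excluded from the statistics.
    static DoubleSummaryStatistics stats(double[] sizes, boolean skipZero) {
        return Arrays.stream(sizes)
                     .filter(s -> !skipZero || s > 0.0)
                     .summaryStatistics();
    }

    public static void main(String[] args) {
        // Font sizes of the tokens listed above: "Title", two newlines,
        // a space, "Catch", a space, "me".
        double[] sizes = {18.0, 0.0, 0.0, 0.0, 12.0, 0.0, 12.0};

        DoubleSummaryStatistics buggy = stats(sizes, false);
        System.out.printf("buggy: min=%.1f avg=%.1f%n",
                buggy.getMin(), buggy.getAverage());   // min=0.0 avg=6.0

        DoubleSummaryStatistics fixed = stats(sizes, true);
        System.out.printf("fixed: min=%.1f avg=%.1f%n",
                fixed.getMin(), fixed.getAverage());   // min=12.0 avg=14.0
    }
}
```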
The pdfalto XML:
<Page ID="Page1" PHYSICAL_IMG_NR="1" WIDTH="595.300" HEIGHT="841.900">
<PrintSpace>
<TextBlock ID="p1_b1" HPOS="517.725" VPOS="783.166" HEIGHT="9.9110" WIDTH="5.5770">
<TextLine WIDTH="5.5770" HEIGHT="9.9110" ID="p1_t1" HPOS="517.725" VPOS="783.166">
<String ID="p1_w1" CONTENT="1" HPOS="517.725" VPOS="783.166" WIDTH="5.5770" HEIGHT="9.9110" STYLEREFS="font0"/>
</TextLine>
</TextBlock>
<TextBlock ID="p1_b2" HPOS="48.4250" VPOS="93.8520" HEIGHT="9.9110" WIDTH="5.5770">
<TextLine WIDTH="5.5770" HEIGHT="9.9110" ID="p1_t2" HPOS="48.4250" VPOS="93.8520"/>
</TextBlock>
<TextBlock ID="p1_b3" HPOS="72.0020" VPOS="88.8880" HEIGHT="15.8040" WIDTH="36.0000">
<TextLine WIDTH="36.0000" HEIGHT="15.8040" ID="p1_t3" HPOS="72.0020" VPOS="88.8880">
<String ID="p1_w3" CONTENT="Title" HPOS="72.0020" VPOS="88.8880" WIDTH="36.0000" HEIGHT="15.8040" STYLEREFS="font1"/>
</TextLine>
</TextBlock>
<TextBlock ID="p1_b4" HPOS="48.4250" VPOS="141.647" HEIGHT="9.9110" WIDTH="5.5770">
<TextLine WIDTH="5.5770" HEIGHT="9.9110" ID="p1_t4" HPOS="48.4250" VPOS="141.647"/>
</TextBlock>
<TextBlock ID="p1_b5" HPOS="72.0020" VPOS="140.655" HEIGHT="10.5360" WIDTH="451.308">
<TextLine WIDTH="451.308" HEIGHT="10.5360" ID="p1_t5" HPOS="72.0020" VPOS="140.655">
<String ID="p1_w5" CONTENT="Catch" HPOS="72.0020" VPOS="140.655" WIDTH="27.9840" HEIGHT="10.5360" STYLEREFS="font2"/>
<SP WIDTH="2.4840" VPOS="140.655" HPOS="99.9860"/>
<String ID="p1_w6" CONTENT="me" HPOS="102.470" VPOS="140.655" WIDTH="14.6640" HEIGHT="10.5360" STYLEREFS="font2"/>
<SP WIDTH="2.4840" VPOS="140.655" HPOS="117.134"/>
<String ID="p1_w7" CONTENT="if" HPOS="119.618" VPOS="140.655" WIDTH="7.3320" HEIGHT="10.5360" STYLEREFS="font2"/>
<SP WIDTH="2.4840" VPOS="140.655" HPOS="126.950"/>
<String ID="p1_w8" CONTENT="you" HPOS="129.434" VPOS="140.655" WIDTH="18.0000" HEIGHT="10.5360" STYLEREFS="font2"/>
<SP WIDTH="2.4840" VPOS="140.655" HPOS="147.434"/>
<String ID="p1_w9" CONTENT="can:" HPOS="149.918" VPOS="140.655" WIDTH="19.9920" HEIGHT="10.5360" STYLEREFS="font2"/>
<SP WIDTH="2.4720" VPOS="140.655" HPOS="169.910"/>
<!-- ... -->
Relevant code block: https://github.com/kermitt2/grobid/blob/0.6.2/grobid-core/src/main/java/org/grobid/core/engines/HeaderParser.java#L402-L433
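The relative font size features discussed later in the thread ("smallestFont", "largerThanAverageFont", etc.) boil down to comparing each token against these block-level statistics. A hypothetical sketch (the feature names mirror the thread, not GROBID's exact implementation) shows how a 0.0 minimum distorts them: no visible token can ever be the smallest, and with a dragged-down average almost every visible token looks larger than average:

```java
public class RelativeFontFeatures {

    // Hypothetical boolean features for one token, derived from the
    // block-level font size statistics (not GROBID's exact code).
    static boolean[] features(double size, double min, double max, double avg) {
        boolean smallestFont = size <= min;
        boolean largestFont = size >= max;
        boolean largerThanAverageFont = size > avg;
        return new boolean[] {smallestFont, largestFont, largerThanAverageFont};
    }

    public static void main(String[] args) {
        // "me" at 12pt with the buggy statistics (min=0.0, avg=6.2):
        boolean[] buggy = features(12.0, 0.0, 18.0, 6.2);
        System.out.println(java.util.Arrays.toString(buggy)); // [false, false, true]

        // Same token with whitespace excluded (min=12.0, avg=14.0):
        boolean[] fixed = features(12.0, 12.0, 18.0, 14.0);
        System.out.println(java.util.Arrays.toString(fixed)); // [true, false, false]
    }
}
```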
Thank you @de-code! It's only used in the header model for the moment (I was waiting to have more training data for the fulltext model before adding it there too).
This is the usual disappointing case: when we fix the feature as originally intended, we get lower results (it happens often!):
Before the fix:
PMC_sample_1943
==== Levenshtein Matching ===== (Minimum Levenshtein distance at 0.8)
label             accuracy  precision  recall  f1     support
abstract          97.12     88.48      86.81   87.64  1911
authors           99.15     96.43      96.03   96.23  1941
first_author      99.16     96.48      96.08   96.28  1941
keywords          96.78     86.02      80.72   83.29  1380
title             99.49     97.83      97.58   97.71  1943
all (micro avg.)  98.34     93.58      92.12   92.85  9116
biorxiv-10k-test-2000
==== Levenshtein Matching ===== (Minimum Levenshtein distance at 0.8)
label             accuracy  precision  recall  f1     support
abstract          94.17     76.83      74.21   75.5   1989
authors           98.07     92.46      91.44   91.95  1998
first_author      98.69     95.19      94.24   94.71  1996
keywords          97.54     78.31      79.62   78.96  839
title             98.26     95.25      92.3    93.75  1999
all (micro avg.)  97.35     88.85      87.26   88.05  8821
After the fix:
PMC_sample_1943
==== Levenshtein Matching ===== (Minimum Levenshtein distance at 0.8)
label             accuracy  precision  recall  f1     support
abstract          97.13     88.54      86.92   87.72  1911
authors           99.09     95.92      95.78   95.85  1941
first_author      99.18     96.34      96.19   96.26  1941
keywords          96.69     84.1       80.51   82.27  1380
title             99.52     97.99      97.74   97.86  1943
all (micro avg.)  98.32     93.19      92.11   92.65  9116
biorxiv-10k-test-2000
==== Levenshtein Matching ===== (Minimum Levenshtein distance at 0.8)
label             accuracy  precision  recall  f1     support
abstract          94.22     76.91      74.36   75.61  1989
authors           98.05     92.41      91.34   91.87  1998
first_author      98.63     94.94      93.94   94.43  1996
keywords          97.43     77.53      78.55   78.03  839
title             98.11     95.36      91.6    93.44  1999
all (micro avg.)  97.29     88.73      86.94   87.83  8821
So it's better with the bug ;)
Or more precisely, capturing only the largest font size is enough; the two other relative font size features do not appear to help with the current amount of training data.
Is that with re-trained models?
yes
I was thinking whether maybe the "bug" was providing a proxy for superscript or subscript.
I will remove the "smallestFont" boolean feature (never used) and try to replace "largerThanAverageFont" by a "superscript" boolean.
The "superscript" feature degrades the accuracy even more, so I am giving up and keeping the features as they are for the moment. With more training data in the future, new features will be explored!
Just a thought: maybe the superscript and subscript font styles aren't always detected reliably, whereas the font size is always available; so maybe this incorrect calculation happens to be a better proxy than the explicit superscript/subscript styles?
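A font-size-based proxy along those lines could look like the following sketch (hypothetical; the 0.75 ratio is an illustrative threshold, not anything from GROBID). A token is flagged as a likely super/subscript when its size is clearly smaller than the body size, regardless of whether the extractor reported a superscript style:

```java
public class ScriptProxy {

    // Hypothetical heuristic: treat a token as a likely superscript or
    // subscript when its font size is well below the surrounding body
    // font size. Zero-size (whitespace) tokens are never flagged.
    static boolean likelyScript(double tokenSize, double bodySize) {
        return tokenSize > 0.0 && tokenSize < 0.75 * bodySize;
    }

    public static void main(String[] args) {
        System.out.println(likelyScript(8.0, 12.0));  // affiliation marker -> true
        System.out.println(likelyScript(12.0, 12.0)); // body text -> false
    }
}
```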