Header model: relative font size calculation includes spaces with a zero font size
The header model (and possibly other models?) calculates relative font sizes. To do that, it first determines the smallest, largest, and average font size of the tokens within the header block.
For example, for 449918v1 the following tokens and font sizes are used for the calculation:
Title (font_size=18.0), \n (font_size=0.0), \n (font_size=0.0), space (font_size=0.0), Catch (font_size=12.0), space (font_size=0.0), me (font_size=12.0)
Because whitespace and newline tokens are counted with a font size of 0.0, it ends up with a smallest font size of 0.0 (it should be 8.0) and an average of 6.2 (it should be ~12.0).
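The effect can be reproduced with a minimal sketch (illustrative Java, not GROBID's actual code; since the token list above is truncated, the average comes out at 6.0 here rather than the 6.2 computed over the full header):

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;

public class FontSizeStats {

    // Summarize token font sizes; when skipZero is true, whitespace and
    // newline tokens (font size 0.0) are excluded from the statistics.
    static DoubleSummaryStatistics stats(double[] sizes, boolean skipZero) {
        return Arrays.stream(sizes)
                     .filter(s -> !skipZero || s > 0.0)
                     .summaryStatistics();
    }

    public static void main(String[] args) {
        // Font sizes of the tokens listed above: "Title", two newlines,
        // a space, "Catch", a space, "me".
        double[] sizes = {18.0, 0.0, 0.0, 0.0, 12.0, 0.0, 12.0};

        DoubleSummaryStatistics buggy = stats(sizes, false);
        System.out.printf("buggy: min=%.1f avg=%.1f%n",
                buggy.getMin(), buggy.getAverage());   // min=0.0 avg=6.0

        DoubleSummaryStatistics fixed = stats(sizes, true);
        System.out.printf("fixed: min=%.1f avg=%.1f%n",
                fixed.getMin(), fixed.getAverage());   // min=12.0 avg=14.0
    }
}
```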
The pdfalto XML:
<Page ID="Page1" PHYSICAL_IMG_NR="1" WIDTH="595.300" HEIGHT="841.900">
<PrintSpace>
<TextBlock ID="p1_b1" HPOS="517.725" VPOS="783.166" HEIGHT="9.9110" WIDTH="5.5770">
<TextLine WIDTH="5.5770" HEIGHT="9.9110" ID="p1_t1" HPOS="517.725" VPOS="783.166">
<String ID="p1_w1" CONTENT="1" HPOS="517.725" VPOS="783.166" WIDTH="5.5770" HEIGHT="9.9110" STYLEREFS="font0"/>
</TextLine>
</TextBlock>
<TextBlock ID="p1_b2" HPOS="48.4250" VPOS="93.8520" HEIGHT="9.9110" WIDTH="5.5770">
<TextLine WIDTH="5.5770" HEIGHT="9.9110" ID="p1_t2" HPOS="48.4250" VPOS="93.8520"/>
</TextBlock>
<TextBlock ID="p1_b3" HPOS="72.0020" VPOS="88.8880" HEIGHT="15.8040" WIDTH="36.0000">
<TextLine WIDTH="36.0000" HEIGHT="15.8040" ID="p1_t3" HPOS="72.0020" VPOS="88.8880">
<String ID="p1_w3" CONTENT="Title" HPOS="72.0020" VPOS="88.8880" WIDTH="36.0000" HEIGHT="15.8040" STYLEREFS="font1"/>
</TextLine>
</TextBlock>
<TextBlock ID="p1_b4" HPOS="48.4250" VPOS="141.647" HEIGHT="9.9110" WIDTH="5.5770">
<TextLine WIDTH="5.5770" HEIGHT="9.9110" ID="p1_t4" HPOS="48.4250" VPOS="141.647"/>
</TextBlock>
<TextBlock ID="p1_b5" HPOS="72.0020" VPOS="140.655" HEIGHT="10.5360" WIDTH="451.308">
<TextLine WIDTH="451.308" HEIGHT="10.5360" ID="p1_t5" HPOS="72.0020" VPOS="140.655">
<String ID="p1_w5" CONTENT="Catch" HPOS="72.0020" VPOS="140.655" WIDTH="27.9840" HEIGHT="10.5360" STYLEREFS="font2"/>
<SP WIDTH="2.4840" VPOS="140.655" HPOS="99.9860"/>
<String ID="p1_w6" CONTENT="me" HPOS="102.470" VPOS="140.655" WIDTH="14.6640" HEIGHT="10.5360" STYLEREFS="font2"/>
<SP WIDTH="2.4840" VPOS="140.655" HPOS="117.134"/>
<String ID="p1_w7" CONTENT="if" HPOS="119.618" VPOS="140.655" WIDTH="7.3320" HEIGHT="10.5360" STYLEREFS="font2"/>
<SP WIDTH="2.4840" VPOS="140.655" HPOS="126.950"/>
<String ID="p1_w8" CONTENT="you" HPOS="129.434" VPOS="140.655" WIDTH="18.0000" HEIGHT="10.5360" STYLEREFS="font2"/>
<SP WIDTH="2.4840" VPOS="140.655" HPOS="147.434"/>
<String ID="p1_w9" CONTENT="can:" HPOS="149.918" VPOS="140.655" WIDTH="19.9920" HEIGHT="10.5360" STYLEREFS="font2"/>
<SP WIDTH="2.4720" VPOS="140.655" HPOS="169.910"/>
<!-- ... -->
Relevant code block: https://github.com/kermitt2/grobid/blob/0.6.2/grobid-core/src/main/java/org/grobid/core/engines/HeaderParser.java#L402-L433
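The relative font size features discussed later in the thread ("smallestFont", "largerThanAverageFont", etc.) boil down to comparing each token against these block-level statistics. A hypothetical sketch (the feature names mirror the thread, not GROBID's exact implementation) shows how a 0.0 minimum distorts them: no visible token can ever be the smallest, and with a dragged-down average almost every visible token looks larger than average:

```java
public class RelativeFontFeatures {

    // Hypothetical boolean features for one token, derived from the
    // block-level font size statistics (not GROBID's exact code).
    static boolean[] features(double size, double min, double max, double avg) {
        boolean smallestFont = size <= min;
        boolean largestFont = size >= max;
        boolean largerThanAverageFont = size > avg;
        return new boolean[] {smallestFont, largestFont, largerThanAverageFont};
    }

    public static void main(String[] args) {
        // "me" at 12pt with the buggy statistics (min=0.0, avg=6.2):
        boolean[] buggy = features(12.0, 0.0, 18.0, 6.2);
        System.out.println(java.util.Arrays.toString(buggy)); // [false, false, true]

        // Same token with whitespace excluded (min=12.0, avg=14.0):
        boolean[] fixed = features(12.0, 12.0, 18.0, 14.0);
        System.out.println(java.util.Arrays.toString(fixed)); // [true, false, false]
    }
}
```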
Thank you @de-code! It's only used in the header model for the moment (I was waiting to have more training data for the fulltext model before adding it there too).
This is the usual disappointing case: when we fix the feature as originally intended, we get lower results (it happens often!):
Before the fix:
PMC_sample_1943
==== Levenshtein Matching ===== (Minimum Levenshtein distance at 0.8)
label             accuracy  precision  recall  f1     support
abstract          97.12     88.48      86.81   87.64  1911
authors           99.15     96.43      96.03   96.23  1941
first_author      99.16     96.48      96.08   96.28  1941
keywords          96.78     86.02      80.72   83.29  1380
title             99.49     97.83      97.58   97.71  1943
all (micro avg.)  98.34     93.58      92.12   92.85  9116
biorxiv-10k-test-2000
==== Levenshtein Matching ===== (Minimum Levenshtein distance at 0.8)
label             accuracy  precision  recall  f1     support
abstract          94.17     76.83      74.21   75.5   1989
authors           98.07     92.46      91.44   91.95  1998
first_author      98.69     95.19      94.24   94.71  1996
keywords          97.54     78.31      79.62   78.96  839
title             98.26     95.25      92.3    93.75  1999
all (micro avg.)  97.35     88.85      87.26   88.05  8821
After the fix:
PMC_sample_1943
==== Levenshtein Matching ===== (Minimum Levenshtein distance at 0.8)
label             accuracy  precision  recall  f1     support
abstract          97.13     88.54      86.92   87.72  1911
authors           99.09     95.92      95.78   95.85  1941
first_author      99.18     96.34      96.19   96.26  1941
keywords          96.69     84.1       80.51   82.27  1380
title             99.52     97.99      97.74   97.86  1943
all (micro avg.)  98.32     93.19      92.11   92.65  9116
biorxiv-10k-test-2000
==== Levenshtein Matching ===== (Minimum Levenshtein distance at 0.8)
label             accuracy  precision  recall  f1     support
abstract          94.22     76.91      74.36   75.61  1989
authors           98.05     92.41      91.34   91.87  1998
first_author      98.63     94.94      93.94   94.43  1996
keywords          97.43     77.53      78.55   78.03  839
title             98.11     95.36      91.6    93.44  1999
all (micro avg.)  97.29     88.73      86.94   87.83  8821
So it's better with the bug ;)
Or more precisely, capturing only the largest font size is enough; the two other relative font size features do not appear to help with the current amount of training data.
Is that with re-trained models?
yes
I was thinking whether maybe the "bug" was providing a proxy for superscript or subscript.
I will remove the "smallestFont" boolean feature (never used) and try to replace "largerThanAverageFont" by a "superscript" boolean.
The "superscript" feature degrades the accuracy even more, so I am giving up and keeping the features as they are for the moment. With more training data in the future, new features will be explored!
Just a thought: maybe the superscript and subscript font styles aren't always detected reliably, whereas the font size is always available; so maybe this incorrect calculation happens to be a better proxy than the explicit superscript/subscript styles?
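A font-size-based proxy along those lines could look like the following sketch (hypothetical; the 0.75 ratio is an illustrative threshold, not anything from GROBID). A token is flagged as a likely super/subscript when its size is clearly smaller than the body size, regardless of whether the extractor reported a superscript style:

```java
public class ScriptProxy {

    // Hypothetical heuristic: treat a token as a likely superscript or
    // subscript when its font size is well below the surrounding body
    // font size. Zero-size (whitespace) tokens are never flagged.
    static boolean likelyScript(double tokenSize, double bodySize) {
        return tokenSize > 0.0 && tokenSize < 0.75 * bodySize;
    }

    public static void main(String[] args) {
        System.out.println(likelyScript(8.0, 12.0));  // affiliation marker -> true
        System.out.println(likelyScript(12.0, 12.0)); // body text -> false
    }
}
```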