Grobid missing some text sequences

Hello,

I was trying to use Grobid on a sample PDF and found that it is missing some of the tokens. This issue might be similar to another one already open at

https://github.com/kermitt2/grobid/issues/812

'Missing (very few tokens) in the generated segmentation training data #812'

I am using the latest version of Grobid, i.e. 0.7.0.

Here is an image of page 8 of the sample PDF (attached), showing the tokens missing from the XML extraction:

[Image: WhatsApp Image 2021-08-26 at 11 33 11 AM]

The PDF and XML files are uploaded (see Archive.zip).

I would really appreciate it if you could have a look.

Aug 26 '21 09:08 mv96

Thank you @mv96 !

This is not related to #812 and looks like a bug when decoding the labeled sequence.

Aug 26 '21 09:08 kermitt2

It seems to me that this is due to incorrect labelling as a figure, and the "missing part" ends up as a figure somewhere else in the output.

It seems that some training data would be helpful for the fulltext model. If we want to use this article as training data, I'm wondering whether we could: the article seems to be CC-BY on arXiv, but the attached version does not have the arXiv "mark".

Aug 27 '21 00:08 lfoppiano

Yes @lfoppiano ! I should have looked at the resulting TEI document.

I am redesigning the figure extraction, and hopefully cases like this, without any "graphic element" anchors, will never be considered figures and will stay as "normal" body text.
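For anyone who wants to check their own output for this pattern in the meantime, here is a minimal sketch that lists figures in a TEI document that carry text but no graphic anchor. It assumes the result uses the standard TEI namespace and that figures with an image anchor contain a `<graphic>` element. The function name and the sample document are hypothetical, and this is not how Grobid itself decides what a figure is.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, hand-written TEI fragment (not taken from the attached files).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<figure xml:id="fig_0"><head>Figure 1</head><graphic type="bitmap"/></figure>
<figure xml:id="fig_1"><figDesc>Lemma text wrongly labelled as a figure</figDesc></figure>
<figure type="table" xml:id="tab_0"><head>Table 1</head></figure>
</body></text></TEI>"""


def figures_without_graphic(tei_xml):
    """Return (xml:id, text) for each non-table <figure> with no <graphic> descendant."""
    root = ET.fromstring(tei_xml)
    suspects = []
    for fig in root.iterfind(".//tei:figure", TEI_NS):
        # Tables are also encoded as <figure type="table">; skip them here.
        if fig.get("type") == "table":
            continue
        if fig.find(".//tei:graphic", TEI_NS) is None:
            # Collapse whitespace so the swallowed text is easy to read.
            text = " ".join("".join(fig.itertext()).split())
            suspects.append((fig.get(XML_ID), text))
    return suspects


if __name__ == "__main__":
    for fig_id, text in figures_without_graphic(SAMPLE):
        print(fig_id, "->", text)
```

Any figure reported this way is only a candidate: body text that seems to be missing may be sitting inside one of them.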

Aug 27 '21 05:08 kermitt2

Indeed, as far as the figures are concerned, we don't necessarily need more training data.

I was talking about the other part that is not correctly recognised. For example, the formula identified in the first comment (formula 9) should not be tagged as a formula, should it?

Aug 27 '21 05:08 lfoppiano

I was talking about the other part that is not correctly recognised. For example, the formula identified in the first comment (formula 9) should not be tagged as a formula, should it?

Currently I think everything after "suppose" would be tagged as a formula:

<formula xml:id="formula_10">Y = (Y 1 , . . . , Y m )  ( Y ) ⊆ Supp(Y ), then H Y = H (Y ) − KL Y Y .</formula>

It doesn't look great with the "Lemma" structure.

It shows that there will be a need to define a better schema for the formula elements, and something adapted for lemmas, proofs, etc. This is currently very basic because there is too little training data to treat equations and mathematical objects less superficially.
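To spot this kind of over-extended formula in a result file, a rough sketch like the following can help. It lists each `<formula>` element together with the longer alphabetic words it contains, on the assumption that ordinary words inside a formula hint at swallowed prose. The heuristic, the function name, and the `min_len` parameter are hypothetical and are not part of Grobid.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The formula from this comment, wrapped in a minimal TEI root for illustration.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<formula xml:id="formula_10">Y = (Y 1 , . . . , Y m )  ( Y ) ⊆ Supp(Y ), then H Y = H (Y ) − KL Y Y .</formula>
</body></text></TEI>"""


def formulas_with_prose(tei_xml, min_len=4):
    """Return (xml:id, words) per <formula>, where words are alphabetic runs of min_len+ letters."""
    root = ET.fromstring(tei_xml)
    report = []
    for formula in root.iterfind(".//tei:formula", TEI_NS):
        text = " ".join("".join(formula.itertext()).split())
        # Hypothetical heuristic: long alphabetic runs suggest prose rather than symbols.
        words = re.findall(r"[A-Za-z]{%d,}" % min_len, text)
        report.append((formula.get(XML_ID), words))
    return report


if __name__ == "__main__":
    for formula_id, words in formulas_with_prose(SAMPLE):
        print(formula_id, words)
```

On the formula above this flags "Supp" and "then". The first is a legitimate operator name, so the output is only a starting point for manual review.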

Aug 27 '21 05:08 kermitt2

Hello,

I was recently reading the Nougat paper (https://arxiv.org/pdf/2308.13418.pdf), and one of its tables directly compares its performance with Grobid.

The table shows significant gains over Grobid on the task of formula identification, which could possibly cover the problem discussed in this thread.

As a small experiment, I gave the same PDF to an open implementation of Nougat, hosted as a Hugging Face space:

https://huggingface.co/spaces/ysharma/nougat

I can see that the Nougat model works decently in this case.

My task requires segmentation into paragraphs (text cut into blocks) rather than plain text output, so I was wondering whether it would be possible to use the Nougat backbone with Grobid?

Oct 12 '23 14:10 mv96
