Grobid missing some text sequences
Hello,
I was trying to use Grobid on a sample PDF and found that it is missing some of the tokens. This issue might be similar to another one that is already open:
https://github.com/kermitt2/grobid/issues/812
'Missing (very few tokens) in the generated segmentation training data #812'
I am using the latest version of Grobid, i.e. 0.7.0.
Here is an image of page 8 of the sample PDF (attached):
[Image: Screenshot 2021-08-26 at 11 34 22 AM]
missing tokens in XML extraction:
The PDF and XML files are uploaded (see Archive.zip).
I would really appreciate it if you could have a look.
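For anyone who wants to reproduce the check, here is a minimal sketch of how the missing tokens could be listed. It assumes you already have a plain-text rendering of the PDF (for example from an external text extractor) and the TEI XML returned by Grobid as strings; the function names `tokenize` and `missing_tokens` are only illustrative and are not part of Grobid.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def missing_tokens(reference_text, tei_xml):
    """Return tokens that occur more often in the reference text than in the TEI.

    reference_text: plain text of the PDF (assumed to come from another tool).
    tei_xml: the TEI document produced by Grobid, as a string.
    """
    root = ET.fromstring(tei_xml)
    # Collect every text node of the TEI, whatever element it sits in.
    tei_text = " ".join(root.itertext())
    ref_counts = Counter(tokenize(reference_text))
    tei_counts = Counter(tokenize(tei_text))
    # Counter subtraction keeps only tokens with a positive remaining count.
    return ref_counts - tei_counts
```

This compares bags of words only, so it cannot tell you where a token was lost, and differences in hyphenation or ligatures between the two extractions will show up as false positives.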
Thank you @mv96 !
This is not related to #812 and looks like a bug when decoding the labeled sequence.
It seems to me that this is due to incorrect labelling as a figure, and the "missing part" ends up as a figure somewhere else:
[Image]
It seems that some training data would be helpful for the fulltext model. If we wanted to use this article as training data, I'm wondering whether we could: the article seems to be CC-BY on arXiv, but the attached version does not have the arXiv "mark".
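To check whether a "missing" passage has simply been moved into a figure, a small sketch like the following could help. It assumes the Grobid output uses the standard TEI namespace and `figure` elements carrying an `xml:id`; the helper name `figures_containing` is hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def figures_containing(tei_xml, phrase):
    """Return (xml:id, text) for every <figure> whose text contains the phrase."""
    root = ET.fromstring(tei_xml)
    # Normalise case and whitespace so line breaks in the TEI do not matter.
    needle = " ".join(phrase.lower().split())
    hits = []
    for fig in root.iterfind(".//tei:figure", TEI_NS):
        text = " ".join(" ".join(fig.itertext()).lower().split())
        if needle in text:
            hits.append((fig.get(XML_ID), text))
    return hits
```

Searching for a few words taken from the passage that seems to be missing on page 8 should then show which figure element, if any, absorbed it.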
Yes @lfoppiano ! I should have looked at the resulting TEI document.
I am redesigning the figure extraction, and hopefully cases like this, without any "graphic element" anchors, will never be considered figures and will stay as "normal" body text.
Indeed, as far as the figures are concerned, we don't necessarily need more training data.
I was talking about the other part that is not correctly recognised. For example, the formula identified in the first comment (formula 9) should not be tagged as a formula, should it?
> I was talking about the other part that is not correctly recognised. For example, the formula identified in the first comment (formula 9) should not be tagged as a formula, should it?
Currently I think everything after "suppose" would be tagged as a formula:
<formula xml:id="formula_10">Y = (Y 1 , . . . , Y m ) ( Y ) ⊆ Supp(Y ), then H Y = H (Y ) − KL Y Y .</formula>
It doesn't look great with the "Lemma" structure.
It shows that there will be a need to define a better schema for the formula elements, and something adapted for lemmas, proofs, etc. This is currently very basic because there is too little training data to go beyond a superficial treatment of equations and mathematical objects.
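As a rough way to spot formula elements that have swallowed running text, like the "then" in the example above, something along these lines could be used. This is only a heuristic sketch: the word list `PROSE_WORDS` and the function name `prose_like_formulas` are my own assumptions, and nothing here is part of Grobid.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical list of words that usually belong to prose, not to an equation.
PROSE_WORDS = {"then", "suppose", "where", "that", "with", "such", "hence", "thus", "let"}


def prose_like_formulas(tei_xml):
    """Return (xml:id, prose words) for each <formula> containing prose words."""
    root = ET.fromstring(tei_xml)
    flagged = []
    for formula in root.iterfind(".//tei:formula", TEI_NS):
        text = "".join(formula.itertext())
        # Keep alphabetic tokens only, then filter against the prose word list.
        tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
        words = [t for t in tokens if t in PROSE_WORDS]
        if words:
            flagged.append((formula.get(XML_ID), words))
    return flagged
```

The flagged elements are candidates for manual review or for correction in training data; the word list would need tuning for real documents.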
Hello,
I was recently reading the Nougat paper (https://arxiv.org/pdf/2308.13418.pdf), and one of its tables directly compares performance with Grobid.
The table shows significant gains over GROBID on the task of formula identification, which could possibly cover the problem discussed in this thread.
As a small experiment, I tried giving the same PDF to an open implementation of Nougat,
hosted as a Hugging Face Space: https://huggingface.co/spaces/ysharma/nougat
I can see that the Nougat model works decently in this case.
My task requires the segmentation of paragraphs (text cut into blocks) rather than plain-text output, so I was wondering whether it would be possible to use a Nougat backbone with Grobid?