grobid
Annotating formulas, listings and figures
Hi, thanks for your awesome work!
I have some annotation questions:
- Formula labeling: https://grobid.readthedocs.io/en/latest/training/fulltext/#formulas advises not to include the brackets in the label. The training data includes them, though. One of multiple samples: https://github.com/kermitt2/grobid/blob/be9e6523d71518544e1394f5be56bda0e55819ef/grobid-trainer/resources/dataset/shorttext/corpus/tei/submission_106.training.shorttext.tei.xml#L10C177-L10C177
- Listings: How should I annotate listings like Algorithm 1 in [1]? Are they figures? If so, what would be the label?
<figure>
<head>Algorithm </head>
<label>1</label>
<figDesc>Online fitting of E from events and images<lb/></figDesc>
</figure>
- Figures: I assume I should add missing figures to the `figure.tei.xml` file? Should they follow the order in which they appear in the fulltext?
The following is obsolete: I found the `trash` tag in the training data
Should the figure entries contain all text and tags from the fulltext, with the relevant parts (head, label, figDesc) additionally annotated? Here is an example from [1] again:
```xml
50 100 150 200 250 300
160 180 200 220 240 260 280 300
Smooth Pursuit
140 160 180 200 220 240 260
0 50 100 150 200 250 300
Pixel Coordinate
Pixel Coordinate
Pupil in Camera Space
Gaze Point in Screen Space
Gaze Point in Screen Space
20°6 3°2 0°6 3°4 0°9 5°9 5°9 5°P
ixel Coordinate
Pixel Coordinate
Fig.
.
users in different colors. The figure is organized into grids; the first row plots smooth pursuit data and the second row plots random saccadic data.
```
@Schroedi regarding point 1, the documentation refers to the `fulltext` model, so you should check the data under `grobid-trainer/resources/dataset/fulltext`. The annotations there should all follow the guidelines.
The example you linked belongs to another model, which I'm not sure is even used (its last update was in 2017 😅).
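If it helps, here is a minimal sketch for spotting formula labels that still include brackets. It is not part of GROBID: the function name `bracketed_formula_labels` and the regex-based parsing are assumptions for illustration, and it only assumes that labels appear as `<label>` inside `<formula>`, as described in the annotation guidelines.

```python
import re

# Hypothetical helper, not a GROBID API. Regexes are used instead of an XML
# parser so that partially corrected training files can still be scanned.
FORMULA_RE = re.compile(r"<formula\b[^>]*>(.*?)</formula>", re.DOTALL)
LABEL_RE = re.compile(r"<label\b[^>]*>(.*?)</label>", re.DOTALL)
BRACKETS = "()[]"


def bracketed_formula_labels(tei_text):
    """Return formula labels whose text starts or ends with a bracket."""
    flagged = []
    for formula in FORMULA_RE.finditer(tei_text):
        for label in LABEL_RE.finditer(formula.group(1)):
            text = label.group(1).strip()
            if text and (text[0] in BRACKETS or text[-1] in BRACKETS):
                flagged.append(text)
    return flagged


# Made-up sample: the first label includes the brackets, the second does not.
sample = (
    "<formula>E = mc^2 <label>(1)</label></formula>"
    "<formula>a + b = c (<label>2</label>)</formula>"
)
print(bracketed_formula_labels(sample))
```

Running it over the text of each file in a corpus directory would give a list of labels to review by hand; it does not rewrite anything.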
Regarding point 3, if figures are missing in `figure.tei.xml`, it means that one of the upstream models, either `segmentation` or `fulltext`, tagged something wrongly.
In this case you should examine the training data generated by the upstream models. See Fig. 2 in https://grobid.readthedocs.io/en/latest/Principles/ for more information on what is upstream and downstream.
I recommend working in batches of documents: check one model's data for the whole batch, then move to the next model. It usually takes time to get familiar with each model's structure, so staying on the same model before moving to the next may be more efficient. It's just a recommendation, though.
I'll do my best to explain what I have been doing; feel free to point out any unclear parts. 😅
1. First check the generated data for the `segmentation` model:
   - if there are corrections, keep the corrected file
   - if the model output is already good, ignore the file
2. Then move to the `fulltext` model's generated files. There are three possibilities:
   1. the segmentation model did not lose any data, so the body part of the article is complete. You can correct the file.
   2. the segmentation model mislabeled a substantial part of the document and this part is missing. You should ignore the file for the time being, until the segmentation model has been retrained with the current document's segmentation training file included.
   3. the segmentation model mislabeled a substantial part of the document in the sense that more data is present than should be. You can remove the surplus and correct the rest of the file.
3. Repeat point 2 for the next downstream model (e.g. the figure model).
4. After you have finished a batch of documents, you can retrain the segmentation model and regenerate the training data for the documents whose `fulltext` generated file was missing data (point 2.2).
Like the training process itself, this procedure can be carried out iteratively. Let me know if any points are not clear.
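The batch workflow above can be sketched as a small helper that groups generated training files by model, upstream first, so that each model is reviewed for the whole batch before moving on. This is only an illustration under stated assumptions: the file naming pattern `<document>.training.<model>.tei.xml` is inferred from the file names in this thread, the model list is limited to the three models discussed here, and `review_plan` is a hypothetical name, not a GROBID function.

```python
import re
from collections import defaultdict

# Upstream-to-downstream order of the models discussed in this thread
# (assumption for illustration; GROBID has more models than these three).
MODEL_ORDER = ["segmentation", "fulltext", "figure"]

# Assumed naming pattern: <document>.training.<model>.tei.xml
NAME_RE = re.compile(r"^(?P<doc>.+)\.training\.(?P<model>[^.]+)\.tei\.xml$")


def review_plan(filenames):
    """Group training files by model, listing upstream models first."""
    by_model = defaultdict(list)
    for name in filenames:
        match = NAME_RE.match(name)
        if match and match.group("model") in MODEL_ORDER:
            by_model[match.group("model")].append(match.group("doc"))
    return [
        (model, sorted(by_model[model]))
        for model in MODEL_ORDER
        if by_model[model]
    ]


# Made-up file names for two documents in one batch.
files = [
    "paper_a.training.fulltext.tei.xml",
    "paper_b.training.segmentation.tei.xml",
    "paper_a.training.segmentation.tei.xml",
    "paper_a.training.figure.tei.xml",
    "notes.txt",
]
for model, docs in review_plan(files):
    print(model, docs)
```

Files that do not match the assumed pattern are skipped silently. Deciding whether a `fulltext` file falls under case 2.1, 2.2 or 2.3 still needs a manual look at the file.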
Thank you for taking the time and for your detailed answer! It really helped me.
> @Schroedi regarding point 1, the documentation refers to the `fulltext` model, so you should check the data under `grobid-trainer/resources/dataset/fulltext`. The annotations there should all follow the guidelines.
> The example you linked belongs to another model, which I'm not sure is even used (its last update was in 2017 😅).
You're right. I think this one should be fixed though: https://github.com/kermitt2/grobid/pull/1107
The last open point is 2, listings. Is there any special handling, or should they just be annotated as figures?
Thanks for PR #1107. We might merge it at the next iteration on the models (which might happen in a few months), so that we don't forget about it.
As for listings, I don't really know. I quickly checked but did not find any training data for them.