Annotating formulas, listings and figures

Open Schroedi opened this issue 10 months ago • 4 comments

Hi, thanks for your awesome work!

I have some annotation questions:

  1. Formula labeling: https://grobid.readthedocs.io/en/latest/training/fulltext/#formulas advises not to include the brackets in the label. The training data includes them, though. One of several samples: https://github.com/kermitt2/grobid/blob/be9e6523d71518544e1394f5be56bda0e55819ef/grobid-trainer/resources/dataset/shorttext/corpus/tei/submission_106.training.shorttext.tei.xml#L10C177-L10C177

  2. Listings: How should I annotate listings like Algorithm 1 in [1]? Are they figures? If so, what would be the label?

<figure>
  <head>Algorithm </head>
  <label>1</label>
  <figDesc>Online fitting of E from events and images<lb/></figDesc>
</figure>
  3. Figures: I assume I should add missing figures to the figure.tei.xml file? They should probably follow the order in which they appear in the fulltext?
The following is obsolete: I found the `trash` tag in the training data

Should they contain all text+tags from the fulltext and additionally annotate the relevant parts (head, label, figDesc)? Here is an example from [1] again:

```xml
Random Saccades 50 100 150 200 250 300 160 180 200 220 240 260 280 300 Smooth Pursuit 140 160 180 200 220 240 260 0 50 100 150 200 250 300 Pixel Coordinate Pixel Coordinate Pupil in Camera Space Gaze Point in Screen Space Gaze Point in Screen Space 20°6 3°2 0°6 3°4 0°9 5°9 5°9 5°P ixel Coordinate Pixel Coordinate Fig. . Fitted pupil locations and gaze point estimates for smooth pursuit motion and random saccadic motion are shown for four different users in different colors. The figure is organized into grids; the first row plots smooth pursuit data and the second row plots random saccadic data.
```

Should I keep the first part or remove it?

[1] arXiv:2004.03577v3

Schroedi avatar Apr 18 '24 13:04 Schroedi

@Schroedi regarding point 1, the documentation refers to the fulltext model, so you should check the data under grobid-trainer/resources/dataset/fulltext. The annotations there should all follow the guidelines.

The example you linked belongs to another model, and I'm not sure it's even used (its last update was in 2017 😅).
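
If it helps to verify this across the corpus, here is a minimal sketch (not part of GROBID) that lists formula labels still containing brackets. It assumes the labels are written as `<label>` elements inside `<formula>` elements and that the training files end in `.tei.xml`; the function names are made up for illustration.

```python
import re
from pathlib import Path

# Assumption: labels appear as <label>...</label> inside <formula>...</formula>.
FORMULA_RE = re.compile(r"<formula\b[^>]*>(.*?)</formula>", re.DOTALL)
LABEL_RE = re.compile(r"<label\b[^>]*>(.*?)</label>", re.DOTALL)
BRACKETS = set("()[]")


def bracketed_formula_labels(tei_text):
    """Return the formula label strings that contain a bracket character."""
    hits = []
    for formula in FORMULA_RE.finditer(tei_text):
        for label in LABEL_RE.finditer(formula.group(1)):
            text = label.group(1).strip()
            if BRACKETS & set(text):
                hits.append(text)
    return hits


def scan_corpus(corpus_dir):
    """Map each *.tei.xml file under corpus_dir to its bracketed formula labels."""
    report = {}
    for path in sorted(Path(corpus_dir).rglob("*.tei.xml")):
        hits = bracketed_formula_labels(path.read_text(encoding="utf-8"))
        if hits:
            report[str(path)] = hits
    return report
```

Calling `scan_corpus("grobid-trainer/resources/dataset/fulltext")` from the repository root would then return only the files that deviate from the guideline, if any.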

lfoppiano avatar Apr 26 '24 07:04 lfoppiano

Regarding point 3, if figures are missing from figure.tei.xml, it means that one of the upstream models, either the segmentation or the fulltext model, tagged something wrongly.

In this case you should examine the training data generated by the upstream models. See Fig. 2 in https://grobid.readthedocs.io/en/latest/Principles/ for more information on what is upstream and downstream.

I recommend working in batches of documents and checking one model's data across the whole batch before moving to the next model. It usually takes time to get familiar with each model's structure, so staying on the same model before moving on may be more efficient. It's just a recommendation, though.

I'll do my best to explain what I have been doing; feel free to point out the unclear parts. 😅

  1. First check the generated data for the segmentation model:
    1. If there are corrections, keep the corrected file.
    2. If the model output is already good, ignore it.
  2. Then move to the fulltext model's generated files. There are three possibilities:
    1. The segmentation model did not lose data, so the body part of the article is complete. You can correct the file.
    2. The segmentation model mislabeled a substantial part of the document and this part is missing. You should ignore the file for the time being, until the segmentation model is retrained with the current document's segmentation training file.
    3. The segmentation model mislabeled a substantial part of the document in the sense that more data is available than there should be. You can remove the surplus and correct the rest of the file.
  3. Repeat point 2 for the next downstream model (e.g. the figure model).
  4. After you have finished a batch of documents, you can retrain the segmentation model and regenerate the training data for the documents whose fulltext files were missing data (point 2.2).

Like the training process itself, this procedure can be carried out iteratively. Let me know if any points are unclear.
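
To make the decision logic concrete, here is a rough sketch of the triage above as code. Nothing here is a GROBID API: the `Action` names and the functions are hypothetical, and giving "missing data" priority over "surplus data" when both apply is my assumption.

```python
from enum import Enum


class Action(Enum):
    KEEP_CORRECTED = "keep the corrected file"
    IGNORE = "ignore, the model output is already good"
    CORRECT = "correct the file"
    DEFER = "set aside until the upstream model is retrained"
    TRIM_THEN_CORRECT = "remove the surplus, then correct the rest"


def triage_segmentation(needs_corrections):
    """Point 1: decide what to do with a segmentation training file."""
    return Action.KEEP_CORRECTED if needs_corrections else Action.IGNORE


def triage_downstream(upstream_lost_content, upstream_added_surplus):
    """Points 2 and 3: decide what to do with a fulltext (or figure) file."""
    if upstream_lost_content:
        # Point 2.2. Assumption: this case wins if both flags are set.
        return Action.DEFER
    if upstream_added_surplus:
        return Action.TRIM_THEN_CORRECT  # point 2.3
    return Action.CORRECT  # point 2.1


def documents_to_regenerate(batch):
    """Point 4: ids of documents to regenerate after retraining upstream."""
    return sorted(doc for doc, action in batch.items() if action is Action.DEFER)
```

For example, `triage_downstream(True, False)` gives `Action.DEFER`, and those deferred documents are the ones `documents_to_regenerate` returns at the end of a batch.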

lfoppiano avatar Apr 26 '24 08:04 lfoppiano

Thank you for taking the time and for your detailed answer! It really helped me.

> @Schroedi regarding point 1, the documentation refers to the fulltext model, so you should check the data under grobid-trainer/resources/dataset/fulltext. The annotations there should all follow the guidelines.
>
> The example you linked belongs to another model, and I'm not sure it's even used (its last update was in 2017 😅).

You're right. I think this one should be fixed though: https://github.com/kermitt2/grobid/pull/1107

The last open point is 2, listings. Is there any special handling, or should they just be figures?

Schroedi avatar Apr 29 '24 14:04 Schroedi

Thanks for PR #1107. We might merge it at the next iteration on the models (which might happen in a few months) so that we don't forget about it.

For the listings, I don't really know. I quickly checked but did not find any training data.

lfoppiano avatar Apr 29 '24 23:04 lfoppiano