Feature Request: general back section (section)
It only occurs to me now that there doesn't seem to be a generic "back section" section.
The Annotation guidelines for the 'segmentation' model do not mention any back section. The existing training data wraps elements in a back section, probably to keep the general TEI structure.
It does support the following specific back section elements:
- listBibl
- annex
- acknowledgment
Out of those, acknowledgment is probably a sort of general back section (section).
But there could be others, e.g. relating to:
- funding
- competing interests
- author contributions
It would be good to be able to just extract general back sections (with title and paragraph(s)).
(Then acknowledgment could just be a special case of that)
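To illustrate the idea, here is a minimal sketch of what a generic back section with a title and paragraphs might look like as a data structure, with acknowledgment as just one possible type. The `BackSection` class, the `SECTION_TYPE_KEYWORDS` mapping and the type names are hypothetical and not part of GROBID; the keyword matching on the title is only an assumption about how a type could be derived.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical mapping from title keywords to section types (not a GROBID API).
SECTION_TYPE_KEYWORDS = {
    "acknowledg": "acknowledgment",
    "funding": "funding",
    "competing interest": "competing-interests",
    "author contribution": "author-contributions",
}


@dataclass
class BackSection:
    """A generic back section: a title plus one or more paragraphs."""

    title: str
    paragraphs: List[str] = field(default_factory=list)

    @property
    def section_type(self) -> str:
        # Derive a type from the title; fall back to "generic" if nothing matches.
        lowered = self.title.lower()
        for keyword, section_type in SECTION_TYPE_KEYWORDS.items():
            if keyword in lowered:
                return section_type
        return "generic"
```

With this shape, a section titled "Data availability" would still be extracted, just with the type "generic".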
Hi @kermitt2 in https://github.com/kermitt2/grobid/issues/652#issuecomment-716099725 you suggested using annex for the funding section. Would you suggest using that for all back sections?
Hi @de-code ! Yes the approach is to use annex for all these "back matters" sections and when enough training data is available, to add a distinct label to recognize explicitly the type of section (acknowledgment was annex too if I remember well at some point).
I was not a fan of the "front" and "back" stuff that we find in many XML article encodings (in TEI too), because it conveys the idea of layout position/presentation criteria. So I used header and annex instead.
Okay, thank you for that.
In that case, how do you differentiate between those "back matters" sections and the appendix? (I somehow thought annex was for the appendix, app-group / app in JATS)
Or do you just see the appendix as one of those "back matters" section?
Well, this is for the training data; when sections are recognized explicitly, they fall at the right place in the TEI result. When we started creating training data, the goal was to have something simplified to make the manual annotation easier, and then to refine the annotation over time as we have more data and are able to consider more ML labels. This is why back matters and annexes (in the sense of appendix/supplement) were in the same pot at the beginning.
I have a similar question relating to figures and tables. The annotation guideline specifies that they should be part of the body. But there are figures or tables that belong to the back section / appendix (e.g. in DOI: 10.1101/188706). Can they be annotated as annex in that case?
Relating to my last question, there seems to be a problem (or I may be misunderstanding the guideline).
For example, where we have a Figure legends or Supplemental data section title, that is followed by figure information.
I am now annotating the section title as annex (as it's part of the back section), but the figure as the body.
In my case the model is then learning that, but GROBID probably thinks that the section is empty and doesn't include it in the response.
For the segmentation model, figures and tables are normally annotated in the "zone" where they belong (where they are primarily referenced), which is mentioned here -> https://grobid.readthedocs.io/en/latest/training/segmentation/#tables-and-figures. So, for instance, in the header if we have a figure as part of the abstract, or in an annex if they are part of it.
Maybe the guidelines are not drafted clearly enough, because the general rule - figure/table in the body - is emphasized too much? For preprint/submission formats, it's frequent that all the figures appear at the very end of the article (sometimes separated from their captions). In this case they should be labelled as "body", as they are usually figures/tables for the body part, although they come after the bibliographical section and annex for formatting reasons.
Okay, maybe I have misinterpreted the general rule as the overriding rule. It is also a good point that preprint / submissions may locate figure descriptions at the end while the figures would otherwise belong to the body.
Perhaps we could say that figures and tables belong to where they are referenced first? i.e. if a figure is referenced from a body section, then it belongs to the body. But if it is only referenced from a back section, then it belongs there?
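A rough sketch of that "referenced first" rule, only to make the proposal concrete. The function name `assign_figure_zones`, the `(zone, text)` input shape and the fallback to "body" for unreferenced figures are all assumptions for illustration; this is not how GROBID assigns zones.

```python
import re
from typing import Dict, List, Tuple


def assign_figure_zones(
    sections: List[Tuple[str, str]],
    figure_ids: List[str],
    default_zone: str = "body",
) -> Dict[str, str]:
    """Assign each figure to the zone of the first section that references it.

    sections: (zone, text) pairs in reading order, e.g. ("body", "...").
    figure_ids: labels as they appear in the text, e.g. "Figure 1".
    default_zone: used when a figure is never referenced (an assumption).
    """
    zones: Dict[str, str] = {}
    for figure_id in figure_ids:
        # Word boundaries stop "Figure 1" from matching "Figure 10".
        pattern = re.compile(r"\b" + re.escape(figure_id) + r"\b")
        zones[figure_id] = default_zone
        for zone, text in sections:
            if pattern.search(text):
                zones[figure_id] = zone
                break
    return zones
```

For example, with a body section that mentions "Figure 1" and an annex section that mentions "Figure S1", `assign_figure_zones` would place Figure 1 in the body and Figure S1 in the annex.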
172379v1 (DOI: 10.1101/172379) is an example (from the bioRxiv 10k validation set) that I am actually not so sure about.
It has a Figure legends section with Figure 1 etc.
But it also has a Supplemental data section with Figure S1 etc.
From that, it would appear that Figure S1 should belong to the back section.
Although it is referenced by a body section.
(There may also be the question of whether it makes sense to extract section titles like Figure legends.)