Feature Request: general back section (section)

Open de-code opened this issue 4 years ago • 8 comments

It only occurs to me now that there doesn't seem to be a generic section within the back section.

The Annotation guidelines for the 'segmentation' model do not mention any back section. The existing training data wraps elements in a back section, probably to keep the general TEI structure.

It does support the following specific back section elements:

  • listBibl
  • annex
  • acknowledgment

Out of those, acknowledgment is probably a sort of general back section (section). But there could be others, e.g. relating to:

  • funding
  • competing interests
  • author contributions

It would be good to be able to just extract general back sections (each with a title and one or more paragraphs). Acknowledgment could then just be a special case of that.
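As a rough illustration of what "extracting general back sections" could mean on the consumer side, here is a minimal sketch in Python. It assumes a TEI result in which each back section is a `div` with an optional `head` and one or more `p` children directly under `back`; the sample document, the `type` values, and the function name `extract_back_sections` are hypothetical and not taken from GROBID's actual output.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical TEI fragment, made up for illustration only.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body><div><head>Introduction</head><p>Body text.</p></div></body>
    <back>
      <div type="acknowledgement"><head>Acknowledgements</head><p>We thank our colleagues.</p></div>
      <div type="funding"><head>Funding</head><p>Supported by grant X.</p><p>No other support.</p></div>
    </back>
  </text>
</TEI>"""


def extract_back_sections(tei_xml):
    """Return one dict per div found directly under the TEI back element."""
    root = ET.fromstring(tei_xml)
    sections = []
    for div in root.findall(".//tei:back/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        # A section may have no title, so fall back to None.
        title = "".join(head.itertext()).strip() if head is not None else None
        paragraphs = [
            "".join(p.itertext()).strip() for p in div.findall("tei:p", TEI_NS)
        ]
        sections.append(
            {"type": div.get("type"), "title": title, "paragraphs": paragraphs}
        )
    return sections


for section in extract_back_sections(SAMPLE_TEI):
    print(section["type"], section["title"], len(section["paragraphs"]))
```

With a generic representation like this, acknowledgment, funding, competing interests and author contributions would all be handled by the same code path and differ only in the section type or title.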

de-code avatar Jan 22 '21 20:01 de-code

Hi @kermitt2, in https://github.com/kermitt2/grobid/issues/652#issuecomment-716099725 you suggested using annex for the funding section. Would you suggest using that for all back sections?

de-code avatar Jan 22 '21 20:01 de-code

Hi @de-code! Yes, the approach is to use annex for all these "back matter" sections and, when enough training data is available, to add a distinct label to recognize the type of section explicitly (if I remember correctly, acknowledgment was annex too at some point). I was not a fan of the "front" and "back" naming that we find in many XML article encodings (in TEI too), because it conveys the idea of layout position/presentation criteria. So I used header and annex instead.

kermitt2 avatar Jan 22 '21 21:01 kermitt2

Okay, thank you for that.

In that case, how do you differentiate between those "back matter" sections and the appendix? (I somehow thought annex was for the appendix, app-group / app in JATS.) Or do you just see the appendix as one of those "back matter" sections?

de-code avatar Jan 22 '21 22:01 de-code

Well, this is for the training data; when sections are recognized explicitly, they fall in the right place in the TEI result. When we started creating training data, the goal was to have something simplified to make manual annotation easier, and then to refine the annotation over time as we get more data and are able to consider more ML labels. That is why back matter and annexes (in the sense of appendix/supplement) were in the same pot at the beginning.

kermitt2 avatar Jan 22 '21 22:01 kermitt2

I have a similar question relating to figures and tables. The annotation guideline specifies that they should be part of the body. But there are figures or tables that belong to the back section / appendix (e.g. in DOI: 10.1101/188706). Can they be annotated as annex in that case?

de-code avatar Jan 29 '21 21:01 de-code

Relating to my last question, there seems to be a problem (or I may be misunderstanding the guideline). Take, for example, a Figure legends or Supplemental data section title that is followed by figure information. I am now annotating the section title as annex (as it's part of the back section), but the figure as body. In my case the model then learns that, but GROBID probably thinks that the section is empty and doesn't include it in the response.

de-code avatar Feb 12 '21 20:02 de-code

For the segmentation model, figures and tables normally go in the "zone" where they belong (where they are primarily referenced), which is mentioned here -> https://grobid.readthedocs.io/en/latest/training/segmentation/#tables-and-figures. So, for instance, in the header if we have a figure as part of the abstract, or in an annex if they are part of it.

Maybe the guidelines are not drafted clearly enough, because the general rule (figure/table in the body) is emphasized too much? In preprint/submission formats, all the figures frequently appear at the very end of the article (sometimes separated from their captions). In this case they should be labelled as "body", as they are usually figures/tables for the body part, even though they come after the bibliographical section and annex for formatting reasons.

kermitt2 avatar Feb 13 '21 03:02 kermitt2

Okay, maybe I have misinterpreted the general rule as the overriding rule. It is also a good point that preprint / submissions may locate figure descriptions at the end while the figures would otherwise belong to the body.

Perhaps we could say that figures and tables belong where they are referenced first? I.e. if a figure is referenced from a body section, then it belongs to the body. But if it is only referenced from a back section, then it belongs there?
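To make the proposed rule concrete, here is a minimal sketch in Python. It is only an illustration of the "first reference wins" idea, not anything GROBID implements; the function name `assign_figure_zones`, the input format, and the choice of "body" as the fallback for unreferenced figures are all assumptions.

```python
def assign_figure_zones(figure_ids, references, default_zone="body"):
    """Assign each figure to the zone of its first reference.

    figure_ids: figure/table identifiers found in the document.
    references: (zone, figure_id) pairs in reading order.
    default_zone: assumed fallback for figures that are never referenced.
    """
    first_zone = {}
    for zone, figure_id in references:
        # setdefault keeps the zone of the earliest reference only.
        first_zone.setdefault(figure_id, zone)
    return {fid: first_zone.get(fid, default_zone) for fid in figure_ids}


# Hypothetical document: Figure S1 is cited from the body first,
# Figure S2 only from the annex, and Figure 3 is never cited.
figures = ["Figure 1", "Figure S1", "Figure S2", "Figure 3"]
references = [
    ("body", "Figure 1"),
    ("body", "Figure S1"),
    ("annex", "Figure S1"),
    ("annex", "Figure S2"),
]
zones = assign_figure_zones(figures, references)
print(zones)
```

Under this rule a supplementary figure that is also cited from the body ends up in the body, which may or may not be what an annotator would want.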

172379v1 (DOI: 10.1101/172379) is an example (from the bioRxiv 10k validation set) that I am actually not so sure about. It has a Figure legends section with Figure 1 etc. But it also has a Supplemental data section with Figure S1 etc. From that, it would appear that Figure S1 should belong to the back section, although it is referenced by a body section.

(There may also be the question of whether it makes sense to extract section titles like Figure legends.)

de-code avatar Feb 15 '21 10:02 de-code