GROBID makes mistakes when a reference contains square brackets.
Paper:
Reference:
TEI Result:
...
<biblStruct coords="13,326.72,602.79,229.86,5.73;13,326.72,610.79,230.88,5.73" xml:id="b160">
<monogr>
<title level="m" type="main">Click" reactions for the N-terminal and side-chain functionalization of peptides with</title>
<author>
<persName coords=""><forename type="first">H</forename><surname>Pfeiffer</surname></persName>
</author>
<author>
<persName coords=""><forename type="first">A</forename><surname>Rojas</surname></persName>
</author>
<author>
<persName coords=""><forename type="first">J</forename><surname>Niesel</surname></persName>
</author>
<author>
<persName coords=""><forename type="first">U</forename><surname>Schatzschneider</surname></persName>
</author>
<author>
<persName coords=""><surname>Sonogashira</surname></persName>
</author>
<imprint>
<pubPlace>Mn(CO)</pubPlace>
</imprint>
</monogr>
<note type="raw_reference">H. Pfeiffer, A. Rojas, J. Niesel, U. Schatzschneider, Sonogashira and "Click" reac- tions for the N-terminal and side-chain functionalization of peptides with [Mn(CO)</note>
</biblStruct>
<biblStruct coords="13,350.04,618.72,207.57,5.73;13,326.72,626.72,74.63,5.73" xml:id="b161">
<monogr>
<title level="m" type="main">+-based CO releasing molecules (tpm = tris(pyrazolyl)methane)</title>
<imprint>
<date type="published" when="2009" />
<biblScope unit="page" from="4292" to="4298" />
<pubPlace>Dalton Trans</pubPlace>
</imprint>
</monogr>
<note type="raw_reference">+-based CO releasing molecules (tpm = tris(pyrazolyl)methane), Dalton Trans. (2009) 4292-4298.</note>
</biblStruct>
...
Description:
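It seems GROBID makes a mistake when a reference contains square brackets: the reference is cut at the opening bracket and the remainder is extracted as a separate entry.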
Thanks a lot @elonzh !
I will try to fix it with targeted training cases.
Can the reference model extract reference sequence numbers? I think the numbers would help make the reference results more robust in such a situation.
It does: the number label in brackets is extracted as the reference number. However, bracketed text can also appear as ordinary content in other papers. I think the problem is rather that the model is ignorant of chemistry notations, which it should have seen as negative examples.
"the number label in bracket is extracted as reference number"
I didn't find number labels in the TEI results or in the source code for BiblioItem. Do you mean xml:id? It's generated.
Labels in the reference section are extracted, but not output in the TEI file because, like the in-text reference markers, they are a mere choice of presentation style (e.g. like a LaTeX \bibliographystyle{} choice).
They are used to match in-text reference markers and full reference entries in the bibliographical section.
They can be accessed in the LabelReferenceResult objects produced by the ReferenceSegmenter parser, see https://github.com/kermitt2/grobid/blob/master/grobid-core/src/main/java/org/grobid/core/engines/ReferenceSegmenterParser.java#L60
These labels are then stored in the BibDataSet object, which holds context information (the original reference string with its coordinates, its label) beyond BiblioItem.
From BibDataSet, the labels are used by the ReferenceMarkerMatcher class to create a LuceneIndexMatcher for labels, which is used to associate in-text reference markers with full references.
In the end, the particular label choice (number, author+year, index, ...) is still available in the text (in the brackets), but it is not used for linking the XML elements; for that we use the clean generated xml:id that you mention.
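To illustrate the linking step described above, here is a minimal sketch in Python. It is not GROBID's implementation: GROBID does this in Java with ReferenceMarkerMatcher and a LuceneIndexMatcher, while this sketch swaps in a plain dictionary lookup. The function names (normalize_label, build_label_index, resolve_marker) and the sample ids and labels are hypothetical.

```python
import re


def normalize_label(label: str) -> str:
    """Drop brackets, parentheses, dots and whitespace so '[12]' and '12.' compare equal."""
    return re.sub(r"[\[\]\(\)\.\s]", "", label).lower()


def build_label_index(references):
    """Map each normalized label to the generated id of its reference.

    `references` is a list of (generated_id, label) pairs; entries without
    a label are skipped.
    """
    return {normalize_label(label): ref_id for ref_id, label in references if label}


def resolve_marker(marker: str, index: dict):
    """Return the generated id for an in-text marker, or None if no label matches."""
    return index.get(normalize_label(marker))


# Hypothetical sample data: generated ids paired with labels from the reference section.
references = [("b0", "[1]"), ("b1", "[2]"), ("b2", "[3]")]
index = build_label_index(references)
```

With this data, resolve_marker("[2]", index) gives the generated id "b1", and a marker with no matching label gives None. The label itself never needs to appear in the output; only the generated id is used for the link.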
Thanks for your explanation! I am curious whether there is a way to detect a wrong reference list using some simple patterns.
For example:
- all references should be cited in the paper body, and vice versa.
- reference labels, if present, must appear at the start of a line.
- reference labels should all follow the same pattern, such as [<number>].
The ML model may produce some silly output, but if we can detect obvious errors and mark them in the TEI results, retraining could become easier and the results more predictable.
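The pattern checks suggested above could be sketched roughly as follows. This is only an illustration under assumptions, not something GROBID provides: it assumes we have the segmented reference strings with their leading labels still attached, plus the sets of reference ids and cited ids. All function names and the sample strings are hypothetical.

```python
import re
from collections import Counter

# Leading labels of the form [12], (12) or 12. at the start of the string.
LABEL_RE = re.compile(r"^\s*(\[\d+\]|\(\d+\)|\d+\.)")


def label_shape(raw: str):
    """Return the label pattern with digits abstracted ('[N]', '(N)', 'N.'), or None."""
    match = LABEL_RE.match(raw)
    if match is None:
        return None
    return re.sub(r"\d+", "N", match.group(1))


def flag_inconsistent_labels(raw_references):
    """Return indices of references whose label shape differs from the dominant one."""
    shapes = [label_shape(raw) for raw in raw_references]
    counts = Counter(shape for shape in shapes if shape is not None)
    if not counts:
        return []  # unlabeled style (e.g. author+year): nothing to check
    dominant = counts.most_common(1)[0][0]
    return [i for i, shape in enumerate(shapes) if shape != dominant]


def citation_mismatches(reference_ids, cited_ids):
    """Return (references never cited, citations without a reference)."""
    refs, cited = set(reference_ids), set(cited_ids)
    return sorted(refs - cited), sorted(cited - refs)


# Hypothetical input: the middle entry is a fragment split off from the first one.
raw_references = [
    "[1] First author, first title [Mn(CO)",
    "+-based CO releasing molecules, Dalton Trans. (2009) 4292-4298.",
    "[2] Second author, second title, 2010.",
]
```

Here flag_inconsistent_labels(raw_references) flags index 1, the fragment that lacks the [<number>] label used by its neighbours. Such a flag would only mark an entry as suspicious; it cannot say how the entry should be repaired.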
Another error case:
Paper:
Reference:
TEI Result:
<biblStruct coords="6,333.39,154.24,224.81,3.18;6,333.39,160.78,224.81,6.23" xml:id="b5">
<monogr>
<author>
<persName coords="">
<forename type="first">P K</forename>
<surname>Srijith</surname>
</persName>
</author>
<title level="m">Longitudinal Modeling of Social Media with Hawkes Process based on Users and Networks. ASONAM '17: Proceedings of the 2017</title>
<imprint>
<date type="published" when="2017" />
</imprint>
</monogr>
<note type="raw_reference">P K Srijith et al. 2017. Longitudinal Modeling of Social Media with Hawkes Process based on Users and Networks. ASONAM '17: Proceedings of the 2017</note>
</biblStruct>
<biblStruct coords="6,333.39,168.75,224.81,6.23;6,333.39,176.72,40.69,6.23" xml:id="b6">
<monogr>
<title level="m">IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining</title>
<imprint>
<date type="published" when="2017" />
</imprint>
</monogr>
<note type="raw_reference">IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (2017).</note>
</biblStruct>
Hi @elonzh !
Unfortunately, the two articles given as examples cannot be used as training data: the first one is CC BY but non-derivative, and the second one is closed access.
If you have the chance to find similar errors in one or two CC-BY articles, don't hesitate to reference them here :)
Another error case:
Paper:
[math/0506081] The Dantzig selector: Statistical estimation when $p$ is much larger than $n$
Reference:
I detect reference errors by clustering alignments. The algorithm is context-free and works well if the page is well formatted.
Maybe we can integrate the algorithm into Grobid?
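The algorithm itself is not shown in this thread, so the following is only a guess at what an alignment-based check could look like, not the commenter's method. It assumes each coords box in the TEI has the form page,x,y,width,height, takes the left x position of the first box of every reference, groups nearby values, and flags references that fall into a small cluster. The function names, the tolerance and the min_share threshold are hypothetical.

```python
def first_box_left_x(coords: str) -> float:
    """Left x of the first box, assuming 'page,x,y,width,height;...' boxes."""
    first_box = coords.split(";")[0]
    _page, x, _y, _width, _height = first_box.split(",")
    return float(x)


def cluster_values(values, tolerance=2.0):
    """Group indices of values whose sorted neighbours differ by at most `tolerance`."""
    clusters = []
    last_value = None
    for index, value in sorted(enumerate(values), key=lambda pair: pair[1]):
        if last_value is not None and value - last_value <= tolerance:
            clusters[-1].append(index)
        else:
            clusters.append([index])
        last_value = value
    return clusters


def flag_misaligned(coords_list, tolerance=2.0, min_share=0.2):
    """Return indices of references whose start position lies in a rare cluster."""
    xs = [first_box_left_x(coords) for coords in coords_list]
    flagged = []
    for members in cluster_values(xs, tolerance):
        if len(members) / len(xs) < min_share:
            flagged.extend(members)
    return sorted(flagged)


# The x values 326.72 and 350.04 come from b160 and b161 above;
# the other boxes are invented so that there is a majority to compare against.
coords_list = [
    "13,326.72,570.79,229.86,5.73",
    "13,326.72,586.79,229.86,5.73",
    "13,326.72,602.79,229.86,5.73;13,326.72,610.79,230.88,5.73",
    "13,350.04,618.72,207.57,5.73;13,326.72,626.72,74.63,5.73",
    "13,326.72,634.72,229.86,5.73",
    "13,326.72,650.72,229.86,5.73",
]
```

On this input, flag_misaligned(coords_list) flags index 3, the entry that starts at x = 350.04. A check like this would not catch the ASONAM case above, where both fragments start at the same x position, and a two-column layout would need clustering per page and column.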