
GROBID makes mistakes when a reference contains square brackets.

Open elonzh opened this issue 2 years ago • 9 comments

Paper:

Use of gasotransmitters for the controlled release of polymer-based nitric oxide carriers in medical applications - ScienceDirect

Reference:

image

TEI Result:

...
<biblStruct coords="13,326.72,602.79,229.86,5.73;13,326.72,610.79,230.88,5.73" xml:id="b160">
	<monogr>
		<title level="m" type="main">Click&quot; reactions for the N-terminal and side-chain functionalization of peptides with</title>
		<author>
			<persName coords=""><forename type="first">H</forename><surname>Pfeiffer</surname></persName>
		</author>
		<author>
			<persName coords=""><forename type="first">A</forename><surname>Rojas</surname></persName>
		</author>
		<author>
			<persName coords=""><forename type="first">J</forename><surname>Niesel</surname></persName>
		</author>
		<author>
			<persName coords=""><forename type="first">U</forename><surname>Schatzschneider</surname></persName>
		</author>
		<author>
			<persName coords=""><surname>Sonogashira</surname></persName>
		</author>
		<imprint>
			<pubPlace>Mn(CO)</pubPlace>
		</imprint>
	</monogr>
	<note type="raw_reference">H. Pfeiffer, A. Rojas, J. Niesel, U. Schatzschneider, Sonogashira and &quot;Click&quot; reac- tions for the N-terminal and side-chain functionalization of peptides with [Mn(CO)</note>
</biblStruct>
<biblStruct coords="13,350.04,618.72,207.57,5.73;13,326.72,626.72,74.63,5.73" xml:id="b161">
	<monogr>
		<title level="m" type="main">+-based CO releasing molecules (tpm = tris(pyrazolyl)methane)</title>
		<imprint>
			<date type="published" when="2009" />
			<biblScope unit="page" from="4292" to="4298" />
			<pubPlace>Dalton Trans</pubPlace>
		</imprint>
	</monogr>
	<note type="raw_reference">+-based CO releasing molecules (tpm = tris(pyrazolyl)methane), Dalton Trans. (2009) 4292-4298.</note>
</biblStruct>
...

Description:

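It seems GROBID makes mistakes when a reference contains square brackets. In the result above, a single reference is split into two entries (b160 and b161) right after the bracketed text "[Mn(CO)".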
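A minimal sketch of how such a split could be spotted after the fact, assuming the TEI output is available as a string: it flags every `biblStruct` whose `raw_reference` note has unbalanced brackets. The function names are hypothetical and this is not part of GROBID; it is only a post-processing check, and it can produce false positives on styles that use labels such as `1)`.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())


def local_name(element):
    """Strip a namespace prefix such as '{http://www.tei-c.org/ns/1.0}' from a tag."""
    return element.tag.rsplit("}", 1)[-1]


def has_unbalanced_brackets(text):
    """Return True if a bracket is never closed, or closed without a matching opener."""
    stack = []
    for ch in text:
        if ch in PAIRS:
            stack.append(PAIRS[ch])
        elif ch in CLOSERS:
            if not stack or stack.pop() != ch:
                return True
    return bool(stack)


def suspicious_references(tei_xml):
    """Return the xml:id of each biblStruct whose raw reference has unbalanced brackets."""
    root = ET.fromstring(tei_xml)
    flagged = []
    for bibl in root.iter():
        if local_name(bibl) != "biblStruct":
            continue
        for note in bibl.iter():
            if local_name(note) == "note" and note.get("type") == "raw_reference":
                if has_unbalanced_brackets("".join(note.itertext())):
                    flagged.append(bibl.get(XML_ID))
    return flagged
```

On the excerpt above this would flag b160, whose raw reference ends with the unclosed "[Mn(CO)". The continuation b161 is balanced on its own, so it would have to be caught by a different check.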

elonzh avatar Oct 27 '21 09:10 elonzh

Thanks a lot @elonzh !

I will try to fix it with targeted training cases.

kermitt2 avatar Nov 06 '21 15:11 kermitt2

Can the reference model extract reference sequence numbers? I think the numbers would help make the reference results more robust in situations like this.

elonzh avatar Nov 08 '21 16:11 elonzh

> Can the reference model extract reference sequence numbers? I think the numbers will help to make reference results more robust in such a situation?

It does: the number label in brackets is extracted as the reference number, but brackets can also enclose ordinary text in other papers. I think the problem is rather that the model has not seen chemistry notation as negative examples.

kermitt2 avatar Nov 08 '21 18:11 kermitt2

> the number label in bracket is extracted as reference number

I didn't find number labels in the TEI results or in the source code for BiblioItem.

Do you mean xml:id? It's generated.

elonzh avatar Nov 09 '21 02:11 elonzh

Labels in the reference section are extracted, but they are not output in the TEI file because, like the in-text reference markers, they are merely a choice of presentation style (comparable to a LaTeX \bibliographystyle{} choice).

They are used to match in-text reference markers and full reference entries in the bibliographical section.

They can be accessed in the LabelReferenceResult objects produced by the ReferenceSegmenter parser, see https://github.com/kermitt2/grobid/blob/master/grobid-core/src/main/java/org/grobid/core/engines/ReferenceSegmenterParser.java#L60

These labels are then stored in the BibDataSet object, which holds context information (the original reference string with its coordinates, its label) beyond BiblioItem.

From BibDataSet, the labels are used by the ReferenceMarkerMatcher class to create a LuceneIndexMatcher for labels, which associates in-text reference markers with full references.

In the end, the particular label choice (number, author+year, index, ...) is still available in-text (in the brackets), but it is not used for linking the XML elements. For that we use the clean generated xml:id that you mention.
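To illustrate the idea only (this is not GROBID's code, which relies on a Lucene-based matcher): a simplified Python sketch in which labels serve purely as lookup keys and the link target is a generated identifier. The function names, the exact-match lookup, and the `b<index>` id scheme are assumptions made for the example.

```python
def normalize(label):
    """Drop surrounding brackets, parentheses, dots and whitespace; ignore case."""
    return label.strip().strip("[]().").strip().lower()


def link_markers(entries, markers):
    """Map each in-text marker to a generated id, or None if no entry label matches.

    entries: list of (label, raw_reference) tuples in bibliography order.
    markers: list of in-text marker strings such as "[2]".
    """
    id_by_label = {}
    for index, (label, _raw_reference) in enumerate(entries):
        # The id depends only on the position, never on the label style.
        id_by_label[normalize(label)] = f"b{index}"
    return {marker: id_by_label.get(normalize(marker)) for marker in markers}


entries = [("[1]", "First reference."), ("[2]", "Second reference.")]
links = link_markers(entries, ["[2]", "[7]"])
```

Here `links["[2]"]` resolves to the generated id of the second entry, while `"[7]"` has no matching label and maps to `None`.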

kermitt2 avatar Nov 09 '21 19:11 kermitt2

Thanks for your explanation! I am curious whether there is a way to detect a wrongly extracted reference list using some patterns.

For example,

  • All references should be cited in the paper body, and vice versa.
  • Reference labels, if present, must be at the start of a line.
  • Reference labels should share the same pattern, such as [<number>].

The ML model may produce some silly output, but if we can detect obvious errors and mark them in the TEI results, maybe we can make retraining easier and the results more predictable.
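The three checks above could be sketched roughly as follows. This is only an illustration under stated assumptions: the input is a list of raw reference strings that still carry their labels (the TEI `raw_reference` notes shown earlier in this thread do not), the label regex covers just three common styles, and all names are hypothetical.

```python
import re
from collections import Counter

# Assumed label styles at the start of an entry: [12], (12) or 12.
LABEL_AT_START = re.compile(r"^\s*(\[\d+\]|\(\d+\)|\d+\.)")


def label_shape(label):
    """Reduce a label to its pattern, e.g. '[12]' becomes '[<number>]'."""
    return re.sub(r"\d+", "<number>", label)


def check_reference_list(raw_entries, cited_labels):
    """Return human-readable warnings for a segmented reference list."""
    labels = []
    for entry in raw_entries:
        match = LABEL_AT_START.match(entry)
        labels.append(match.group(1) if match else "")

    warnings = []
    shapes = Counter(label_shape(label) for label in labels if label)
    if shapes:  # only check labels when the list uses them at all
        dominant = shapes.most_common(1)[0][0]
        for index, label in enumerate(labels):
            if not label:
                warnings.append(f"entry {index}: no label at line start, possibly a split reference")
            elif label_shape(label) != dominant:
                warnings.append(f"entry {index}: label {label} does not follow {dominant}")

    present = {label for label in labels if label}
    for label in sorted(present - set(cited_labels)):
        warnings.append(f"{label}: listed but never cited in the body")
    for label in sorted(set(cited_labels) - present):
        warnings.append(f"{label}: cited in the body but missing from the list")
    return warnings
```

A split reference like the one reported in this issue would show up as an entry without a label, and a reference swallowed by its neighbour as a label that is cited but missing.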

elonzh avatar Nov 19 '21 04:11 elonzh

Another error case:

Paper:

A Graph Approach to Simulate Twitter Activities with Hawkes Processes | 2021 4th International Conference on Mathematics and Statistics

Reference:

image

TEI Result:

<biblStruct coords="6,333.39,154.24,224.81,3.18;6,333.39,160.78,224.81,6.23" xml:id="b5">
	<monogr>
		<author>
			<persName coords="">
				<forename type="first">P K</forename>
				<surname>Srijith</surname>
			</persName>
		</author>
		<title level="m">Longitudinal Modeling of Social Media with Hawkes Process based on Users and Networks. ASONAM &apos;17: Proceedings of the 2017</title>
		<imprint>
			<date type="published" when="2017" />
		</imprint>
	</monogr>
	<note type="raw_reference">P K Srijith et al. 2017. Longitudinal Modeling of Social Media with Hawkes Process based on Users and Networks. ASONAM &apos;17: Proceedings of the 2017</note>
</biblStruct>
<biblStruct coords="6,333.39,168.75,224.81,6.23;6,333.39,176.72,40.69,6.23" xml:id="b6">
	<monogr>
		<title level="m">IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining</title>
		<imprint>
			<date type="published" when="2017" />
		</imprint>
	</monogr>
	<note type="raw_reference">IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (2017).</note>
</biblStruct>

elonzh avatar Nov 25 '21 07:11 elonzh

Hi @elonzh !

Unfortunately, the two articles given as examples cannot be used as training data: the first one is CC BY but non-derivative, and the second one is closed access.

If you get the chance to find similar errors in one or two CC-BY articles, don't hesitate to reference them here :)

kermitt2 avatar Dec 19 '21 11:12 kermitt2

Another error case:

Paper:

[math/0506081] The Dantzig selector: Statistical estimation when $p$ is much larger than $n$

Reference:

image


I detect reference errors by clustering alignments. The algorithm is context-free and works well when the page is well formatted.

Maybe we could integrate the algorithm into GROBID?
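The algorithm itself is not shown in this thread, so the following is only one possible reading of "clustering alignments", not the actual implementation: cluster the left x-coordinate of each entry's first line and flag entries that fall outside any well-populated cluster. It assumes the `coords` attribute uses the `page,x,y,width,height` layout per box; the function names and the `tol` and `min_size` parameters are hypothetical.

```python
def cluster_1d(values, tol=2.0):
    """Group values into clusters; a gap larger than tol starts a new cluster."""
    clusters = []
    for value in sorted(values):
        if clusters and value - clusters[-1][-1] <= tol:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return clusters


def first_line_x(coords):
    """Left x of the first box in a coords string 'page,x,y,w,h;page,x,y,w,h'."""
    first_box = coords.split(";")[0]
    return float(first_box.split(",")[1])


def misaligned_entries(coords_by_id, tol=2.0, min_size=2):
    """Return ids whose first line starts at an x shared by fewer than min_size entries."""
    xs = {ref_id: first_line_x(coords) for ref_id, coords in coords_by_id.items()}
    outliers = set()
    for cluster in cluster_1d(xs.values(), tol):
        if len(cluster) < min_size:
            outliers.update(cluster)
    return [ref_id for ref_id, x in xs.items() if x in outliers]
```

In the first TEI result of this thread, b160 starts at x = 326.72 while b161 starts at x = 350.04, so a check of this kind would single out b161 once enough neighbouring entries share the 326.72 margin. In the second result, b5 and b6 both start at x = 333.39, so alignment alone would not catch that split.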

elonzh avatar Dec 27 '21 08:12 elonzh