grobid
arXiv identifiers not extracted
Hi, I found that arXiv identifiers are not extracted from PDF files in most cases (both in the online version and in our local build). I have uploaded 2 example PDFs where this problem occurs. (I have seen only a very few cases where it succeeds.) Kindly let me know whether GROBID supports extraction of arXiv ids and, if not, whether there is any plan to support it in the near future.
1801.00857.pdf 1801.02609.pdf
Hello, arXiv identifiers are well supported by the bibliographical reference models (nearly 2000 annotated examples have been added with arXiv ids); however, the header model has not yet been updated to support them similarly. So an arXiv id present in the header of the PDF will in general not be extracted for the moment.
I plan to add that when I update the header model, which is long overdue... As it is quite a lot of work (the training data will need to be updated too), I need to find enough free time to launch the effort. It should be done in the first half of this year, so rather mid-term future.
Thank you Sir for your response and for meticulously maintaining such a complex tool for the community. We hope to see the updated header model soon!
I find that regex extraction works well for well-defined identifiers such as arXiv ids and DOIs, rather than training a model to detect them.
For modern arXiv ids, something like Pattern.compile("(?i)arXiv:\\d{4}[.]\\d{4,5}(v\\d+)?") works well.
getAllBlocksClean() in the Grobid Document class will give you the raw text of the document. You can then restrict the arXiv id search to the first few lines of the document to reduce the risk of picking up an arXiv id from the bibliography.
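A minimal sketch of that idea, assuming the raw document text is already available as a plain String (how it is obtained from GROBID is left out here). The class name ArxivIdFinder, the method findInFirstLines, and the maxLines cutoff are hypothetical names chosen for illustration; only the pattern comes from the comment above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of GROBID.
final class ArxivIdFinder {
    // Pattern for modern arXiv ids with an optional version suffix, as quoted above.
    private static final Pattern ARXIV_ID =
            Pattern.compile("(?i)arXiv:\\d{4}[.]\\d{4,5}(v\\d+)?");

    /**
     * Looks for an arXiv id in the first maxLines lines of rawText only,
     * so that ids appearing later (e.g. in the bibliography) are ignored.
     */
    static Optional<String> findInFirstLines(String rawText, int maxLines) {
        String[] lines = rawText.split("\\R"); // split on any line break
        int limit = Math.min(maxLines, lines.length);
        for (int i = 0; i < limit; i++) {
            Matcher m = ARXIV_ID.matcher(lines[i]);
            if (m.find()) {
                return Optional.of(m.group());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String text = "Some title\narXiv:1801.00857v1 [stat.ML] 2 Jan 2018\nAbstract";
        System.out.println(findInFirstLines(text, 20).orElse("(none)"));
    }
}
```

The cutoff is a trade-off: too small and a watermark placed further down the first page is missed, too large and ids from the reference list may be picked up.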
hello @philgooch !
DOI detection has worked exactly like this in the header for something like 4-5 years.
This works very well for highly discriminative identifiers indeed.
The advantage of integrating these regexes into the ML model as features, as in the citation parser (there is no ML model trained specifically for any type of identifier in GROBID), is that it also helps the prediction of the surrounding structures, makes the process more robust to PDF noise, and could work for more ambiguous identifiers like ISSN, so it's more general. For citations and arXiv ids, this approach performed better than using a regex on the citation string independently from the CRF (but not a lot better either, if I remember well...). For the header, we will see!
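To illustrate what "regex as a feature" can mean in general, here is a rough sketch that turns a regex match into a per-token flag that a sequence labeller such as a CRF could consume alongside its other features. This is not GROBID's actual feature layout: the class ArxivTokenFeature, the method featureRows, and the ARXIV1/ARXIV0 labels are all hypothetical, and whitespace-level tokens are assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration, not GROBID code.
final class ArxivTokenFeature {
    private static final Pattern ARXIV_ID =
            Pattern.compile("(?i)arXiv:\\d{4}[.]\\d{4,5}(v\\d+)?");

    /**
     * Returns one row per token: the token followed by a binary flag
     * telling the model whether the token contains an arXiv id.
     */
    static List<String> featureRows(List<String> tokens) {
        List<String> rows = new ArrayList<>();
        for (String token : tokens) {
            boolean hasArxivId = ARXIV_ID.matcher(token).find();
            rows.add(token + " " + (hasArxivId ? "ARXIV1" : "ARXIV0"));
        }
        return rows;
    }
}
```

The model then decides the final label from this flag together with the neighbouring tokens, instead of the regex deciding alone.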
arXiv ids are now well supported both for citations (for quite a long time) and in the metadata header. For the original example PDFs of this issue:
1801.00857.pdf ->
...
<monogr>
<imprint>
<date type="published" when="2018-01-02">2 Jan 2018</date>
</imprint>
</monogr>
<idno type="arXiv">arXiv:1801.00857v1[stat.ML]</idno>
</biblStruct>
1801.02609.pdf ->
<monogr>
<imprint>
<date type="published" when="2018-01-08">8 Jan 2018</date>
</imprint>
</monogr>
<idno type="arXiv">arXiv:1801.02609v1[cs.NI]</idno>
</biblStruct>
The case https://arxiv.org/pdf/2004.07180.pdf does not seem to work.
@elonzh I'm not sure which part you meant, the header or the citations. I tried with GROBID 0.7.0+.
The grey text watermark does get segmented incorrectly. Instead of being part of the header, it ends up as a figure in the body:
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0">
<head></head>
<label></label>
<figDesc>2 SPECTER: Scientific Paper Embeddings using Citationinformed TransformERs arXiv:2004.07180v4 [cs.CL] 20 May 2020</figDesc>
</figure>
Separately, a number of the citations in the paper have what look to me like strangely formatted references to arxiv.org pre-prints. They have abs/ in front of the identifier, which is not a valid section and also mixes the "old" and "new" identifier styles. GROBID's behavior in this corner case seems reasonable to me. Here are some snipped examples:
<note type="raw_reference">Erik Holmer and Andreas Marfurt. 2018. Explaining away syntactic structure in semantic document rep- resentations. ArXiv, abs/1806.01620.</note>
<idno>abs/1806.01620</idno>
<note type="raw_reference">Chanwoo Jeong, Sion Jang, Hyuna Shin, Eun- jeong Lucy Park, and Sungchul Choi. 2019. A context-aware citation recommendation model with bert and graph convolutional networks. ArXiv, abs/1903.06464.</note>
<idno>abs/1903.06464</idno>
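If a downstream consumer nevertheless wants to treat such values as arXiv ids, one option is a small post-processing step on the extracted idno text. The following is only a sketch under that assumption: AbsPrefixNormalizer and toArxivId are hypothetical names, nothing here is GROBID behavior, and the "arXiv:" output form is just one possible convention.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-processing helper, not part of GROBID.
final class AbsPrefixNormalizer {
    // "abs/" followed by a new-style arXiv number, optionally versioned.
    private static final Pattern ABS_STYLE =
            Pattern.compile("abs/(\\d{4}\\.\\d{4,5}(v\\d+)?)");

    /** Maps e.g. "abs/1806.01620" to "arXiv:1806.01620"; anything else yields empty. */
    static Optional<String> toArxivId(String idno) {
        Matcher m = ABS_STYLE.matcher(idno.trim());
        return m.matches() ? Optional.of("arXiv:" + m.group(1)) : Optional.empty();
    }
}
```

Requiring a full match keeps the step conservative: values that merely contain abs/ somewhere inside a longer string are left alone.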