
Using /api/processReferences with HTML input

Open keto33 opened this issue 1 year ago • 2 comments

Although the main objective of GROBID is text extraction from PDF documents, its reference parsing is quite powerful. It would be very useful if we could use it to parse references given in an HTML document directly.

Not only could it be used for web articles, blogs, etc., but most publishers also make the references of articles open while the full text is behind a paywall. Most of these reference lists do not include DOIs, so the references have to be parsed. This would widen the scope of GROBID, as one would not need access to the full-text PDF to parse the references.

processRawReference already exists in the batch mode and works with HTML input too.

keto33, May 01 '23 11:05

Sorry, my bad. /api/processCitationList has already been developed for this purpose :-)

keto33, May 01 '23 11:05

I ran some experiments with /api/processCitationList, and I still believe it would be beneficial to have /api/processReferences or a similar service accept HTML input.

Consider this article as a typical example. If we send the HTML page to /api/processCitationList, we get a 400 error. If we remove all HTML elements except the reference list to simplify the input, we get a 204 response (no content). Evidently, this is because /api/processCitationList is meant to parse raw reference strings rather than a structured document.
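For illustration, here is a minimal sketch of what a client currently has to do by hand: flatten the reference list to plain strings and send them as raw citations. It rests on several assumptions that are not confirmed in this thread: that each reference sits in its own `<li>` element, that the server runs at `http://localhost:8070`, and that /api/processCitationList accepts a repeated `citations` form field, as I understand the GROBID service documentation. The names `extract_references`, `build_citation_list_request`, and `parse_references` are hypothetical helpers, not part of GROBID.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request, urlopen


class ReferenceTextExtractor(HTMLParser):
    """Collect the plain text of each top-level <li> as one raw reference string.

    Assumption: the page marks up every reference as a list item.
    """

    def __init__(self):
        super().__init__()
        self.references = []
        self._depth = 0    # greater than 0 while inside an <li>
        self._chunks = []  # text fragments of the current reference

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "li" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Join the fragments and collapse runs of whitespace.
                text = " ".join("".join(self._chunks).split())
                if text:
                    self.references.append(text)
                self._chunks = []

    def handle_data(self, data):
        if self._depth > 0:
            self._chunks.append(data)


def extract_references(html):
    """Return the reference strings found in an HTML fragment."""
    parser = ReferenceTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.references


def build_citation_list_request(references, base_url="http://localhost:8070"):
    """Build (but do not send) a POST request for /api/processCitationList.

    Assumption: the service expects one `citations` form field per reference.
    """
    body = urlencode([("citations", ref) for ref in references]).encode("utf-8")
    return Request(
        base_url + "/api/processCitationList",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def parse_references(html, base_url="http://localhost:8070"):
    """Send the extracted references to a running GROBID server and return the response body."""
    request = build_citation_list_request(extract_references(html), base_url)
    with urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `parse_references(page_html)` needs a running GROBID server. The extraction step is also site-specific, which is part of the argument for handling HTML on the server side.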

Now, if we convert the HTML page to PDF and parse the PDF with /api/processReferences, GROBID successfully parses all references. But it is overkill to convert an HTML document to a PDF document just to parse it.

Let us look at the issue from a different perspective. What is the purpose of /api/processReferences? It is used when we only need the references. First, we may not have access to the PDF, but the references are available online. Second, even if we have the PDF file, it is more efficient (fewer resources are needed) to parse an HTML document than a PDF document.

I am not sure how much work it would take to add this feature. HTML input is different, though not very different from XML, which GROBID already handles, and there are many libraries for parsing HTML. However, I believe it is worth the effort, as it could be used for the large number of papers that are behind a paywall.

keto33, May 01 '23 17:05