Using refextract for unstructured references

Open fschwenn opened this issue 7 years ago • 1 comment

When the metadata for an article includes references, but only in unstructured form, refextract should be run in the workflow after the individual spider (pipeline.py?).

At the moment, refextract is only called if a fulltext is attached, but this won't be the case for all records. And in some cases, even when a fulltext is available, it is better to start from a list of individual unstructured references than from the complete PDF, where refextract first has to find that list.

fschwenn, Jul 07 '17 07:07

As already said by email, I don't think this should be done in hepcrawl (contents of the email follow).

There are several cases that can arise:

  1. The publisher makes available a full structured reference
  2. The publisher makes available a list of unstructured references
  3. The publisher does not make any reference available in the metadata

In case 1., we don't need refextract, as Hepcrawl can do the conversion from the publisher's reference format to ours, whereas in case 3. there is nothing Hepcrawl can do besides providing the PDF.

So case 2. remains, but I think it would be better to have Hepcrawl populate the raw references in the record, and run refextract (or in the future maybe Grobid) in the workflow, as is currently done to extract references from the PDF. We should have a task there that does reference extraction from raw references in case they have been provided but there are no parsed references. In this way, we cleanly separate the task of translating between metadata formats (Hepcrawl) and parsing references (refextract), and in the future we can easily swap refextract for Grobid when it is mature enough.
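
A rough sketch of what such a workflow task could look like, to make the proposal concrete. Everything here is an assumption for illustration: the record layout (a `references` list whose items hold `raw_refs` and, once parsed, `reference`), the function names, and the `parse_raw` callable, which stands in for whatever wraps refextract (or later Grobid). It is not existing hepcrawl or workflow code.

```python
def needs_extraction(ref):
    """True when a reference carries raw text but no parsed fields yet."""
    return bool(ref.get("raw_refs")) and not ref.get("reference")


def extract_from_raw_refs(record, parse_raw):
    """Fill in parsed fields for references that only have raw text.

    `record` is a hypothetical, simplified record dict.
    `parse_raw` is any callable mapping a raw reference string to a dict
    of parsed fields; the parser is injected so it can be swapped
    (refextract today, possibly Grobid later) without touching this task.
    """
    for ref in record.get("references", []):
        if needs_extraction(ref):
            raw_text = ref["raw_refs"][0]["value"]
            parsed = parse_raw(raw_text)
            if parsed:
                # Only set parsed fields when the parser returned something.
                ref["reference"] = parsed
    return record
```

Because the parser is passed in, the task leaves references that are already structured (case 1.) untouched, does nothing when no references are present (case 3.), and only acts on case 2.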

michamos, Jul 10 '17 07:07