TARA reference extraction for documents with empty plaintext

Open przemyslawjacewicz opened this issue 5 years ago • 1 comment

Currently the TARA madis script requires non-empty plaintext. This means that we can perform TARA project reference extraction for only a fraction of all documents: we have about 10M documents with plaintext but about 130M metadata records. As stated by @johnfouf, the script can be modified to run with empty plaintext. We should introduce this version of the script and process the whole metadata dataset.

Jun 30 '20 16:06 przemyslawjacewicz

Once we introduce the changes proposed by @johnfouf, we could swap the joining order mentioned in https://github.com/openaire/iis/pull/1098#discussion_r445615428.

Jul 07 '20 14:07 marekhorst