
TARA reference extraction for documents with empty plaintext

Open przemyslawjacewicz opened this issue 5 years ago • 1 comment

Currently the TARA madis script requires non-empty plaintext. This means we can perform TARA project reference extraction for only a fraction of all documents: we have about 10M documents with plaintext but about 130M metadata records. As stated by @johnfouf, the script can be modified to run with empty plaintext. We should introduce this version of the script and process the whole metadata dataset.

przemyslawjacewicz avatar Jun 30 '20 16:06 przemyslawjacewicz

Once we introduce the changes proposed by @johnfouf, we could swap the joining order mentioned in https://github.com/openaire/iis/pull/1098#discussion_r445615428.

marekhorst avatar Jul 07 '20 14:07 marekhorst