iis
TARA reference extraction for documents with empty plaintext
Currently, the TARA madis script requires non-empty plaintext. This means we can perform TARA project reference extraction for only a fraction of all documents: we have about 10M documents with plaintext but about 130M metadata records. As stated by @johnfouf, the script can be modified to run with empty plaintext. We should introduce this version of the script and process the whole metadata dataset.
Once we introduce the changes proposed by @johnfouf, we could swap the join order mentioned in https://github.com/openaire/iis/pull/1098#discussion_r445615428.