Align all the DocumentToConceptId references to a single schema usage

Open marekhorst opened this issue 5 years ago • 0 comments

This is a #1067 follow-up.

Originally (a long time ago) there was only one DocumentToConceptId schema definition, located at:

eu.dnetlib.iis.referenceextraction.researchinitiative.schemas.DocumentToConceptId

used by the researchinitiative reference extraction algorithm. At some point it was shared with other algorithms relying on concept references (namely concept matching, community reference extraction, etc.).

After community matching started to provide a textsnippet field, a dedicated schema was created at:

eu.dnetlib.iis.referenceextraction.community.schemas.DocumentToConceptId

having an additional textsnippet field defined.

So we had two schemas sharing the same name, so that both could be handled by the same aggregation script, which builds DocumentToConceptIds records with concepts grouped per publication.
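The grouping performed by the aggregation step can be pictured with a rough Python sketch. This is not the actual aggregation script. The field names (`documentId`, `conceptId`, `confidenceLevel`, `textsnippet`) and the shape of the grouped output are assumptions made purely for illustration.

```python
from collections import defaultdict


def group_concepts(records):
    """Group flat document-to-concept records by document.

    Each input record is a dict with a "documentId" key (assumed name);
    all remaining keys are treated as the description of one concept.
    """
    grouped = defaultdict(list)
    for rec in records:
        # Everything except the document id belongs to the concept entry.
        concept = {k: v for k, v in rec.items() if k != "documentId"}
        grouped[rec["documentId"]].append(concept)
    # One output record per document, in order of first appearance.
    return [
        {"documentId": doc_id, "concepts": concepts}
        for doc_id, concepts in grouped.items()
    ]
```

With input records that carry a textsnippet and records that do not, the grouped concepts end up with inconsistent fields, which is one way to see why a single canonical schema is desirable.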

In #1067 PDB mining was also extended with a textsnippet field, so the existing community schema, which already had that field:

eu.dnetlib.iis.referenceextraction.community.schemas.DocumentToConceptId

was made more generic by moving it to:

eu.dnetlib.iis.referenceextraction.common.schemas.DocumentToConceptId

and reused in the PDB mining context.

Since we recently agreed that all the mining algorithms should provide a textsnippet field, we should treat the common schema as the canonical representation for all the algorithms producing concepts.

The aim of this task is to replace all occurrences of the textsnippet-less schema:

eu.dnetlib.iis.referenceextraction.researchinitiative.schemas.DocumentToConceptId

with:

eu.dnetlib.iis.referenceextraction.common.schemas.DocumentToConceptId

by:

- making all concept-related madis scripts provide a textsnippet field in their output
- replacing all workflow references to the old schema with the canonical schema name
- supplementing all JSON files that are part of integration tests with a textsnippet field
- removing the obsolete researchinitiative schema definition
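
The workflow and test-data steps above could be handled with a small one-off helper. The following is a minimal sketch, not part of the IIS codebase. The function names, the assumption that test files hold one JSON record per line, and the choice of `null` as the default textsnippet value are all hypothetical and would need to be checked against the real files and the common schema.

```python
import json

OLD_SCHEMA = "eu.dnetlib.iis.referenceextraction.researchinitiative.schemas.DocumentToConceptId"
NEW_SCHEMA = "eu.dnetlib.iis.referenceextraction.common.schemas.DocumentToConceptId"


def replace_schema_reference(text: str) -> str:
    """Point every reference to the old schema name at the canonical one."""
    return text.replace(OLD_SCHEMA, NEW_SCHEMA)


def add_textsnippet(json_lines: str, default=None) -> str:
    """Add a textsnippet field to each JSON record that lacks one.

    Assumes one JSON object per non-empty line (hypothetical layout).
    Records that already have a textsnippet keep their value.
    """
    out = []
    for line in json_lines.splitlines():
        if not line.strip():
            out.append(line)  # keep blank lines as they are
            continue
        record = json.loads(line)
        record.setdefault("textsnippet", default)
        out.append(json.dumps(record))
    return "\n".join(out)
```

Both functions work on strings rather than files, so they can be tried on a single workflow definition or test file before being applied across the repository.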

Apr 21 '20 15:04 marekhorst