Align all the DocumentToConceptId references to a single schema usage
This is a #1067 follow-up.
Originally (a long time ago) there was only one DocumentToConceptId schema definition, located at:
eu.dnetlib.iis.referenceextraction.researchinitiative.schemas.DocumentToConceptId
used by the researchinitiative reference extraction algorithm. At some point it was shared and used by different algorithms relying on concept references (namely concept matching, community reference extraction, etc.).
After community matching started to provide a textsnippet field, a dedicated schema was created at:
eu.dnetlib.iis.referenceextraction.community.schemas.DocumentToConceptId
with an additional textsnippet field defined.
So we had 2 schemas sharing the same name, so that both could be handled by the same aggregation script, which builds DocumentToConceptIds records with concepts grouped for each publication.
In #1067 PDB mining was also extended with a textsnippet field, so the already existing community schema with the additional textsnippet field:
eu.dnetlib.iis.referenceextraction.community.schemas.DocumentToConceptId
was made more generic by moving it to:
eu.dnetlib.iis.referenceextraction.common.schemas.DocumentToConceptId
and used in the PDB mining context.
Since we recently agreed that all the mining algorithms should provide a textsnippet field, we should consider the common schema the canonical representation for all the algorithms producing concepts.
The aim of this task is to replace all occurrences of the textsnippet-less schema:
eu.dnetlib.iis.referenceextraction.researchinitiative.schemas.DocumentToConceptId
with:
eu.dnetlib.iis.referenceextraction.common.schemas.DocumentToConceptId
by:
- making all concept-related madis scripts provide a textsnippet field at output
- replacing all workflow references with the canonical schema name
- supplementing all JSON files that are part of integration tests with a textsnippet field
- removing the obsolete researchinitiative schema definition
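
The step of supplementing the integration test JSON files could be scripted roughly as below. This is only a minimal sketch under stated assumptions: it assumes the test files hold one JSON record per line and that the textsnippet field may be null in the common schema. The helper names (add_textsnippet, supplement_json_lines) and the record fields in the usage note are hypothetical and should be checked against the actual schema and test resources.

```python
import json


def add_textsnippet(record, default=None):
    """Return a copy of the record that has a 'textsnippet' key.

    An existing value is kept; otherwise the default is used.
    Assumption: null is an acceptable value for textsnippet.
    """
    updated = dict(record)
    updated.setdefault("textsnippet", default)
    return updated


def supplement_json_lines(text, default=None):
    """Add 'textsnippet' to every record in newline-delimited JSON text.

    Assumption: one JSON object per line; blank lines are skipped.
    """
    out = []
    for line in text.splitlines():
        if not line.strip():
            continue
        out.append(json.dumps(add_textsnippet(json.loads(line), default)))
    return "\n".join(out)
```

For instance, calling add_textsnippet on a record such as {"documentId": "doc1", "conceptId": "c1"} (illustrative field names) returns a copy with "textsnippet" set to None, while records that already carry a snippet are left unchanged.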