Avoid introducing patent and software duplicates when exporting entities
Currently the only entities exported by IIS are patent and software entities. They are the outcome of patent matching and software matching, respectively.
Software entities are built from DocumentToSoftwareUrlWithMeta avro records, which hold the metadata obtained from software landing pages.
Patent entities are built from Patent avro records, which hold the metadata retrieved from the EPO endpoint.
We should match those entity candidates against the entities already existing in the InfoSpace graph. Whenever an existing entity is found, we should update the entity identifier in the relation record (either doc->software or doc->patent) to point to the existing entity, and skip exporting the new entity.
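The intended flow could be sketched as follows. This is only an illustration in Python, not the actual IIS job code: the record shapes (plain dicts with `id`, `documentId` and `entityId` fields) and the `resolve_against_graph` name are hypothetical, and the mapping from candidate identifiers to existing entity identifiers is assumed to be already computed by the matching step.

```python
def resolve_against_graph(candidates, relations, matched_ids):
    """Redirect relations to existing entities and drop matched candidates.

    candidates  -- list of dicts, each with an "id" key (new entity candidates)
    relations   -- list of dicts with "documentId" and "entityId" keys
    matched_ids -- dict: candidate id -> id of the existing graph entity
    All field names are hypothetical, chosen only for this sketch.
    """
    # A relation pointing at a matched candidate gets the existing entity id;
    # all other relations are copied unchanged.
    updated_relations = [
        {**rel, "entityId": matched_ids.get(rel["entityId"], rel["entityId"])}
        for rel in relations
    ]
    # Only candidates without a match are exported as new entities.
    entities_to_export = [c for c in candidates if c["id"] not in matched_ids]
    return entities_to_export, updated_relations
```

With this shape, a matched candidate never reaches the export, while the relation to its document is preserved and points to the entity already present in the graph.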
The matching rules should be defined according to the metadata available on both ends: existing entities and new entity candidates. appln_auth and appln_nr could be used for patents, while softwareUrl could be used for software entities. We could also rely on DOIs. They are currently unavailable in DocumentToSoftwareUrlWithMeta and Patent avro records, but if they are available in the sources (software landing pages and the EPO endpoint, respectively), we might want to extend the mentioned schemas and import DOIs.
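A rough sketch of such key-based matching, again in Python for illustration only. It assumes that existing graph entities expose the same field names as the candidates (appln_auth, appln_nr, softwareUrl), which may not hold in practice. The normalisation (trimming whitespace, upper-casing the authority code, dropping a trailing slash from the URL) and the function names are assumptions as well.

```python
def patent_key(record):
    """(appln_auth, appln_nr) key, or None when either field is missing."""
    auth = (record.get("appln_auth") or "").strip().upper()
    nr = (record.get("appln_nr") or "").strip()
    return (auth, nr) if auth and nr else None


def software_key(record):
    """softwareUrl without surrounding whitespace or trailing slash, or None."""
    url = (record.get("softwareUrl") or "").strip().rstrip("/")
    return url or None


def match_candidates(candidates, existing, key_fn):
    """Return {candidate id: existing entity id} for records sharing a key."""
    index = {}
    for entity in existing:
        key = key_fn(entity)
        if key is not None:
            # Keep the first existing entity seen for a given key.
            index.setdefault(key, entity["id"])
    matches = {}
    for candidate in candidates:
        key = key_fn(candidate)
        # Records with incomplete metadata yield a None key and never match.
        if key is not None and key in index:
            matches[candidate["id"]] = index[key]
    return matches
```

A DOI-based rule could be added as one more key function once DOIs are imported into the schemas.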