
Propagate citation matches to references sharing the same text

Open marekhorst opened this issue 8 years ago • 1 comments

Currently, using the direct citation matching algorithm (matching based on external identifiers such as pmid and doi provided in references obtained from EPMC), we are able to match over 14 million citations out of 101 million bibliographic references on the beta infrastructure.

According to our research, over 3 million unmatched citations have exact text matches among the matched citations. This happens because some bibliographic references were not supplemented with external identifiers, even though the same references (compared by text) in other publications had an external identifier set. This may be due to incompleteness of the EPMC corpus, but some of those references may also originate from PDF files, where external identifiers are not defined.

We should check those candidates, filter out all invalid references (nulls, empty, truncated, etc.), and propagate OpenAIRE matches to the unmatched bibliographic references before running the fuzzy citation matching algorithm.
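The propagation step described above could be sketched roughly as follows. This is only an illustration and not the IIS implementation: the `Reference` type, the `MIN_TEXT_LENGTH` threshold used as a stand-in for the "truncated" check, the whitespace normalization, and the rule of skipping texts matched to more than one document are all assumptions made for the example.

```python
from typing import Dict, List, NamedTuple, Optional, Set


class Reference(NamedTuple):
    """Hypothetical record for one bibliographic reference."""
    ref_id: str
    text: Optional[str]
    matched_doc_id: Optional[str]  # None when direct matching found nothing


# Assumed threshold: shorter texts are treated as truncated or invalid.
MIN_TEXT_LENGTH = 30


def normalize(text: str) -> str:
    # Assumption: collapse whitespace only, so the comparison stays
    # close to an exact text match.
    return " ".join(text.split())


def is_valid(text: Optional[str]) -> bool:
    # Filters out nulls, empty strings and very short (likely truncated) texts.
    return text is not None and len(normalize(text)) >= MIN_TEXT_LENGTH


def propagate_matches(references: List[Reference]) -> Dict[str, str]:
    """Return {ref_id: doc_id} for unmatched references whose text equals
    the text of an already matched reference."""
    # First pass: index matched references by normalized text.
    text_to_docs: Dict[str, Set[str]] = {}
    for ref in references:
        if ref.matched_doc_id is not None and is_valid(ref.text):
            text_to_docs.setdefault(normalize(ref.text), set()).add(ref.matched_doc_id)

    # Second pass: propagate only when the text maps to exactly one document
    # (assumed conservative rule for ambiguous texts).
    propagated: Dict[str, str] = {}
    for ref in references:
        if ref.matched_doc_id is None and is_valid(ref.text):
            docs = text_to_docs.get(normalize(ref.text))
            if docs is not None and len(docs) == 1:
                propagated[ref.ref_id] = next(iter(docs))
    return propagated
```

On the real data volume (101 million references) this would more likely be expressed as a distributed join on the reference text rather than an in-memory dictionary; the sketch only shows the filtering and propagation logic.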

With #834 already integrated, we should include all those new matches in the input_matched_citations input of fuzzy citation matching.

marekhorst, Oct 13 '17 11:10

Part of this task could be undertaken in the scope of #1154, at least in the context of bibrefs deduplication.

marekhorst, Sep 25 '20 15:09