specs icon indicating copy to clipboard operation
specs copied to clipboard

full-text matching

Open VladimirAlexiev opened this issue 5 years ago • 1 comments

I think that in many cases it will be useful to match rows by their textual content, using it as a general context for the entities in the KB.

NLP uses that all the time, eg "bank" as financial institution in an article about finance vs "bank" as a river feature in an article about nature or geography. Approaches include TF/IDF, word embeddings, etc.

Use cases:

  • VIAF only has fields "name, birth, death", but some alt names often include profession and occupation
  • WD recon could leverage Wikipedia abstracts (the text before the first heading), which are available in DBP

Implementation:

  • The recon server should collect a specified "text molecule" for each entity by navigating specified properties and paths, and expose it as prop "full text"
  • OntoRefine should allow the user to select a bunch (or all) columns and submit their text together.
  • Maybe we should allow for separation of text by language, eg "full text (en)" vs "full text (bg)"

VladimirAlexiev avatar Nov 12 '20 16:11 VladimirAlexiev

On the surface, this seems like something that can already be done with the current API, no? If not, what would we need to change in the specs?

wetneb avatar Nov 12 '20 22:11 wetneb