WikiEntVec

duplicates in the dictionary

Open • aoussou opened this issue 4 years ago • 1 comment

Hello,

It appears that some words occur multiple times in the reference dictionary for the pretrained model. The duplicates are in square brackets. Could you please explain the difference between the two forms?

For example, when looking up the words most similar to 生物, you get both

動物:0.8069652318954468 and [動物]: 0.7727672457695007

[Screenshot from 2021-05-09 22-57-11]

This is the code I ran to get this result:

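import gensim
# Path to the pretrained model in binary word2vec format
model_path = './20170201/entity_vector/entity_vector.model.bin'
model = gensim.models.KeyedVectors.load_word2vec_format(model_path, binary=True)
# The query is a word, not a raw vector, so look it up with similar_by_word
word = '生物'
syms = model.similar_by_word(word, topn=10, restrict_vocab=None)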

I also checked the entity_vector.model.txt file, and there are indeed words both with and without square brackets.

aoussou, May 09 '21 14:05

Hi, @aoussou.

The square-bracketed words are named entities (NEs), which originate from the anchor texts of hyperlinks in Wikipedia articles.

Since our algorithm for detecting NEs is distantly supervised, some conceptual words (e.g., 動物, "animal") end up being processed both as words and as NEs. Please refer to README#concepts for more information.
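To see which words are affected, one option is to scan the vocabulary for tokens that exist both in plain form and in square brackets. The following is only a minimal sketch: `words_with_entity_twin` is a hypothetical helper, not part of WikiEntVec or gensim, and the sample vocabulary is made up for illustration. With a loaded model, the token list could be taken from the model itself (for example `model.index_to_key` in gensim 4.x).

```python
from typing import Iterable, List


def words_with_entity_twin(vocab: Iterable[str]) -> List[str]:
    """Return plain words that also occur as a square-bracketed NE token.

    Hypothetical helper for illustration; it assumes NE tokens are written
    as "[name]", as in the model files discussed in this issue.
    """
    tokens = set(vocab)
    return sorted(
        token
        for token in tokens
        if not token.startswith("[") and f"[{token}]" in tokens
    )


# Made-up sample vocabulary: only 動物 appears in both forms.
sample_vocab = ["生物", "動物", "[動物]", "[東京都]"]
duplicated = words_with_entity_twin(sample_vocab)
print(duplicated)
```

Each word returned this way has two separate vectors in the model, one learned from its occurrences as an ordinary word and one from its occurrences as a linked entity, which is why both can show up in the same similarity list.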

Note that in recent versions of the distributed files, such NE tokens are marked with ## signs instead. (We have just updated the README to clarify this.)
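Code that has to work with both older and newer files may want to recognize either marker. The sketch below is a hypothetical helper, not part of WikiEntVec, and it assumes the ## signs wrap the token on both sides (as in `##動物##`); the reply above only says that ## signs are used, so the exact layout should be checked against the README.

```python
from typing import Optional

# (opener, closer) pairs for NE tokens. The "##" pair is an ASSUMPTION
# about the newer file format; the "[", "]" pair is the older format.
NE_MARKERS = (("[", "]"), ("##", "##"))


def entity_name(token: str) -> Optional[str]:
    """Return the bare entity name if token is marked as an NE, else None."""
    for opener, closer in NE_MARKERS:
        long_enough = len(token) > len(opener) + len(closer)
        if long_enough and token.startswith(opener) and token.endswith(closer):
            return token[len(opener):len(token) - len(closer)]
    return None


print(entity_name("[動物]"))
print(entity_name("##動物##"))
print(entity_name("動物"))
```

Returning `None` for plain words keeps the two cases easy to tell apart when filtering a similarity list.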

Thanks!

singletongue, May 11 '21 06:05