Chinese-Word-Vectors
How to define a baseline for "good word2vec"
I used the toolkit to evaluate my vectors and got a result. However, could you tell us what range of values indicates good vectors?
That's a good question.
The evaluation is a typical word analogy task. For example, given the words "man", "king" and "woman", we use the word vectors to compute (king - man + woman). If the word most similar to the result is "queen", the question is answered correctly. There are 17813 analogy questions in total in the evaluation set.
Analogy evaluation measures the extent to which word vectors capture linguistic relations. The score is the accuracy over the analogy questions, so the higher the accuracy, the better.
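To make the procedure concrete, here is a minimal sketch of how one analogy question can be answered and how accuracy can be computed. It is not the toolkit's actual code: the function names, the toy 2-dimensional vectors, and the choice to exclude the three question words from the candidates (a common convention) are all assumptions made for illustration.

```python
import numpy as np

def normalize(vectors):
    # Scale every row to unit length so a dot product equals cosine similarity.
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def solve_analogy(a, b, c, vocab, vectors):
    """Return the word most similar to (b - a + c), e.g. king - man + woman.

    Assumption: the three question words are excluded from the candidates.
    """
    index = {word: i for i, word in enumerate(vocab)}
    target = vectors[index[b]] - vectors[index[a]] + vectors[index[c]]
    target = target / np.linalg.norm(target)
    scores = vectors @ target              # cosine similarity to every word
    for word in (a, b, c):
        scores[index[word]] = -np.inf      # never answer with a question word
    return vocab[int(np.argmax(scores))]

def analogy_accuracy(questions, vocab, vectors):
    # Each question is (a, b, c, expected): "a is to b as c is to expected".
    correct = sum(
        solve_analogy(a, b, c, vocab, vectors) == expected
        for a, b, c, expected in questions
    )
    return correct / len(questions)

# Hypothetical toy vectors: first axis ~ gender, second axis ~ royalty.
vocab = ["man", "woman", "king", "queen", "apple"]
vectors = normalize(np.array([
    [1.0, 0.0],    # man
    [-1.0, 0.0],   # woman
    [1.0, 1.0],    # king
    [-1.0, 1.0],   # queen
    [0.0, -1.0],   # apple
]))

questions = [
    ("man", "king", "woman", "queen"),
    ("king", "queen", "man", "woman"),
]
answer = solve_analogy("man", "king", "woman", vocab, vectors)
accuracy = analogy_accuracy(questions, vocab, vectors)
print(answer, accuracy)
```

On the real evaluation set the same ratio would be taken over all the analogy questions, and questions containing out-of-vocabulary words need separate handling, which this sketch leaves out.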
For more information about the analogy evaluation, you could read the paper: Shen Li, et al. Analogical Reasoning on Chinese Morphological and Semantic Relations, ACL 2018.
If you are interested in selecting a good embedding resource for downstream tasks, e.g. text classification and named entity recognition, the conclusions of this paper may be useful: Yuanyuan Qiu, et al. Revisiting Correlations between Intrinsic and Extrinsic Evaluations of Word Embeddings, CCL 2018.