
[WIP][HIVEMALL-118] word2vec

Open nzw0301 opened this issue 8 years ago • 31 comments

What changes were proposed in this pull request?

Add a new algorithm: skip-gram with negative sampling (a.k.a. word2vec).
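
For readers unfamiliar with the algorithm, the following is a minimal sketch of one skip-gram-with-negative-sampling update in plain Python. It is not the Hivemall implementation from this PR; the names `sgns_step`, `init_vectors`, `w_in`, and `w_out` are hypothetical, and how negative words are sampled is left to the caller.

```python
import math
import random


def sigmoid(x):
    # Clamp the input so math.exp cannot overflow on extreme scores.
    x = max(-30.0, min(30.0, x))
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sgns_step(w_in, w_out, center, context, negatives, lr=0.025):
    """Apply one SGD step for a (center, context) pair plus sampled negatives."""
    v = w_in[center]
    grad_v = [0.0] * len(v)
    # Label 1 for the observed context word, 0 for every negative sample.
    targets = [(context, 1.0)] + [(n, 0.0) for n in negatives]
    for word, label in targets:
        u = w_out[word]
        g = lr * (label - sigmoid(dot(v, u)))
        for i in range(len(v)):
            grad_v[i] += g * u[i]  # accumulate the gradient for the input vector
            u[i] += g * v[i]       # update the output vector in place
    # Update the input vector once, after all targets have been processed.
    for i in range(len(v)):
        v[i] += grad_v[i]


def init_vectors(vocab_size, dim, rng):
    """Small random input vectors and zero output vectors (an assumed initialization)."""
    w_in = [[(rng.random() - 0.5) / dim for _ in range(dim)] for _ in range(vocab_size)]
    w_out = [[0.0] * dim for _ in range(vocab_size)]
    return w_in, w_out


rng = random.Random(0)
w_in, w_out = init_vectors(vocab_size=5, dim=4, rng=rng)
for _ in range(100):
    sgns_step(w_in, w_out, center=0, context=1, negatives=[2, 3], lr=0.1)

print(sigmoid(dot(w_in[0], w_out[1])))  # score of the observed pair
print(sigmoid(dot(w_in[0], w_out[2])))  # score of a negative sample
```

Repeating the step pushes the score of the observed pair above 0.5 and the scores of the negatives below it, which is the objective the algorithm optimizes over a whole corpus.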

What type of PR is it?

Feature

What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-118

How was this patch tested?

manual tests on EMR

To train word2vec, I used a Wikipedia dataset preprocessed by this Perl script.

I evaluated the word vectors with https://github.com/kudkudak/word-embeddings-benchmarks .

from six import iteritems
from web.datasets.similarity import fetch_MEN, fetch_WS353, fetch_SimLex999, fetch_RW, fetch_RG65, fetch_MTurk
from web.datasets.analogy import fetch_msr_analogy, fetch_google_analogy, fetch_semeval_2012_2, fetch_wordrep
import gensim
from gensim.models.word2vec import Word2Vec, LineSentence

from web.embeddings import load_embedding
from web.evaluate import evaluate_similarity, evaluate_analogy

sim_tasks = {
    "MEN      ": fetch_MEN(),
    "WS353    ": fetch_WS353(),
    "SIMLEX999": fetch_SimLex999(),
    "RW       ": fetch_RW(),
    "RG       ": fetch_RG65(),
    "MTurk    ": fetch_MTurk()
}
analogy_tasks = {
    "google": fetch_google_analogy(),
    "msr   ": fetch_msr_analogy()
}

docs = LineSentence('PATH/TO/PREPROCESSED_DATA')
model = Word2Vec(docs, size=100, window=5, min_count=15, workers=8, negative=15, hs=0, sg=1, iter=1)
model.wv.save_word2vec_format('./gensim_sg.txt')

# Load the saved vectors under a distinct name so the imported gensim module is not shadowed.
embedding = load_embedding('./gensim_sg.txt', 'word2vec')

for name, data in iteritems(sim_tasks):
    print("Spearman correlation of scores on {} {}".format(name, evaluate_similarity(embedding, data.X, data.y)))
Spearman correlation of scores on MEN       0.6483416401993833
Spearman correlation of scores on WS353     0.6169418277184877
Spearman correlation of scores on SIMLEX999 0.3070155939988943
Spearman correlation of scores on RW        0.28548732030155277
Spearman correlation of scores on RG        0.6762247194247315
Spearman correlation of scores on MTurk     0.6471504497920156
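
As a rough illustration of what these numbers measure, the sketch below computes a Spearman correlation between cosine similarities of word vectors and human similarity ratings. It is a simplified stand-in, not the benchmark package's code: the function names and the toy vectors and ratings are made up, and it assumes every word in every pair has a vector.

```python
import math


def cosine(u, v):
    d = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return d / norm


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(average_ranks(x), average_ranks(y))


def similarity_score(vectors, pairs, human_scores):
    """Correlate model similarities with human ratings for the given word pairs."""
    model_scores = [cosine(vectors[a], vectors[b]) for a, b in pairs]
    return spearman(model_scores, human_scores)


# Toy data, invented for illustration only.
vectors = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
pairs = [("cat", "dog"), ("cat", "car"), ("dog", "car")]
human_scores = [9.0, 1.0, 2.0]
print(similarity_score(vectors, pairs, human_scores))
```

Because only the ranking of the pairs matters, a score near 1 means the model orders word pairs by similarity much as the human annotators did, regardless of the absolute cosine values.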

CBoW model of hivemall

Spearman correlation of scores on MEN       0.6247965194705783
Spearman correlation of scores on WS353     0.6225747519511903
Spearman correlation of scores on SIMLEX999 0.2985588069793148
Spearman correlation of scores on RW        0.27686018664704454
Spearman correlation of scores on RG        0.6528832630934683
Spearman correlation of scores on MTurk     0.6307218892934624

Skip-gram of hivemall

Spearman correlation of scores on MEN       0.6202415358400425
Spearman correlation of scores on WS353     0.6235303875587551
Spearman correlation of scores on SIMLEX999 0.2983910352562464
Spearman correlation of scores on RW        0.2939926699533969
Spearman correlation of scores on RG        0.6782791172666216
Spearman correlation of scores on MTurk     0.6344295663665642

CBoW of hivemall when the number of reducers for training is 4

Spearman correlation of scores on MEN       0.5392243963768429
Spearman correlation of scores on WS353     0.546436543545682
Spearman correlation of scores on SIMLEX999 0.2529742987287988
Spearman correlation of scores on RW        0.28611116074856136
Spearman correlation of scores on RG        0.5040049854449996
Spearman correlation of scores on MTurk     0.5953150581437371

How to use this feature?

Please see word2vec.md.

Checklist

  • [x] Did you apply source code formatter, i.e., mvn formatter:format, for your commit?

nzw0301 avatar Sep 21 '17 02:09 nzw0301

Coverage Status

Coverage decreased (-0.7%) to 40.18% when pulling 7014e8552f81d59ab35c95d8fcf54c56c24ba2c9 on nzw0301:skipgram into c2b95783cf9d6fc1646a48ac928e96152eab98c6 on apache:master.

coveralls avatar Sep 21 '17 03:09 coveralls

Coverage decreased (-0.7%) to 40.185% when pulling 7a5fd547caef5d1af512422dc75dd0efdf5b9466 on nzw0301:skipgram into c2b95783cf9d6fc1646a48ac928e96152eab98c6 on apache:master.

coveralls avatar Sep 21 '17 06:09 coveralls

Coverage decreased (-0.7%) to 40.179% when pulling c224912606f0dfd4d7d53acb0eac3fecc016b335 on nzw0301:skipgram into c2b95783cf9d6fc1646a48ac928e96152eab98c6 on apache:master.

coveralls avatar Sep 21 '17 07:09 coveralls

Coverage decreased (-0.8%) to 40.165% when pulling 83198617bcf82634d39d715e33499f99945f2ebb on nzw0301:skipgram into c2b95783cf9d6fc1646a48ac928e96152eab98c6 on apache:master.

coveralls avatar Sep 21 '17 12:09 coveralls

Coverage decreased (-0.8%) to 40.156% when pulling ad2b2911b5a6ebb7b43b4981bf5ff4424425a292 on nzw0301:skipgram into c2b95783cf9d6fc1646a48ac928e96152eab98c6 on apache:master.

coveralls avatar Sep 21 '17 13:09 coveralls

Coverage decreased (-0.8%) to 40.149% when pulling 39d11236d100a92d54cc46d8ffce4bd89670217f on nzw0301:skipgram into c2b95783cf9d6fc1646a48ac928e96152eab98c6 on apache:master.

coveralls avatar Sep 21 '17 14:09 coveralls

Coverage decreased (-0.8%) to 40.138% when pulling e50756111378e6a173aba7574e73435364ff42d0 on nzw0301:skipgram into c2b95783cf9d6fc1646a48ac928e96152eab98c6 on apache:master.

coveralls avatar Sep 22 '17 04:09 coveralls

Coverage decreased (-0.8%) to 40.137% when pulling bf5d92748577535a8f08b75fe34c1139e1c6c81b on nzw0301:skipgram into c2b95783cf9d6fc1646a48ac928e96152eab98c6 on apache:master.

coveralls avatar Sep 22 '17 07:09 coveralls

Coverage decreased (-0.8%) to 40.144% when pulling a3ccaa8a38ecff49d23e94a8cc7db7f895181e06 on nzw0301:skipgram into c2b95783cf9d6fc1646a48ac928e96152eab98c6 on apache:master.

coveralls avatar Sep 22 '17 08:09 coveralls

Coverage decreased (-0.8%) to 40.147% when pulling c7cba82a2eef2b5b32e67221a20c6cdb4570643a on nzw0301:skipgram into c2b95783cf9d6fc1646a48ac928e96152eab98c6 on apache:master.

coveralls avatar Sep 22 '17 09:09 coveralls

Coverage decreased (-0.8%) to 40.077% when pulling bbdb561cc1bf3194128034fe194555bb6c167144 on nzw0301:skipgram into c2b95783cf9d6fc1646a48ac928e96152eab98c6 on apache:master.

coveralls avatar Sep 22 '17 11:09 coveralls

Coverage decreased (-0.9%) to 40.057% when pulling 7a2f4dbfeb89eee78e6e0526027b5b0ba9162c29 on nzw0301:skipgram into c2b95783cf9d6fc1646a48ac928e96152eab98c6 on apache:master.

coveralls avatar Sep 25 '17 03:09 coveralls

Coverage decreased (-0.9%) to 40.029% when pulling e0945527c68da9e2c9ab6eb86a7eb7d66bf42aa7 on nzw0301:skipgram into c2b95783cf9d6fc1646a48ac928e96152eab98c6 on apache:master.

coveralls avatar Sep 25 '17 05:09 coveralls

Coverage decreased (-0.9%) to 40.025% when pulling aede5ec0cf4d01780034c0d46c456486fecc1cb3 on nzw0301:skipgram into c2b95783cf9d6fc1646a48ac928e96152eab98c6 on apache:master.

coveralls avatar Sep 25 '17 06:09 coveralls

Coverage decreased (-0.9%) to 40.028% when pulling 4abdb8f3d2632baa9b3ead928bc8bb0283027e20 on nzw0301:skipgram into c2b95783cf9d6fc1646a48ac928e96152eab98c6 on apache:master.

coveralls avatar Sep 25 '17 11:09 coveralls

Coverage decreased (-0.9%) to 39.993% when pulling 4abdb8f3d2632baa9b3ead928bc8bb0283027e20 on nzw0301:skipgram into c2b95783cf9d6fc1646a48ac928e96152eab98c6 on apache:master.

coveralls avatar Sep 25 '17 11:09 coveralls

Coverage decreased (-0.9%) to 40.05% when pulling 8a42adf3687f8c823f255fcbd0c7ff7962f81f0b on nzw0301:skipgram into c2b95783cf9d6fc1646a48ac928e96152eab98c6 on apache:master.

coveralls avatar Sep 26 '17 03:09 coveralls

Coverage decreased (-0.9%) to 40.05% when pulling d1b4270861c277109f35c8675e8297ef081f1dee on nzw0301:skipgram into c2b95783cf9d6fc1646a48ac928e96152eab98c6 on apache:master.

coveralls avatar Sep 26 '17 05:09 coveralls

Coverage decreased (-0.9%) to 40.049% when pulling f19d732122d29a00e421d7f1ad0d0ca93f242c1e on nzw0301:skipgram into c2b95783cf9d6fc1646a48ac928e96152eab98c6 on apache:master.

coveralls avatar Sep 26 '17 08:09 coveralls

Coverage decreased (-0.8%) to 40.149% when pulling 2b66e5e3dbf1408c0719735dcbdf05555938f3bb on nzw0301:skipgram into c2b95783cf9d6fc1646a48ac928e96152eab98c6 on apache:master.

coveralls avatar Sep 26 '17 12:09 coveralls

Coverage decreased (-0.8%) to 40.108% when pulling f0abd4fd99dcace050d155533ebc2dc8768cfc79 on nzw0301:skipgram into c2b95783cf9d6fc1646a48ac928e96152eab98c6 on apache:master.

coveralls avatar Sep 26 '17 15:09 coveralls

Coverage decreased (-0.9%) to 40.068% when pulling c34003804b5ef33bca6abe9113de6eb01b5e94c6 on nzw0301:skipgram into c2b95783cf9d6fc1646a48ac928e96152eab98c6 on apache:master.

coveralls avatar Sep 26 '17 17:09 coveralls

Coverage decreased (-0.9%) to 40.065% when pulling af5b5becb55e58fd9db355be218035518973844a on nzw0301:skipgram into c2b95783cf9d6fc1646a48ac928e96152eab98c6 on apache:master.

coveralls avatar Sep 27 '17 06:09 coveralls

Coverage decreased (-0.9%) to 40.065% when pulling d12ba32469a35653ee6a499157ad236bfcaa21cf on nzw0301:skipgram into c2b95783cf9d6fc1646a48ac928e96152eab98c6 on apache:master.

coveralls avatar Sep 27 '17 07:09 coveralls

Coverage decreased (-0.9%) to 40.069% when pulling 2415589bb3a8eda3e23a765b10e01fde6c70a298 on nzw0301:skipgram into c2b95783cf9d6fc1646a48ac928e96152eab98c6 on apache:master.

coveralls avatar Sep 27 '17 09:09 coveralls

Coverage decreased (-0.9%) to 40.056% when pulling da564b8cea0bd028d3f822ed750513cd28ff45c7 on nzw0301:skipgram into c2b95783cf9d6fc1646a48ac928e96152eab98c6 on apache:master.

coveralls avatar Sep 27 '17 12:09 coveralls

Under "What type of PR is it?", "Improvement" should be "Feature".

myui avatar Sep 28 '17 07:09 myui

@nzw0301 Please rebase onto master and resolve the conflicts shown above (^).

myui avatar Sep 28 '17 07:09 myui

@myui I resolved the conflicts.

nzw0301 avatar Sep 28 '17 09:09 nzw0301

Coverage decreased (-0.6%) to 40.505% when pulling 8696f5ff668adf758d3545bab5885e51ce7d053e on nzw0301:skipgram into 1e42387576fabbb326d451f4a00ac22d57828711 on apache:master.

coveralls avatar Sep 28 '17 09:09 coveralls