Lda2vec-Tensorflow
<UNK> Returned for Multiple Topics
"<UNK>" is added to the tokenizer word lists in nlppip.py because the Keras Tokenizer (from keras.preprocessing.text import Tokenizer) is one-based.
self.tokenizer.word_index["<UNK>"] = 0
self.tokenizer.word_docs["<UNK>"] = 0
self.tokenizer.word_counts["<UNK>"] = 0
The TensorFlow implementations of word embedding and embedding lookup are zero-based, so index 0 is a valid row:
word_embedding_000[0]
array([-0.72940636, 0.7893076 , -0.5647843 , -0.73255396, 0.7901778 ,
-0.49344468, 0.11772466, -0.5727272 , 0.527349 , -0.06881762,
0.44169998, -0.20452452, 0.3124647 , 0.86845255, -0.9390068 ,
-0.6195681 , 0.89950705, 0.3356259 , -0.8492527 , 0.45032454,
0.6324513 , 0.75457215, 0.21222615, -0.44409204, -0.06979871,
-0.6462743 , -0.36795807, 0.27780175, 0.94171906, 0.40449977,
-0.16222072, -0.34851456, -0.9734571 , -0.46344304, -0.80052805,
0.39213514, -0.23919392, -0.60179496, 0.34500718, -0.6585071 ,
0.18976736, -0.49871182, -0.31101155, 0.8082261 , 0.5178263 ,
-0.9620471 , -0.98253274, 0.5575602 , -0.5283928 , -0.05512738,
-0.46859574, -0.9827881 , 0.4550724 , -0.4175427 , -0.6799257 ,
0.32043505, -0.60924935, -0.08730078, -0.76487565, -0.11529756,
-0.05081773, -0.423831 , -0.69595194, -0.39993382, 0.01512861,
0.82286215, -0.96196485, -0.96162105, -0.69300675, -0.23160791,
-0.8725774 , -0.62869287, -0.21675658, 0.22361946, -0.7145815 ,
0.25228357, 0.300138 , 0.1944983 , -0.20161653, -0.00947928,
-0.50661993, 0.24620843, 0.8336489 , -0.6433666 , 0.4633739 ,
0.42356896, -0.2927196 , 0.7726562 , -0.77078557, 0.42736077,
0.2361381 , 0.8253889 , -0.03234029, 0.16903758, 0.64719176,
0.12639523, 0.468915 , 0.36462903, -0.63329506, 0.46308804,
0.9785025 , -0.60487294, -0.8659482 , 0.80265903, 0.08614421,
-0.6846776 , -0.2840774 , -0.05165243, 0.7902992 , 0.7554364 ,
0.07603502, -0.82541203, -0.03127742, -0.45349932, -0.6321502 ,
-0.75881124, 0.10189629, 0.7766483 , -0.02184248, 0.30532098,
0.40934992, -0.3520453 , -0.4991796 , 0.89320135, -0.5294213 ,
0.08958745, -0.2862544 , 0.694613 , -0.2933941 , -0.2711556 ,
-0.778697 , -0.90801215, -0.4771154 , 0.9393649 , 0.02598763,
-0.6128385 , 0.6687329 , -0.00300312, 0.39082742, -0.62328243,
-0.1326313 , -0.04318118, 0.5147674 , 0.30447197, -0.15042996,
-0.29966593, -0.19948554, -0.15503025, -0.07965088, -0.18107772,
-0.6654799 , 0.16734552, -0.6545446 , -0.19038987, 0.11273432,
-0.37501454, -0.01779771, 0.10266089, 0.6059449 , 0.53478146,
0.8791959 , -0.71896863, -0.50831914, 0.51859474, 0.7803166 ,
0.85757375, 0.58769774, -0.01653957, 0.35751534, -0.66742086,
0.09473515, -0.89558864, 0.5007875 , 0.6572523 , 0.47241664,
0.5635514 , 0.32414556, -0.53437877, 0.84779453, 0.6378653 ,
0.81033015, -0.9580946 , 0.4329822 , 0.7842884 , -0.02432752,
-0.26144147, 0.51170826, 0.18752575, 0.716552 , 0.19081879,
0.76230717, 0.95465493, 0.587734 , 0.9609244 , -0.95637846,
-0.8732126 , -0.4947157 , 0.4163556 , 0.08395147, 0.48358202,
0.6750531 , 0.6933727 , -0.66409326, -0.6555612 , -0.77092767,
0.77507496, 0.6416006 , -0.10126472, -0.20890045, 0.12876058,
-0.7351172 , 0.68103194, -0.575778 , 0.1444602 , -0.42351747,
-0.81415844, -0.58244324, -0.6112335 , -0.16471076, 0.5918329 ,
0.6705165 , -0.9932399 , 0.1535554 , 0.02513838, -0.6433432 ,
0.0850389 , -0.10692096, 0.21783972, -0.00443554, -0.5312202 ,
0.16654754, 0.1691029 , 0.9144945 , -0.20212364, -0.7347467 ,
0.1740458 , -0.8262415 , -0.05594969, -0.04339361, 0.439353 ,
-0.00228357, -0.6715636 , 0.879483 , 0.10999107, 0.8576815 ,
-0.38673759, -0.2496996 , 0.8718543 , 0.77182436, -0.91532016,
0.8322928 , -0.95677876, 0.11354065, 0.31194258, -0.7994232 ,
0.8070309 , -0.12008953, -0.555902 , -0.6638913 , 0.4023559 ,
-0.77688384, 0.12601566, -0.3632667 , -0.6541252 , 0.10901499,
0.3102548 , -0.40334034, 0.03114676, -0.7885685 , -0.20401645,
0.939183 , 0.17131758, 0.47609544, -0.17927122, -0.5007596 ,
0.9717326 , -0.0057416 , 0.81249833, 0.39427924, 0.18702984,
-0.4081514 , -0.47332573, -0.0909853 , -0.5931864 , 0.7257166 ,
0.18550944, 0.21591997, -0.02170038, -0.0661478 , -0.67937946,
-0.28355837, 0.7463348 , -0.32689762, 0.9659898 , -0.54855466,
0.72903705, -0.32373667, -0.92316556, 0.01121569, 0.17884326],
dtype=float32)
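The indexing mismatch described above can be sketched in plain Python. This is only an illustration, not the repository's code: `build_one_based_index` is a hypothetical stand-in that mimics how a one-based word index numbers words (most frequent word gets 1), and the three sample sentences are made up. It shows why row 0 of a zero-based embedding matrix has no word until `<UNK>` is assigned to it.

```python
from collections import Counter

def build_one_based_index(texts):
    """Hypothetical helper: number words from 1, most frequent first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so ties keep their first-seen order
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: i + 1 for i, (word, _) in enumerate(ordered)}

texts = ["the cat sat", "the dog sat", "the bird"]  # toy corpus (assumption)
word_index = build_one_based_index(texts)

# No word maps to 0 yet, but a zero-based embedding matrix still has a row 0
assert 0 not in word_index.values()

# Assign the unused row to a placeholder token, as in the issue
word_index["<UNK>"] = 0

idx_to_word = {i: w for w, i in word_index.items()}
vocab_size = len(word_index)

# Every row 0 .. vocab_size - 1 now maps to some token
for i in range(vocab_size):
    print(i, idx_to_word[i])
```

With this setup, anything that reports the nearest words to a vector can return `<UNK>` whenever row 0 happens to be close, even though no document token was ever mapped to it.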
Curiously, the embedding vector after training for 200 epochs that is closest (by cosine similarity) to the embedding vector at index 0 before training is:
word_embedding_000 = np.load("word_weights_000.npy")
word_embedding_199 = np.load("word_weights_199.npy")
idx = np.array([cosine_similarity(x, word_embedding_000[0]) for x in word_embedding_199]).argmin()
print(idx)
2905
print(idx_to_word[2905])
disc
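One thing worth checking in the lookup above: if `cosine_similarity` returns a similarity (1 means same direction) rather than a distance, then `argmin` selects the least similar row, and the closest row would be the `argmax`. The definition of `cosine_similarity` is not shown in the issue, so this is an assumption. A minimal sketch with toy vectors and plain Python lists (the same logic applies to NumPy arrays) shows the difference:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: 1 = same direction, -1 = opposite direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for the "before" row 0 and the "after" embedding matrix
before_row = [1.0, 0.0]
after_rows = [
    [0.0, 1.0],    # orthogonal to before_row
    [-1.0, 0.0],   # opposite direction
    [0.9, 0.1],    # nearly the same direction
]

sims = [cosine_similarity(row, before_row) for row in after_rows]

closest = max(range(len(sims)), key=sims.__getitem__)   # argmax of similarity
farthest = min(range(len(sims)), key=sims.__getitem__)  # argmin of similarity

print("closest row:", closest)
print("farthest row:", farthest)
```

If the function is instead a cosine distance (1 minus similarity), `argmin` is the right choice and the original lookup stands.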
How could one embedding vector appear in so many [orthogonal?] topics?
EPOCH: 85
LOSS 950.43896 w2v 8.754408 lda 941.6846 lda-sim 3.299621869012659
---------Closest 10 words to given indexes----------
Topic 0 : <UNK>, vending, confidential, offender, drainage, terrace, overtime, unintended, documentation, yan
Topic 1 : <UNK>, meaning, refrain, spent, largely, ran, equally, considered, decade, exact
Topic 2 : mim, lite, lea, recalibration, sonny, l, skip, unsold, vive, allen
Topic 3 : marathi, recalibration, assamese, uzbek, tagalog, gaelic, romansh, galician, razoo, recast
Topic 4 : loophole, vive, estonian, gaelic, slovenian, maracaibo, slovak, faroese, magyar, romansh
Topic 5 : <UNK>, vacant, jos, bleeding, kivu, bye, aunt, sundar, whilst, cowboy
Topic 6 : depending, closely, decided, applied, considered, spent, contrary, isolated, especially, frequently
Topic 7 : <UNK>, spinach, frost, slew, confined, yakan, ironically, dusty, shelf, bleeding
Topic 8 : basque, assamese, azerbaijani, razoo, haitian, kiswahili, recast, icelandic, nederlands, mommy
Topic 9 : spiky, recast, andhra, tauranga, revoke, recalibration, thread, mull, menacing, motoring
Topic 10 : <UNK>, rightly, inflammatory, severity, owen, incitement, disappearance, forge, magistrate, campaigner
Topic 11 : burke, chronicle, resend, fico, activation, tauranga, fetish, interstitial, unspoken, mommy
Topic 12 : <UNK>, confined, labrador, rope, modeling, shane, terrace, downpour, vernon, nutritional
Topic 13 : nederlands, suomi, allen, icelandic, seed, afrikaans, razoo, assamese, latvian, gaelic
Topic 14 : unacceptable, practically, exact, impression, mixture, certainly, hardly, toxic, younger, capture
Topic 15 : <UNK>, emergence, nose, straw, abundant, confined, copper, decreasing, ironically, litigation
Topic 16 : <UNK>, charlotte, straw, ironically, taps, spinach, yakan, confined, slew, maduro
Topic 17 : <UNK>, aggregate, crushing, knockout, versatile, distinctive, admired, pleasure, applause, wishing
Topic 18 : gaelic, romansh, slovenian, folder, resend, recast, assamese, creole, slovak, unspoken
Topic 19 : kiswahili, newsstand, ossetic, banat, assamese, faroese, creole, oriya, confucianism, romansh
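To quantify how often a token repeats across topics, the printed lists can be parsed and counted. The sketch below is a small diagnostic under the assumption that each line keeps the `Topic N : word, word, ...` format shown above. It uses the first three topic lines from this log as input, and `topic_word_counts` is a hypothetical helper name.

```python
from collections import Counter

log_lines = [
    "Topic 0 : <UNK>, vending, confidential, offender, drainage, terrace, overtime, unintended, documentation, yan",
    "Topic 1 : <UNK>, meaning, refrain, spent, largely, ran, equally, considered, decade, exact",
    "Topic 2 : mim, lite, lea, recalibration, sonny, l, skip, unsold, vive, allen",
]

def topic_word_counts(lines):
    """Count, for each word, the number of topic lines it appears in."""
    counts = Counter()
    for line in lines:
        # Everything after " : " is the comma-separated word list
        _, _, words = line.partition(" : ")
        counts.update({w.strip() for w in words.split(",") if w.strip()})
    return counts

counts = topic_word_counts(log_lines)

# Words that show up in more than one topic
shared = {word: n for word, n in counts.items() if n > 1}
print(shared)
```

Run over all 20 topic lines, the same count makes it easy to see which tokens besides `<UNK>` are shared between topics.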
@dbl001 What does <UNK> mean?