
Is the compass fixed during training on timestamped text?

wabyking opened this issue 3 years ago • 1 comment

```python
from twec.twec import TWEC
from gensim.models.word2vec import Word2Vec

def train():
    aligner = TWEC(size=30, siter=10, diter=10, workers=4)
    aligner.train_compass("examples/training/compass.txt", overwrite=False)
    slice_one = aligner.train_slice("examples/training/arxiv_14.txt", save=True)
    slice_two = aligner.train_slice("examples/training/arxiv_9.txt", save=True)

def test():
    model1 = Word2Vec.load("model/arxiv_14.model")
    model2 = Word2Vec.load("model/arxiv_9.model")
    for model in [model1, model2]:
        # sum over the vocabulary axis: one value per embedding dimension
        print(sum(model.syn1neg))  # target (output) embeddings
        print(sum(model.wv.syn0))  # context (input) embeddings

if __name__ == "__main__":
    train()
    test()
```

The output of the code is shown below:

```
[ 54.13019 -7458.793 -2588.298 3593.7505 -731.2068 1354.8907 1956.362 2851.0269 -1234.2087 -2461.2375 693.96765 4517.283 1506.449 -1617.4432 1538.4094 2772.7483 2216.757 -3763.828 2090.126 -298.45084 -294.8205 1523.8512 -4156.9824 -723.04803 -533.2238 1869.8455 -1205.959 -3589.7622 -7645.8135 -4966.196 ]

[ 75.728424 2146.7905 711.1423 -1063.1915 280.071 -428.7143 -653.19977 -737.3386 470.85577 737.1261 -51.172543 -1358.5729 -683.6471 417.5251 -398.98938 -808.00616 -600.1352 1040.6033 -659.40375 73.63555 73.206184 -372.51102 1261.4464 297.45206 212.58424 -495.39255 383.86707 955.2797 2138.7588 1448.5309 ]

[ -206.65964 -5041.21 -2035.7019 2772.9456 -725.939 1060.8079 1505.944 2003.1798 -563.0721 -1705.3502 515.3484 3435.2378 1639.7721 -1262.4358 1019.02844 1742.8516 1668.6241 -2807.0754 1269.7594 -494.86893 -221.1095 729.1342 -2732.2847 -153.8587 -501.57608 1336.3754 -1268.0028 -2143.7483 -5006.103 -3494.257 ]

[ 172.6337 2415.6028 1006.8115 -1404.1216 438.4736 -564.31976 -789.87054 -883.25604 373.4988 959.29047 -90.953415 -1708.9927 -1026.617 612.2216 -466.75372 -864.9828 -801.6127 1305.5497 -626.8068 282.45493 129.64682 -274.8585 1347.7399 130.84848 272.3334 -714.29504 643.37933 997.20715 2441.326 1698.4065 ]
```
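Instead of summing the rows, the two target matrices could be compared element-wise; a minimal sketch of such a check (the toy arrays below merely stand in for the `syn1neg` matrices loaded from the two slice models):

```python
import numpy as np

def compass_is_fixed(U1, U2, atol=1e-6):
    """True when the two target-embedding matrices coincide,
    i.e. when the compass was kept frozen across slice trainings."""
    return bool(U1.shape == U2.shape and np.allclose(U1, U2, atol=atol))

# Toy stand-ins for model1.syn1neg and model2.syn1neg:
U_frozen = np.ones((4, 3))
U_drifted = U_frozen + 0.5  # target embeddings that moved during training

print(compass_is_fixed(U_frozen, U_frozen.copy()))  # True
print(compass_is_fixed(U_frozen, U_drifted))        # False
```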

We can clearly see that neither the context embeddings nor the target embeddings are fixed. If the compass is not fixed, this work would be very similar to Kim et al., the word embeddings from different time slices would not be aligned, and it would therefore be risky to compare word vectors across years, especially if we train the temporal word vectors for more steps/epochs.
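To make the risk concrete: without a shared, frozen compass each slice's embedding space is only determined up to a rotation, and rotations leave the word2vec loss unchanged, so cosine similarities computed across two unaligned spaces are arbitrary. A toy illustration (the year labels and vectors are purely hypothetical):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embedding of the same word in two independently trained spaces:
# the 2009 space happens to be the 2014 space rotated by 90 degrees.
R = np.array([[0., -1., 0.],
              [1.,  0., 0.],
              [0.,  0., 1.]])
v_2014 = np.array([1., 0., 0.])
v_2009 = R @ v_2014

query = np.array([1., 0., 0.])
print(cosine(v_2014, query))  # 1.0 -> looks identical
print(cosine(v_2009, query))  # 0.0 -> looks unrelated, purely an artifact
```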

However, the paper states, in the section 'Temporal Word Embeddings with a Compass':

> During this training process, the target embeddings of the output matrix U are not modified, while we update the context embeddings in the input matrix Cti.

Am I wrong here? Is there anything I did not notice?

wabyking avatar May 19 '21 02:05 wabyking

(we are discussing over email)

The issue seems to be due to the pip-based installation not compiling some scripts with Cython. We are fixing this by updating our modified gensim package.

vinid avatar May 20 '21 08:05 vinid