Word2Vec
Julia interface to word2vec
Word2Vec takes a text corpus as input and produces word vectors as output. Training is done using the original C code; all other functionality is pure Julia. See the demo for more details.
Installation
Pkg.add("Word2Vec")
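Equivalently, from the Pkg REPL mode (press ] at the julia> prompt):
pkg> add Word2Vec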
Note: Only Linux and OS X are supported.
Functions
All exported functions are documented, i.e., typing ?functionname in the
REPL shows the help for that function. For a full list of functions, see
the package documentation.
Examples
We first download some text corpus, for example http://mattmahoney.net/dc/text8.zip.
Suppose the file text8 is stored in the current working directory.
We can train the model with the function word2vec.
julia> word2vec("text8", "text8-vec.txt", verbose = true)
Starting training using file text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000002  Progress: 100.04%  Words/thread/sec: 350.44k  
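Training exposes the hyperparameters of the underlying C tool as keyword arguments (size, window, negative, cbow, min_count, iter, and so on). The names below follow the package docstring and should be confirmed with ?word2vec on your installed version; the values are only illustrative, not recommendations.
julia> word2vec("text8", "text8-vec.txt",
                size = 300,        # dimensionality of the word vectors
                window = 5,        # context window size
                negative = 5,      # negative samples (negative sampling)
                cbow = 1,          # 1 = continuous bag-of-words, 0 = skip-gram
                min_count = 5,     # discard words occurring fewer than 5 times
                iter = 5,          # training iterations over the corpus
                verbose = true)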
Now we can import the word vectors text8-vec.txt to Julia.
julia> model = wordvectors("text8-vec.txt")
WordVectors 71291 words, 100-element Float64 vectors
The vector representation of a word can be obtained using
get_vector.
julia> get_vector(model, "book")
100-element Array{Float64,1}:
 -0.05446138539336186
  0.001090934639284009
  0.06498087707990222
  ⋮
 -0.0024113040415322516
  0.04755140828570571
  0.039764719065723826
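These vectors can be compared directly. For example, the cosine similarity of two words can be computed by hand with the standard library; cos_sim below is a hypothetical helper for illustration, not a package function:
julia> using LinearAlgebra

julia> cos_sim(m, w1, w2) = dot(get_vector(m, w1), get_vector(m, w2)) /
                            (norm(get_vector(m, w1)) * norm(get_vector(m, w2)))
cos_sim (generic function with 1 method)

julia> cos_sim(model, "book", "novel")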
The words most similar to book, ranked by cosine similarity, can be obtained using
cosine_similar_words.
julia> cosine_similar_words(model, "book")
10-element Array{String,1}:
 "book"
 "books"
 "diary"
 "story"
 "chapter"
 "novel"
 "preface"
 "poem"
 "tale"
 "bible"
Word vectors have many interesting properties. For example,
vector("king") - vector("man") + vector("woman") is close to
vector("queen"). The closest words to such a combination can be found with
analogy_words.
julia> analogy_words(model, ["king", "woman"], ["man"])
5-element Array{String,1}:
 "queen"
 "empress"
 "prince"
 "princess"
 "throne"
References
- Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, "Efficient Estimation of Word Representations in Vector Space", in Proceedings of Workshop at ICLR, 2013. [pdf]
- Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean, "Distributed Representations of Words and Phrases and their Compositionality", in Proceedings of NIPS, 2013. [pdf]
- Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig, "Linguistic Regularities in Continuous Space Word Representations", in Proceedings of NAACL HLT, 2013. [pdf]
Acknowledgements
The design of the package is inspired by Daniel Rodriguez (@danielfrg)'s Python word2vec interface.
Reporting Bugs
Please file an issue to report a bug or request a feature.