finalfusion-rust
Reading GoogleNews word2vec model fails
Trying to read the GoogleNews-vectors-negative300.bin word2vec model triggers this assertion: https://github.com/finalfusion/finalfusion-rust/blob/main/src/chunks/vocab/simple.rs#L28
thread 'main' panicked at 'assertion failed: `(left == right)`
left: `3000000`,
right: `2999997`: words contained duplicate entries.'
(When constructing a new simple vocabulary, the number of words (3,000,000) ends up different from the number of unique indices built from them (2,999,997).)
After some investigation, I removed this word trimming, and reading the model worked fine afterwards: https://github.com/finalfusion/finalfusion-rust/blob/main/src/compat/word2vec.rs#L98
I assume the model contains tokens that get trimmed into the same words.
Should I create a pull request to remove this line? Or is there something I'm doing wrong?
The model I used is from: https://code.google.com/archive/p/word2vec
Code:
let mut reader = BufReader::new(File::open("GoogleNews-vectors-negative300.bin").unwrap());
let model = Embeddings::read_word2vec_binary(&mut reader).unwrap();
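To illustrate my assumption about trimming, here is a minimal standalone sketch. It does not use finalfusion at all: `build_indices` is a hypothetical stand-in for how a word-to-index table might be built, and the tokens are made up rather than taken from the GoogleNews model. It only shows that two tokens differing in trailing whitespace become the same key once trimmed, so the index table ends up smaller than the word list.

```rust
use std::collections::HashMap;

/// Map each word to an index, as a simple vocabulary lookup table might
/// (a hypothetical sketch, not finalfusion's actual code). If a word occurs
/// more than once, the later index overwrites the earlier one.
fn build_indices(words: &[String]) -> HashMap<&str, usize> {
    words.iter().enumerate().map(|(i, w)| (w.as_str(), i)).collect()
}

fn main() {
    // Made-up raw tokens; the second ends in a no-break space (U+00A0).
    let raw = ["word", "word\u{a0}", "other"];

    // `str::trim` strips Unicode whitespace, so the first two tokens collide.
    let words: Vec<String> = raw.iter().map(|t| t.trim().to_owned()).collect();
    let indices = build_indices(&words);

    // The word list keeps all three entries, the index table only two.
    println!("words: {}, indices: {}", words.len(), indices.len());
}
```

If something like this is what happens in the model, it would explain why the two counts in the assertion differ by a handful of entries.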