Suggestion: Use gensim to load fastText word vectors
Hi Pieter,
It might make sense to move the vector-loading code from the fastText module to gensim, because gensim can load just the word vectors (.vec file) instead of the full model (.bin file). This would reduce both initialization time and memory usage. I have tried it, and it works quite well for my use case.
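For concreteness, the change I have in mind is roughly the sketch below (the path and the lookup word are placeholders, not code from this repo):

```python
from gensim.models import KeyedVectors

# Load only the pre-computed vectors from the plain-text .vec file;
# the subword weights stored in the .bin file are never read.
vectors = KeyedVectors.load_word2vec_format("embeddings.vec", binary=False)

print(vectors["patient"])  # ordinary lookup, assuming "patient" is in the vocabulary
```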
Hey @seden,
Out of interest, what is the reduction in loading time you get when moving from .bin to .vec using Gensim? I was under the impression that loading from .vec was slow in general.
Thanks! Stéphan
In my case, the .bin file is 2.4GB and the .vec file is about 5MB, so the speedup I get from loading just the .vec file as keyed vectors in gensim is massive. This is because the whole backing fastText model is not loaded; the disadvantage is that embeddings can no longer be generated on the fly for misspellings (sketched below). I am running a few tests right now to compare the effect of the include_misspelling parameter on my dataset. With the gensim module and include_misspelling set to False, the grid search finished within an hour. With the fastText module and include_misspelling set to True, the grid search has been running for the past 4 hours and the first set of parameters is not finished yet. It's not a fair comparison, but it gives you a rough sense of the speed difference.
Edit: I see you were asking about loading time; I guess it went down from about 10 seconds to 1 second. I have a really old laptop hard disk, so the difference on a more modern SSD is likely to be smaller.
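To make the misspelling trade-off concrete, here is roughly what the difference looks like (paths are placeholders, and the module name depends on which fastText bindings you have installed):

```python
from gensim.models import KeyedVectors
import fastText  # pypi bindings; the module may be named `fasttext` in other versions

vectors = KeyedVectors.load_word2vec_format("embeddings.vec", binary=False)
model = fastText.load_model("model.bin")

word = "hemorrage"  # a misspelling, so not in the .vec vocabulary
print(word in vectors)              # False: keyed vectors have no fallback
print(model.get_word_vector(word))  # fastText composes a vector from character n-grams
```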
Hi Seden,
What is the size of your vector vocabulary? It must be very small for the .vec file to be only 5MB. For reference, the .vec file of my embeddings with a vocabulary size of 500K is 1GB. Can you run `wc -l [.vec file]` to check the number of words in the vocabulary?
In any case, you can open a pull request with your changes, and I can check the difference in speed for a large reference vocabulary.
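If it's easier, you can also check the count from Python once gensim has loaded the vectors (a one-liner sketch, assuming gensim 3.x and a placeholder path):

```python
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("embeddings.vec", binary=False)
# One less than `wc -l`, since the .vec file starts with a "count dimension" header line.
print(len(vectors.vocab))
```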
My vocabulary is actually very small, only about 2000 words for a very narrowly targeted application. Maybe that explains why I'm seeing such great speedups compared to what you might be getting. So even though loading time might not improve much with gensim, I'm still seeing a significant improvement in throughput. I'll run a couple of tests tomorrow to see how many words per second I can process and update this issue; the rough timing setup is sketched below.
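Something like this is what I plan to run (file names are placeholders, and this is just my timing harness, not code from the repo):

```python
import time
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("embeddings.vec", binary=False)
tokens = open("test_tokens.txt").read().split()  # placeholder test data

start = time.perf_counter()
looked_up = [vectors[t] for t in tokens if t in vectors]  # plain keyed-vector lookups
elapsed = time.perf_counter() - start

print(f"{len(tokens) / elapsed:.0f} tokens/sec "
      f"({len(looked_up)}/{len(tokens)} in vocabulary)")
```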
Hi @seden, could you do a pull request or share your code? I'm really interested.
Hi guys, sorry, I got sidetracked this week, plus I have a few things to clarify. One is that I modified the code from this repository to be compatible with the latest fastText library from pypi; it is possible that this was also causing some issues. In any case, I have a compromise solution right now which uses gensim to load the word embeddings for everything except the misspellings, and the fastText model to generate the embeddings for the misspellings. I will clean up the code and share it tomorrow.
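The core of it is roughly this (paths and the helper name are placeholders; the actual code differs in the details):

```python
from gensim.models import KeyedVectors
import fastText  # pypi bindings; module name varies between fastText versions

vectors = KeyedVectors.load_word2vec_format("embeddings.vec", binary=False)
model = fastText.load_model("model.bin")  # still loaded, but only consulted for OOV words

def embed(word):
    # Known words take the fast path through the keyed vectors.
    if word in vectors:
        return vectors[word]
    # Misspellings fall back to fastText's character n-gram composition.
    return model.get_word_vector(word)
```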
Ok, the latest commit on https://github.com/seden/clinspell now implements the part-gensim/part-fastText method that I described before. I'm seeing a massive difference in performance, but that could easily be because of the fastText version I'm using (0.8.22). I'm not able to install 0.7.6 for some reason, so I just ported the code to the latest one.