Problem with get_data()

Open meaningfromdata opened this issue 5 years ago • 2 comments

I'm trying to work through the CNN code on p. 232 of NLPIA and the get_data() function is getting hung up. The pip install of nlpia seemed to be fine.

Here's the offending line (changing the limit setting doesn't seem to change anything; I have gone as low as 5000): word_vectors = get_data('w2v', limit=50000)

I also see this output the first time I run it: 2019-11-13 14:09:23,227 WARNING:nlpia.constants:107: Starting logger in nlpia.constants...

I'm running Ubuntu 16.04 and using the Spyder IDE. Any suggestions?

meaningfromdata avatar Nov 13 '19 16:11 meaningfromdata

The warning is not a bug, just a bit too verbose. We've gotten rid of it in the latest release. Unfortunately, the word2vec file provided by Google is compressed in a way that doesn't allow a partial download, so the "hangup" may be the download from Dropbox, where we stored the w2v file. You'll need a machine with enough disk space and internet bandwidth to download the entire file. The limit arg only reduces the amount of RAM consumed. It's implemented within the gensim KeyedVectors class and we just pass it through, so we can't control how it works or whether it effectively limits RAM use within the gensim code. You may need a machine with more RAM in order to experiment with CNNs and NLP.
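To illustrate why limit helps with RAM but not with the download, here is a minimal sketch of a reader for the word2vec binary layout that stops after the first limit vectors. This is not gensim's or nlpia's implementation; the helper names write_word2vec_binary and read_word2vec_binary and the tiny demo file are made up for illustration. In gensim itself, the corresponding argument is the limit parameter of KeyedVectors.load_word2vec_format.

```python
import os
import struct
import tempfile


def write_word2vec_binary(path, vectors):
    """Write {word: [floats]} in the word2vec binary layout (demo helper only)."""
    dim = len(next(iter(vectors.values())))
    with open(path, "wb") as f:
        # Header line: "<vocab_size> <vector_size>\n"
        f.write(f"{len(vectors)} {dim}\n".encode("utf-8"))
        for word, vec in vectors.items():
            f.write(word.encode("utf-8") + b" ")
            f.write(struct.pack(f"<{dim}f", *vec))  # little-endian float32 values
            f.write(b"\n")


def read_word2vec_binary(path, limit=None):
    """Read at most `limit` vectors from the front of a word2vec binary file."""
    result = {}
    with open(path, "rb") as f:
        vocab_size, dim = (int(x) for x in f.readline().split())
        count = vocab_size if limit is None else min(limit, vocab_size)
        for _ in range(count):
            # The word is every byte up to the next space; skip separator newlines.
            chars = bytearray()
            while True:
                ch = f.read(1)
                if ch == b" ":
                    break
                if ch == b"":
                    raise EOFError("unexpected end of file")
                if ch != b"\n":
                    chars.extend(ch)
            result[chars.decode("utf-8")] = struct.unpack(f"<{dim}f", f.read(4 * dim))
    return result


# Hypothetical three-word file standing in for the real w2v download.
demo_path = os.path.join(tempfile.mkdtemp(), "tiny_w2v.bin")
write_word2vec_binary(demo_path, {"the": [0.5, 1.0], "cat": [1.5, 2.0], "sat": [2.5, 3.0]})

subset = read_word2vec_binary(demo_path, limit=2)  # only two vectors held in memory
file_size = os.path.getsize(demo_path)             # the whole file is still on disk
```

The reader stops early, so memory use scales with limit, but the complete file still has to exist on disk before reading starts. That is why lowering limit would not shorten a slow download.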

hobson avatar Dec 01 '19 05:12 hobson

If you use Anaconda, you will be able to install nlpia in a Python 3.6 environment. It has not been tested on Python 3.7, and this may be why it is hanging up on you. In Python 3.7 the re package seems to have a problem with the regular expressions we use to change the filenames during decompression. I'll check it and make sure there's not a bug in get_data.
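A quick way to see whether the interpreter you're running matches the tested version is a check like the sketch below. The constant TESTED_PYTHON and the function is_tested_python are hypothetical names for illustration; the (3, 6) value comes from this comment, not from anything nlpia exposes.

```python
import sys

# Assumption taken from the comment above, not read from the nlpia package.
TESTED_PYTHON = (3, 6)


def is_tested_python(version_info=None):
    """Return True if the major.minor version matches the tested one."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == TESTED_PYTHON


if not is_tested_python():
    print("Running Python %d.%d; nlpia was tested on %d.%d"
          % (sys.version_info[0], sys.version_info[1], *TESTED_PYTHON))
```

In Spyder, run it in the same console that executes get_data, since the IDE's interpreter may differ from the one in your terminal.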

hobson avatar Dec 01 '19 06:12 hobson