node-word2vec
How to successfully load the GoogleNews-vectors-negative300 model?
Hi Philipp, I downloaded the file GoogleNews-vectors-negative300.bin.gz from https://code.google.com/p/word2vec/
```
> w2v = require('word2vec');
{ word2vec: [Function: word2vec],
  word2phrase: [Function: word2phrase],
  loadModel: [Function: loadModel],
  WordVector: [Function: WordVector] }
> w2v.loadModel("/home/marco/crawlscrape/bashUtilitiesDir/GoogleNews-vectors-negative300.bin", function(err, model) {
... console.log(model);
... });
undefined
TypeError: Cannot read property 'length' of undefined
    at /home/marco/node_modules/word2vec/lib/model.js:408:30
    at FSReqWrap.wrapper [as oncomplete]
> w2v.loadModel("/home/marco/crawlscrape/bashUtilitiesDir/GoogleNews-vectors-negative300.bin", function(err, model) {
... console.log(model);
... });
undefined
TypeError: undefined is not a function
    at readOne (/home/marco/node_modules/word2vec/lib/model.js:433:55)
    at FSReqWrap.wrapper [as oncomplete]
```
What do I have to do in order to successfully load the GoogleNews-vectors-negative300 model?
Looking forward to your kind help. Marco
I tried doing this last week -- I'm pretty sure that it doesn't accept trained models in the binary (.bin) format, only in text format. While it's possible to convert the binary format to text, the resulting model is so big that it caused Node.js to run out of memory while consuming it. (This is independent of the machine's RAM; it comes from the heap limit Node.js places on a single process.)
Hi. If the only accepted format is the text format, and the model converted from GoogleNews-vectors-negative300.bin is so big that it causes Node.js to run out of memory while consuming it, then this module, despite being potentially very useful in many situations, cannot currently be deployed and used. The best option would be, as I do with a Python module, to load the compressed file GoogleNews-vectors-negative300.bin.gz directly, in order to speed up loading and save some memory (a scarce resource, even on powerful machines). What do you think, Philipp?
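For what it's worth, a minimal sketch of that idea in Node.js, assuming the model is in the line-oriented text format (the binary .bin.gz would need a byte-level parser instead; the filename is illustrative):

```js
// Sketch: gunzip and read a text-format model line by line, so the
// decompressed file never has to exist on disk or fully in memory.
var fs = require('fs');
var zlib = require('zlib');
var readline = require('readline');

var input = fs.createReadStream('GoogleNews-vectors-negative300.txt.gz')
    .pipe(zlib.createGunzip());

readline.createInterface({ input: input }).on('line', function(line) {
    // first line: "<vocabSize> <dimensions>"; the rest: "<word> <v1> ... <v300>"
    var parts = line.split(' ');
    // process one word vector at a time here, without accumulating them all
});
```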
I don't have a good Internet connection right now (using my phone as a router), but will look into this later this afternoon.
Okay, I made some small changes to the code. Could you please clone the GitHub repo, run `npm install`, and try again? I cannot check myself, as I do not have the corpus available right now.
Hi Philipp, please tell me what I'm doing wrong...
```
marco@pc:~/node_modules$ git clone https://github.com/Planeshifter/node-word2vec.git
Cloning into 'node-word2vec'...
remote: Counting objects: 329, done.
remote: Total 329 (delta 0), reused 0 (delta 0), pack-reused 329
Receiving objects: 100% (329/329), 283.85 KiB | 355.00 KiB/s, done.
Resolving deltas: 100% (175/175), done.
Checking connectivity... done.
marco@pc:~/node_modules$ cd node-word2vec
marco@pc:~/node_modules/node-word2vec$ ls -a
.  ..  data  .editorconfig  examples  .git  .gitignore  .jshintignore  .jshintrc  lib  LICENSE  .npmignore  package.json  README.md  src  test  .travis.yml
```

```
sudo npm install
npm WARN package.json [email protected] No README data
npm WARN package.json [email protected] No bin file found at ./bin/http-server
npm WARN cannot run in wd [email protected] make --directory=src (wd=/home/marco/node_modules/node-word2vec)
```
It seems that one needs to set the flag `unsafe-perm=true` to run npm scripts as root, so your `sudo` was causing this issue. I pushed a small fix so that your code should work now. Could you try again? Thanks,
Philipp
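For reference, if the flag is ever needed, it goes to npm as an option rather than as an argument after `install`:

```sh
# pass unsafe-perm as a flag to npm itself...
sudo npm install --unsafe-perm=true
# ...or set it once in the npm configuration
npm config set unsafe-perm true
```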
```
marco@pc:~/node_modules$ rm -rf node-word2vec
marco@pc:~/node_modules$ git clone https://github.com/Planeshifter/node-word2vec.git
Cloning into 'node-word2vec'...
remote: Counting objects: 333, done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 333 (delta 0), reused 0 (delta 0), pack-reused 329
Receiving objects: 100% (333/333), 285.03 KiB | 0 bytes/s, done.
Resolving deltas: 100% (175/175), done.
Checking connectivity... done.
```

```
marco@pc:~$ sudo npm install
[sudo] password for marco:
npm WARN package.json [email protected] No README data
npm WARN package.json [email protected] No bin file found at ./bin/http-server
npm WARN cannot run in wd [email protected] make --directory=src (wd=/home/marco/node_modules/node-word2vec)
marco@pc:~$ sudo npm install unsafe-perm=true
npm WARN package.json [email protected] No README data
npm WARN package.json [email protected] No bin file found at ./bin/http-server
npm ERR! addLocal Could not install /home/marco/unsafe-perm=true
npm ERR! Linux 3.13.0-32-generic
npm ERR! argv "/usr/bin/node" "/usr/bin/npm" "install" "unsafe-perm=true"
npm ERR! node v0.12.7
npm ERR! npm v2.11.3
npm ERR! path /home/marco/unsafe-perm=true
npm ERR! code ENOENT
npm ERR! errno -2
npm ERR! enoent ENOENT, open '/home/marco/unsafe-perm=true'
npm ERR! enoent This is most likely not a problem with npm itself
npm ERR! enoent and is related to npm not being able to find a file.
npm ERR! enoent
npm ERR! Please include the following file with any support request:
npm ERR!     /home/marco/npm-debug.log
```
I'm traveling but I'll definitely give this a shot this weekend.
I did this:

```
npm install
npm WARN package.json [email protected] No README data
npm WARN package.json [email protected] No bin file found at ./bin/http-server

> [email protected] postinstall /home/marco/node_modules/node-word2vec
> make --directory=src

make: Entering directory '/home/marco/node_modules/node-word2vec/src'
gcc word2vec.c -o word2vec -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
gcc word2phrase.c -o word2phrase -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
gcc distance.c -o distance -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
distance.c: In function ‘main’:
distance.c:31:8: warning: unused variable ‘ch’ [-Wunused-variable]
   char ch;
        ^
gcc word-analogy.c -o word-analogy -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
word-analogy.c: In function ‘main’:
word-analogy.c:31:8: warning: unused variable ‘ch’ [-Wunused-variable]
   char ch;
        ^
gcc compute-accuracy.c -o compute-accuracy -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
compute-accuracy.c: In function ‘main’:
compute-accuracy.c:28:109: warning: unused variable ‘ch’ [-Wunused-variable]
 char st1[max_size], st2[max_size], st3[max_size], st4[max_size], bestw[N][max_size], file_name[max_size], ch;
                                                                                                           ^
chmod +x *.sh
make: Leaving directory '/home/marco/node_modules/node-word2vec/src'
marco@pc:~$ node
> w2v = require('word2vec');
Error: Cannot find module 'word2vec'
    at Function.Module._resolveFilename (module.js:336:15)
    at Function.Module._load (module.js:278:25)
    at Module.require (module.js:365:17)
    at require (module.js:384:17)
    at repl:1:7
    at REPLServer.defaultEval (repl.js:132:27)
    at bound (domain.js:254:14)
    at REPLServer.runBound [as eval] (repl.js:279:12)
    at REPLServer.emit (events.js:107:17)
```

```
npm install unsafe-perm=true
npm WARN package.json [email protected] No README data
npm WARN package.json [email protected] No bin file found at ./bin/http-server
npm ERR! addLocal Could not install /home/marco/unsafe-perm=true
npm ERR! Linux 3.13.0-32-generic
npm ERR! argv "/usr/bin/node" "/usr/bin/npm" "install" "unsafe-perm=true"
npm ERR! node v0.12.7
npm ERR! npm v2.11.3
npm ERR! path /home/marco/unsafe-perm=true
npm ERR! code ENOENT
npm ERR! errno -2
npm ERR! enoent ENOENT, open '/home/marco/unsafe-perm=true'
npm ERR! enoent This is most likely not a problem with npm itself
npm ERR! enoent and is related to npm not being able to find a file.
npm ERR! enoent
npm ERR! Please include the following file with any support request:
npm ERR!     /home/marco/npm-debug.log
```

What do I have to do, Philipp?
Hmm, solving this problem seems to be more complicated than I had hoped. If you want to have a look yourself, the code that reads binary files is located in the function `readBinary` in `model.js`. This code was generously contributed by @oskarflordal and was not written by me. One of the errors @marcoippolito ran into was caused by the fact that as of node v0.12, typed arrays no longer possess a `slice` method. And somehow, when extracted from the binary data of the GoogleNews data set, all words are missing their first characters, which is the likely cause of the `TypeError: Cannot read property 'length' of undefined` error.
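A minimal sketch of that incompatibility (the byte values are illustrative, not the package's code):

```js
// On node v0.10 the nonstandard .slice() on typed arrays worked; on v0.12
// it is gone, which surfaces as "TypeError: undefined is not a function".
var bytes = new Uint8Array([71, 111, 111, 103, 108, 101]); // "Google"

// var word = bytes.slice(0, 6);   // throws on node v0.12
var word = bytes.subarray(0, 6);   // .subarray() works on both versions

console.log(String.fromCharCode.apply(null, word)); // -> Google
```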
After running a bunch of tests, it seems that right now the code does not correctly read the vector values from the binary data either. Oskar, if you find the time, could you have a look?
I need to look into this when I have more time. I fear this won't be resolved in a short amount of time, unfortunately.
Just published a new version of the package to npm with some changes in the `readBinary` function. Could you try installing it as usual via `npm install word2vec` and then loading the GoogleNews corpus again? My laptop does not handle the large file size of 3.5 GB, so I cannot check whether the problem is solved. Thanks!
```
git clone https://github.com/Planeshifter/node-word2vec.git
Cloning into 'node-word2vec'...
remote: Counting objects: 349, done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 349 (delta 9), reused 0 (delta 0), pack-reused 325
Receiving objects: 100% (349/349), 294.95 KiB | 0 bytes/s, done.
Resolving deltas: 100% (181/181), done.
Checking connectivity... done.
```

```
sudo npm install
[sudo] password for marco:
npm WARN package.json [email protected] No README data
npm WARN package.json [email protected] No bin file found at ./bin/http-server
npm WARN cannot run in wd [email protected] make --directory=src (wd=/home/marco/node_modules/node-word2vec)
marco@pc:~$ npm install
npm WARN package.json [email protected] No README data
npm WARN package.json [email protected] No bin file found at ./bin/http-server

> [email protected] postinstall /home/marco/node_modules/node-word2vec
> make --directory=src

make: Entering directory '/home/marco/node_modules/node-word2vec/src'
gcc word2vec.c -o word2vec -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
gcc word2phrase.c -o word2phrase -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
gcc distance.c -o distance -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
distance.c: In function ‘main’:
distance.c:31:8: warning: unused variable ‘ch’ [-Wunused-variable]
   char ch;
        ^
gcc word-analogy.c -o word-analogy -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
word-analogy.c: In function ‘main’:
word-analogy.c:31:8: warning: unused variable ‘ch’ [-Wunused-variable]
   char ch;
        ^
gcc compute-accuracy.c -o compute-accuracy -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
compute-accuracy.c: In function ‘main’:
compute-accuracy.c:28:109: warning: unused variable ‘ch’ [-Wunused-variable]
 char st1[max_size], st2[max_size], st3[max_size], st4[max_size], bestw[N][max_size], file_name[max_size], ch;
                                                                                                           ^
chmod +x *.sh
make: Leaving directory '/home/marco/node_modules/node-word2vec/src'
```
Hi Philipp, did you find anything related? If you want me to test something, I can give it a try. Let me know. Marco
Hi Philipp and hi Oskar, could the indications here, https://bassnutz.wordpress.com/2012/09/09/processing-large-files-with-nodejs/, be of help for importing GoogleNews-vectors-negative300.bin.gz?
Hi Marco, sorry for the delayed response, I have been busy. Will look at your link shortly. Best, Philipp
P.S. Did you try installing the package with `npm install word2vec`?
Hi Philipp, tomorrow I'm available the whole day to help. Let me know. Marco
Sorry for the late reply (and the bugs in `readBinary` :/). Anyway, it seems I incorrectly set the maximum string length to 50 for some reason (when it should be 100). Will fix. I do run out of memory though (this and the load times were the reasons I gave up on using node-word2vec for my particular problem shortly after submitting the patch). I can give it a quick check to see if there is something obvious that can be done.
I removed an allocation to save a lot of memory (https://github.com/oskarflordal/node-word2vec/tree/strlenfix), but I still run out when trying to read gnews.bin (after 25 minutes on my machine).
@oskarflordal The fact that I was able to load the bin file instead of the txt file means your pull request #6 fixed the strlen issue, so thank you!! But now we run into another wall: I ran out of memory on my giant Amazon instance. Or rather, NodeJS ran out of memory at about 4GB of usage.
As I suspected, the core problem here is not the memory of the machine, but that Node.js has a maximum amount of memory it can use in a single worker: by default it's 512 MB, but I ran the branch above at the theoretical 64-bit maximum of 4096 MB using the `--max_old_space_size` flag. See here for more info.
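In other words, something along these lines (the script name is a placeholder):

```sh
# raise V8's old-space heap ceiling to 4 GB for this one run
node --max_old_space_size=4096 load-model.js
```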
The Google News bin file is 3.4GB, very near that theoretical maximum, which would explain why a single worker chokes trying to process it. To process large files, the code would have to be rewritten to stream the data from disk and process it in chunks, and/or farm it out to multiple workers. Unfortunately I don't have any experience with this myself...
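For anyone who wants to experiment, here is a rough sketch of that chunked approach. It is not the package's actual `readBinary`; it just follows the word2vec binary layout (a header line, then per record a space-terminated word followed by `size` little-endian 4-byte floats):

```js
// Illustrative sketch: stream the word2vec .bin format, keeping only the
// bytes of the unfinished record in memory instead of the whole 3.4GB file.
var fs = require('fs');

var pending = Buffer.alloc(0);
var header = null; // { words: vocabulary size, size: vector dimensions }

function drain(onVector) {
    if (!header) {
        var nl = pending.indexOf(0x0a); // header line: "<words> <size>\n"
        if (nl === -1) return;
        var parts = pending.toString('ascii', 0, nl).split(' ');
        header = { words: +parts[0], size: +parts[1] };
        pending = pending.slice(nl + 1);
    }
    for (;;) {
        // tolerate the newline that terminates the previous record
        while (pending.length && pending[0] === 0x0a) pending = pending.slice(1);
        var sp = pending.indexOf(0x20);      // the word ends at a space
        if (sp === -1) return;
        var need = sp + 1 + 4 * header.size; // word + space + floats
        if (pending.length < need) return;   // wait for the next chunk
        var word = pending.toString('utf8', 0, sp);
        var vector = new Float32Array(header.size);
        for (var i = 0; i < header.size; i++) {
            vector[i] = pending.readFloatLE(sp + 1 + 4 * i);
        }
        onVector(word, vector);
        pending = pending.slice(need);
    }
}

fs.createReadStream('GoogleNews-vectors-negative300.bin')
    .on('data', function(chunk) {
        pending = Buffer.concat([pending, chunk]);
        drain(function(word, vector) {
            // handle one record at a time; do not accumulate them all
        });
    })
    .on('end', function() {
        console.log('done');
    });
```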
My question is: how do you divide the binary (or .gz) GoogleNews file into N-1 smaller files (N = number of cores), so it can be processed in parallel by N-1 workers?
I guess your options are:
- Reconsider whether this is really the way you want to solve your problem. Could you push the vectors into a database instead, and perhaps calculate the closest vectors to each word offline?
- Redo the word2vec binary format so that there are pointers telling you where to find a word at a certain offset in the vocabulary. Currently each record consists of a string, whose length you only discover by parsing for whitespace, followed by a set of floats, so there is no way of jumping straight to entry X (a small sketch of the read side of such an index follows this list).
- Run a worker until you are close to running out of memory, then start another one at the point where you stopped (I don't normally work with node or javascript, so I have no idea whether this is feasible or sensible).
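To illustrate the second option: assuming a one-off streaming pass has built an `index` mapping each word to the byte offset of its float block (building the index is not shown), a single vector can then be fetched with a positional read:

```js
// Illustrative: fetch one vector by byte offset instead of loading the model.
var fs = require('fs');

function vectorAt(fd, offset, size) {
    var buf = Buffer.alloc(4 * size); // `size` little-endian 4-byte floats
    fs.readSync(fd, buf, 0, buf.length, offset);
    var vector = new Float32Array(size);
    for (var i = 0; i < size; i++) vector[i] = buf.readFloatLE(4 * i);
    return vector;
}

// usage (assuming `index` was built in a prior streaming pass):
// var fd = fs.openSync('GoogleNews-vectors-negative300.bin', 'r');
// var king = vectorAt(fd, index['king'], 300);
```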
My eventual solution was to shell out the actual work to a Python script and then consume the output back into my Node script... sigh
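The shape of that workaround, with a hypothetical `nearest.py` that prints JSON to stdout (not dariusk's actual script):

```js
// Sketch: delegate the heavy lifting to a Python script and read its stdout.
var execFile = require('child_process').execFile;

execFile('python', ['nearest.py', 'king'], function(err, stdout, stderr) {
    if (err) throw err;
    var neighbours = JSON.parse(stdout); // assumes the script prints JSON
    console.log(neighbours);
});
```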
To solve the problem, I'm trying to use Node.js's async capabilities. It's not that easy and straightforward, but I think it is the right path to follow. I will be back on Monday.
@dariusk How did you convert GoogleNews-vectors-negative300.bin into a txt file? Which bash command did you use? A few days ago I used the "strings" bash command and it worked. Now it doesn't, because I get only the words without the numerical vectors.
@marcoippolito I made this modification to the tool's source code and recompiled it.
Thanks @dariusk.
In case anyone else ends up here: I likewise was looking for a way to process a large binary model without memory ceiling issues, and finally just wrote a tiny function to stream the model to any destination: https://github.com/jasonphillips/word2vec-stream
I also tested and found (sorry to the package owner) that https://github.com/LeeXun/word2vector/ is much faster (~14 sec) at loading and processing than this package (~30 sec) on my machine.