node-word2vec
How to successfully load the GoogleNews-vectors-negative300 model?
Hi Philipp, I downloaded the file GoogleNews-vectors-negative300.bin.gz from https://code.google.com/p/word2vec/
```
> w2v = require('word2vec');
{ word2vec: [Function: word2vec],
  word2phrase: [Function: word2phrase],
  loadModel: [Function: loadModel],
  WordVector: [Function: WordVector] }
> w2v.loadModel("/home/marco/crawlscrape/bashUtilitiesDir/GoogleNews-vectors-negative300.bin", function(err, model) {
... console.log(model);
... });
undefined
TypeError: Cannot read property 'length' of undefined
    at /home/marco/node_modules/word2vec/lib/model.js:408:30
    at FSReqWrap.wrapper [as oncomplete]
> w2v.loadModel("/home/marco/crawlscrape/bashUtilitiesDir/GoogleNews-vectors-negative300.bin", function(err, model) {
... console.log(model);
... });
undefined
TypeError: undefined is not a function
    at readOne (/home/marco/node_modules/word2vec/lib/model.js:433:55)
    at FSReqWrap.wrapper [as oncomplete]
```
What do I have to do in order to successfully load the GoogleNews-vectors-negative300 model?
Looking forward to your kind help. Marco
I tried doing this last week -- I'm pretty sure that it doesn't accept trained models in the binary (.bin) format, only in text format. While it's possible to convert the binary format to text, the resulting model is so big that it caused Node.js to run out of memory while consuming it. (This is independent of the machine's RAM; it comes from the heap limit Node.js places on a single process.)
Hi. If the only accepted format is the text format, and the model converted from GoogleNews-vectors-negative300.bin is so big that it causes Node.js to run out of memory while consuming it, then this module, despite being potentially very useful in many situations, cannot currently be deployed and used. The best option would be, as I do with a Python module, to load the compressed file GoogleNews-vectors-negative300.bin.gz directly, in order to speed up loading and save some memory (a scarce resource, even on powerful machines). What do you think, Philipp?
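For what it's worth, a minimal sketch of that idea in Node.js, assuming the model is in the line-oriented text format (the binary .bin.gz would need a byte-level parser instead; the filename is illustrative):

```js
// Sketch: gunzip and read a text-format model line by line, so the
// decompressed file never has to exist on disk or fully in memory.
var fs = require('fs');
var zlib = require('zlib');
var readline = require('readline');

var input = fs.createReadStream('GoogleNews-vectors-negative300.txt.gz')
    .pipe(zlib.createGunzip());

readline.createInterface({ input: input }).on('line', function(line) {
    // first line: "<vocabSize> <dimensions>"; the rest: "<word> <v1> ... <v300>"
    var parts = line.split(' ');
    // process one word vector at a time here, without accumulating them all
});
```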
I don't have a good Internet connection right now (using my phone as a router), but will look into this later this afternoon.
Okay, I made some small changes to the code. Could you please clone the GitHub repo, run `npm install`, and try again? I cannot check myself, as I do not have the corpus available right now.
Hi Philipp, please tell me what I'm doing wrong...
```
marco@pc:~/node_modules$ git clone https://github.com/Planeshifter/node-word2vec.git
Cloning into 'node-word2vec'...
remote: Counting objects: 329, done.
remote: Total 329 (delta 0), reused 0 (delta 0), pack-reused 329
Receiving objects: 100% (329/329), 283.85 KiB | 355.00 KiB/s, done.
Resolving deltas: 100% (175/175), done.
Checking connectivity... done.
marco@pc:~/node_modules$ cd node-word2vec
marco@pc:~/node_modules/node-word2vec$ ls -a
.  ..  data  .editorconfig  examples  .git  .gitignore  .jshintignore  .jshintrc  lib  LICENSE  .npmignore  package.json  README.md  src  test  .travis.yml
```

```
sudo npm install
npm WARN package.json [email protected] No README data
npm WARN package.json [email protected] No bin file found at ./bin/http-server
npm WARN cannot run in wd [email protected] make --directory=src (wd=/home/marco/node_modules/node-word2vec)
```
It seems that one needs to set the flag `unsafe-perm=true` to run npm scripts as root, so your `sudo` was causing this issue. I pushed a small fix so that your code should work now. Could you try again? Thanks,
Philipp
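For reference, if the flag is ever needed, it goes to npm as an option rather than as an argument after `install`:

```sh
# pass unsafe-perm as a flag to npm itself...
sudo npm install --unsafe-perm=true
# ...or set it once in the npm configuration
npm config set unsafe-perm true
```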
```
marco@pc:~/node_modules$ rm -rf node-word2vec
marco@pc:~/node_modules$ git clone https://github.com/Planeshifter/node-word2vec.git
Cloning into 'node-word2vec'...
remote: Counting objects: 333, done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 333 (delta 0), reused 0 (delta 0), pack-reused 329
Receiving objects: 100% (333/333), 285.03 KiB | 0 bytes/s, done.
Resolving deltas: 100% (175/175), done.
Checking connectivity... done.
```

```
marco@pc:~$ sudo npm install
[sudo] password for marco:
npm WARN package.json [email protected] No README data
npm WARN package.json [email protected] No bin file found at ./bin/http-server
npm WARN cannot run in wd [email protected] make --directory=src (wd=/home/marco/node_modules/node-word2vec)
marco@pc:~$ sudo npm install unsafe-perm=true
npm WARN package.json [email protected] No README data
npm WARN package.json [email protected] No bin file found at ./bin/http-server
npm ERR! addLocal Could not install /home/marco/unsafe-perm=true
npm ERR! Linux 3.13.0-32-generic
npm ERR! argv "/usr/bin/node" "/usr/bin/npm" "install" "unsafe-perm=true"
npm ERR! node v0.12.7
npm ERR! npm v2.11.3
npm ERR! path /home/marco/unsafe-perm=true
npm ERR! code ENOENT
npm ERR! errno -2
npm ERR! enoent ENOENT, open '/home/marco/unsafe-perm=true'
npm ERR! enoent This is most likely not a problem with npm itself
npm ERR! enoent and is related to npm not being able to find a file.
npm ERR! enoent
npm ERR! Please include the following file with any support request:
npm ERR!     /home/marco/npm-debug.log
```
I'm traveling but I'll definitely give this a shot this weekend.
I did this:

```
npm install
npm WARN package.json [email protected] No README data
npm WARN package.json [email protected] No bin file found at ./bin/http-server

> [email protected] postinstall /home/marco/node_modules/node-word2vec
> make --directory=src

make: Entering directory '/home/marco/node_modules/node-word2vec/src'
gcc word2vec.c -o word2vec -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
gcc word2phrase.c -o word2phrase -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
gcc distance.c -o distance -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
distance.c: In function ‘main’:
distance.c:31:8: warning: unused variable ‘ch’ [-Wunused-variable]
   char ch;
        ^
gcc word-analogy.c -o word-analogy -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
word-analogy.c: In function ‘main’:
word-analogy.c:31:8: warning: unused variable ‘ch’ [-Wunused-variable]
   char ch;
        ^
gcc compute-accuracy.c -o compute-accuracy -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
compute-accuracy.c: In function ‘main’:
compute-accuracy.c:28:109: warning: unused variable ‘ch’ [-Wunused-variable]
 char st1[max_size], st2[max_size], st3[max_size], st4[max_size], bestw[N][max_size], file_name[max_size], ch;
                                                                                                           ^
chmod +x *.sh
make: Leaving directory '/home/marco/node_modules/node-word2vec/src'
marco@pc:~$ node
> w2v = require('word2vec');
Error: Cannot find module 'word2vec'
    at Function.Module._resolveFilename (module.js:336:15)
    at Function.Module._load (module.js:278:25)
    at Module.require (module.js:365:17)
    at require (module.js:384:17)
    at repl:1:7
    at REPLServer.defaultEval (repl.js:132:27)
    at bound (domain.js:254:14)
    at REPLServer.runBound [as eval] (repl.js:279:12)
    at REPLServer.emit (events.js:107:17)
```

```
npm install unsafe-perm=true
npm WARN package.json [email protected] No README data
npm WARN package.json [email protected] No bin file found at ./bin/http-server
npm ERR! addLocal Could not install /home/marco/unsafe-perm=true
npm ERR! Linux 3.13.0-32-generic
npm ERR! argv "/usr/bin/node" "/usr/bin/npm" "install" "unsafe-perm=true"
npm ERR! node v0.12.7
npm ERR! npm v2.11.3
npm ERR! path /home/marco/unsafe-perm=true
npm ERR! code ENOENT
npm ERR! errno -2
npm ERR! enoent ENOENT, open '/home/marco/unsafe-perm=true'
npm ERR! enoent This is most likely not a problem with npm itself
npm ERR! enoent and is related to npm not being able to find a file.
npm ERR! enoent
npm ERR! Please include the following file with any support request:
npm ERR!     /home/marco/npm-debug.log
```

What do I have to do, Philipp?
Hmm, solving this problem seems to be more complicated than I had hoped. If you want to have a look yourself, the code that reads binary files is located in the function `readBinary` in `model.js`. This code was generously contributed by @oskarflordal and was not written by me. One of the errors @marcoippolito ran into was caused by the fact that as of node v0.12, typed arrays no longer possess a `slice` method. And somehow, when extracted from the binary data of the GoogleNews data set, all words are missing their first characters, which is the likely cause of the `TypeError: Cannot read property 'length' of undefined` error.
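A minimal sketch of that incompatibility (the byte values are illustrative, not the package's code):

```js
// On node v0.10 the nonstandard .slice() on typed arrays worked; on v0.12
// it is gone, which surfaces as "TypeError: undefined is not a function".
var bytes = new Uint8Array([71, 111, 111, 103, 108, 101]); // "Google"

// var word = bytes.slice(0, 6);   // throws on node v0.12
var word = bytes.subarray(0, 6);   // .subarray() works on both versions

console.log(String.fromCharCode.apply(null, word)); // -> Google
```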
After running a bunch of tests, it seems that right now the code does not correctly read the vector values from the binary data either. Oskar, if you find the time, could you have a look?
I need to look into this when I have more time. I fear this won't be resolved in a short amount of time, unfortunately.
Just published a new version of the package to npm with some changes in the `readBinary` function. Could you try installing it as usual via `npm install word2vec` and then loading the GoogleNews corpus again? My laptop does not handle the large file size of 3.5 GB, so I cannot check whether the problem is solved. Thanks!
```
git clone https://github.com/Planeshifter/node-word2vec.git
Cloning into 'node-word2vec'...
remote: Counting objects: 349, done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 349 (delta 9), reused 0 (delta 0), pack-reused 325
Receiving objects: 100% (349/349), 294.95 KiB | 0 bytes/s, done.
Resolving deltas: 100% (181/181), done.
Checking connectivity... done.
```

```
sudo npm install
[sudo] password for marco:
npm WARN package.json [email protected] No README data
npm WARN package.json [email protected] No bin file found at ./bin/http-server
npm WARN cannot run in wd [email protected] make --directory=src (wd=/home/marco/node_modules/node-word2vec)
marco@pc:~$ npm install
npm WARN package.json [email protected] No README data
npm WARN package.json [email protected] No bin file found at ./bin/http-server

> [email protected] postinstall /home/marco/node_modules/node-word2vec
> make --directory=src

make: Entering directory '/home/marco/node_modules/node-word2vec/src'
gcc word2vec.c -o word2vec -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
gcc word2phrase.c -o word2phrase -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
gcc distance.c -o distance -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
distance.c: In function ‘main’:
distance.c:31:8: warning: unused variable ‘ch’ [-Wunused-variable]
   char ch;
        ^
gcc word-analogy.c -o word-analogy -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
word-analogy.c: In function ‘main’:
word-analogy.c:31:8: warning: unused variable ‘ch’ [-Wunused-variable]
   char ch;
        ^
gcc compute-accuracy.c -o compute-accuracy -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
compute-accuracy.c: In function ‘main’:
compute-accuracy.c:28:109: warning: unused variable ‘ch’ [-Wunused-variable]
 char st1[max_size], st2[max_size], st3[max_size], st4[max_size], bestw[N][max_size], file_name[max_size], ch;
                                                                                                           ^
chmod +x *.sh
make: Leaving directory '/home/marco/node_modules/node-word2vec/src'
```
Hi Philipp, did you find anything related? If you want me to test something, I can give it a try. Let me know. Marco
Hi Philipp and hi Oskar, could the indications here, https://bassnutz.wordpress.com/2012/09/09/processing-large-files-with-nodejs/, be of help for importing GoogleNews-vectors-negative300.bin.gz?
Hi Marco, sorry for the delayed response, I have been busy. Will look at your link shortly. Best, Philipp
P.S. Did you try installing the package with `npm install word2vec`?
Hi Philipp, tomorrow I'm available the whole day to help. Let me know. Marco
Sorry for the late reply (and the bugs in `readBinary` :/). Anyway, it seems I incorrectly set the maximum string length to 50 for some reason (when it should be 100). Will fix. I do run out of memory though (this and the load times were the reasons I gave up on using node-word2vec for my particular problem shortly after submitting the patch). I can give it a quick check to see if there is something obvious that can be done.
I removed an allocation to save a lot of memory (https://github.com/oskarflordal/node-word2vec/tree/strlenfix), but I still run out when trying to read gnews.bin (after 25 minutes on my machine).
@oskarflordal The fact that I was able to load the bin file instead of the txt file means your pull request #6 fixed the strlen issue, so thank you!! But now we run into another wall: I ran out of memory on my giant Amazon instance. Or rather, NodeJS ran out of memory at about 4GB of usage.
As I suspected, the core problem here is not the memory of the machine, but that Node.js has a maximum amount of memory it can use in a single worker: by default it's 512 MB, but I ran the branch above at the theoretical 64-bit maximum of 4096 MB using the `--max_old_space_size` flag. See here for more info.
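In other words, something along these lines (the script name is a placeholder):

```sh
# raise V8's old-space heap ceiling to 4 GB for this one run
node --max_old_space_size=4096 load-model.js
```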
The Google News bin file is 3.4GB, very near that theoretical maximum, which would explain why a single worker chokes trying to process it. To process large files, the code would have to be rewritten to stream the data from disk and process it in chunks, and/or farm it out to multiple workers. Unfortunately I don't have any experience with this myself...
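For anyone who wants to experiment, here is a rough sketch of that chunked approach. It is not the package's actual `readBinary`; it just follows the word2vec binary layout (a header line, then per record a space-terminated word followed by `size` little-endian 4-byte floats):

```js
// Illustrative sketch: stream the word2vec .bin format, keeping only the
// bytes of the unfinished record in memory instead of the whole 3.4GB file.
var fs = require('fs');

var pending = Buffer.alloc(0);
var header = null; // { words: vocabulary size, size: vector dimensions }

function drain(onVector) {
    if (!header) {
        var nl = pending.indexOf(0x0a); // header line: "<words> <size>\n"
        if (nl === -1) return;
        var parts = pending.toString('ascii', 0, nl).split(' ');
        header = { words: +parts[0], size: +parts[1] };
        pending = pending.slice(nl + 1);
    }
    for (;;) {
        // tolerate the newline that terminates the previous record
        while (pending.length && pending[0] === 0x0a) pending = pending.slice(1);
        var sp = pending.indexOf(0x20);      // the word ends at a space
        if (sp === -1) return;
        var need = sp + 1 + 4 * header.size; // word + space + floats
        if (pending.length < need) return;   // wait for the next chunk
        var word = pending.toString('utf8', 0, sp);
        var vector = new Float32Array(header.size);
        for (var i = 0; i < header.size; i++) {
            vector[i] = pending.readFloatLE(sp + 1 + 4 * i);
        }
        onVector(word, vector);
        pending = pending.slice(need);
    }
}

fs.createReadStream('GoogleNews-vectors-negative300.bin')
    .on('data', function(chunk) {
        pending = Buffer.concat([pending, chunk]);
        drain(function(word, vector) {
            // handle one record at a time; do not accumulate them all
        });
    })
    .on('end', function() {
        console.log('done');
    });
```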
My question is: how do you divide the binary (or .gz) GoogleNews file into N-1 smaller files (N = number of cores), so it can be processed in parallel by N-1 workers?
I guess your options are:
- Reconsider whether this is really the way you want to solve your problem. Could you push the vectors into a database instead, and perhaps calculate the closest vectors to each word offline?
- Redo the word2vec binary format so that there are pointers telling you where to find a word at a certain offset in the vocabulary. Currently each record consists of a string, whose length you only discover by parsing for whitespace, followed by a set of floats, so there is no way of jumping straight to entry X (a small sketch of the read side of such an index follows this list).
- Run a worker until you are close to running out of memory, then start another one at the point where you stopped (I don't normally work with node or javascript, so I have no idea whether this is feasible or sensible).
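To illustrate the second option: assuming a one-off streaming pass has built an `index` mapping each word to the byte offset of its float block (building the index is not shown), a single vector can then be fetched with a positional read:

```js
// Illustrative: fetch one vector by byte offset instead of loading the model.
var fs = require('fs');

function vectorAt(fd, offset, size) {
    var buf = Buffer.alloc(4 * size); // `size` little-endian 4-byte floats
    fs.readSync(fd, buf, 0, buf.length, offset);
    var vector = new Float32Array(size);
    for (var i = 0; i < size; i++) vector[i] = buf.readFloatLE(4 * i);
    return vector;
}

// usage (assuming `index` was built in a prior streaming pass):
// var fd = fs.openSync('GoogleNews-vectors-negative300.bin', 'r');
// var king = vectorAt(fd, index['king'], 300);
```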
My eventual solution was to shell out the actual work to a Python script and then consume the output back into my Node script... sigh
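The shape of that workaround, with a hypothetical `nearest.py` that prints JSON to stdout (not dariusk's actual script):

```js
// Sketch: delegate the heavy lifting to a Python script and read its stdout.
var execFile = require('child_process').execFile;

execFile('python', ['nearest.py', 'king'], function(err, stdout, stderr) {
    if (err) throw err;
    var neighbours = JSON.parse(stdout); // assumes the script prints JSON
    console.log(neighbours);
});
```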
To solve the problem, I'm trying to use Node.js's async capabilities. It's not that easy and straightforward, but I think it is the right path to follow. I will be back on Monday.
@dariusk How did you convert GoogleNews-vectors-negative300.bin into a txt file? Which bash command did you use? A few days ago I used the "strings" bash command and it worked. Now it doesn't, because I get only the words without the numerical vectors.
@marcoippolito I made this modification to the tool's source code and recompiled it.
Thanks @dariusk.
In case anyone else ends up here: I likewise was looking for a way to process a large binary model without memory ceiling issues, and finally just wrote a tiny function to stream the model to any destination: https://github.com/jasonphillips/word2vec-stream
I also tested and found (sorry to the package owner) that https://github.com/LeeXun/word2vector/ is much faster (~14 sec) at loading and processing than this package (~30 sec) on my machine.