training-word2vec
How to train your own word2vec model for use with ml5.js
Training
Python Environment
Requirements
- Set up a Python environment with gensim installed. More detailed instructions here. You can also follow this video tutorial about Python virtualenv.
pip install -r requirements.txt
Train the model
- Clone this repository or download this python script
git clone https://github.com/ml5js/training-word2vec/
- The script supports training from a single text file or a directory of files. Create a text file or a folder of text files, then run train.py with the name of the file or folder.
Example:
python train.py file.txt
python train.py files/
- The script will output a vectors.txt and a vectors.json file. To specify a different output file name, use the additional -o argument.
python train.py data.txt -o output.json
- The output JSON file can now be used with the ml5.js word2vec examples.
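Before loading the output in the browser, it can help to sanity-check it from Python. The following is a minimal sketch, not part of train.py. It assumes the JSON file holds a top-level "vectors" object that maps each word to a list of numbers, so check that against your own output file first. The function names load_vectors, cosine_similarity, and nearest are illustrative.

```python
import json
import math


def load_vectors(path):
    """Read word vectors, assuming a {"vectors": {word: [floats]}} layout."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data["vectors"]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(vectors, word, n=5):
    """Return the n words most similar to `word`, as (word, score) pairs."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]
```

For example, nearest(load_vectors("vectors.json"), "cat") would list the five words closest to "cat", provided "cat" appears in the training text. If the neighbours look unrelated, the corpus is probably too small.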
Advanced tokenization
The default tokenizer is very basic. You can ask the script to use NLTK's tokenizer with the --tokenizer argument.
Additionally, the script can remove stop words with the --remove-stop-words flag.
python train.py files/ -t nltk --remove-stop-words
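To make the idea concrete, the sketch below shows what a very basic tokenizer with optional stop-word removal can look like. It is not the code used by train.py: the regular expression, the function name simple_tokenize, and the tiny STOP_WORDS set are all assumptions made for illustration. NLTK provides a far more complete tokenizer and stop-word list.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}

# Runs of lowercase letters, digits, or apostrophes count as one token.
TOKEN_PATTERN = re.compile(r"[a-z0-9']+")


def simple_tokenize(text, remove_stop_words=False):
    """Lowercase the text, split it into tokens, and optionally drop stop words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens
```

Removing stop words shrinks the vocabulary and keeps very frequent function words from dominating the context windows, at the cost of losing vectors for those words entirely.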