Word Embedding
Sample code for training Word2Vec and FastText using wiki corpus and their pretrained word embedding.
For technical details, please read my blog posts: Chinese version, English version.
Environment Setup
I tested the code with Python 3.9; it may work on other Python versions, but that is not guaranteed. Using Poetry to set up the environment is recommended.
Poetry (recommended)
pip install poetry
poetry install
Pip
virtualenv .venv -p python3
source .venv/bin/activate
pip install -r requirement.txt
Train Word Embedding on Latest Wikidump
poetry run python train.py --lang en --model word2vec --size 300 --output data/en_wiki_word2vec_300.txt
--lang: en for English, zh for Chinese
--model: word2vec or fasttext
--size: dimensionality of the trained word embedding
--output: path to save the trained word embedding
If you are using pip, please run:
python train.py --lang en --model word2vec --size 300 --output data/en_wiki_word2vec_300.txt
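Under the hood, a script like this typically boils down to gensim's WikiCorpus for streaming the dump plus its Word2Vec/FastText trainers. The sketch below is an assumption about the implementation, not the repo's actual train.py: the dump filename and the hyperparameters (window, min_count, workers) are illustrative, and Chinese word segmentation (e.g. with jieba) is omitted.

```python
# Minimal sketch of training a word embedding on a Wikipedia dump with gensim 4.x.
# Filenames and hyperparameters are assumptions, not the repo's actual values.
import argparse
from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec, FastText

class WikiSentences:
    """Restartable iterable over tokenized Wikipedia articles."""
    def __init__(self, dump_path):
        # dictionary={} skips building a gensim Dictionary, which we don't need here
        self.corpus = WikiCorpus(dump_path, dictionary={})
    def __iter__(self):
        # get_texts() yields each article as a list of tokens
        yield from self.corpus.get_texts()

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--lang', choices=['en', 'zh'], default='en')
    parser.add_argument('--model', choices=['word2vec', 'fasttext'], default='word2vec')
    parser.add_argument('--size', type=int, default=300)
    parser.add_argument('--output', required=True)
    args = parser.parse_args()

    # Hypothetical dump filename; the real script obtains the latest dump itself
    sentences = WikiSentences(f'{args.lang}wiki-latest-pages-articles.xml.bz2')
    model_cls = Word2Vec if args.model == 'word2vec' else FastText
    model = model_cls(sentences=sentences, vector_size=args.size,
                      window=5, min_count=5, workers=4)
    # Save in the plain-text word2vec format used by the commands above
    model.wv.save_word2vec_format(args.output, binary=False)

if __name__ == '__main__':
    main()
```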
Visualize the Trained Embedding
The visualization supports only Chinese and English.
poetry run python demo.py --lang en --output data/en_wiki_word2vec_300.txt
--lang: en for English, zh for Chinese
--output: path to the trained word embedding
If you are using pip, please run:
python demo.py --lang en --output data/en_wiki_word2vec_300.txt
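The kind of visualization demo.py produces can be reproduced with a few lines: load the saved vectors, project them to 2D, and plot. This is a hedged sketch, not the repo's actual demo.py; the word list and the choice of PCA for dimensionality reduction are illustrative assumptions.

```python
# Project a few word vectors to 2D with PCA and plot them with labels.
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA

wv = KeyedVectors.load_word2vec_format('data/en_wiki_word2vec_300.txt', binary=False)
words = ['king', 'queen', 'man', 'woman', 'paris', 'london']  # illustrative sample
vectors = [wv[w] for w in words]

coords = PCA(n_components=2).fit_transform(vectors)
for word, (x, y) in zip(words, coords):
    plt.scatter(x, y)
    plt.annotate(word, (x, y))
plt.show()
```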
Pretrained Word Embedding
| | Chinese | English |
|---|---|---|
| Word2Vec | Download | Download |
| FastText | Download | Download |
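Once downloaded, an embedding in this plain-text word2vec format can be loaded with gensim's KeyedVectors; the filename below is an assumption matching the training example above.

```python
# Load a pretrained embedding and query nearest neighbors by cosine similarity.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format('data/en_wiki_word2vec_300.txt', binary=False)
print(wv.most_similar('computer', topn=5))
```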