fastmoe
Only 204 unique tokens (vocabulary size) in enwik8 (transformer-XL example)
Describe the bug
When running the transformer-XL example on enwik8, the log shows there are only 204 unique tokens (vocabulary size) in the enwik8 training set.
To Reproduce
Steps to reproduce the behavior:
bash ./scripts/run_enwik8_base.sh train
Expected behavior
I am not sure what the vocabulary size (number of unique tokens) for enwik8 should be, but I suppose it should be much larger.
Logs
Run training...
Experiment dir : LM-TFM-enwik8/20230706-192048
Producing dataset enwik8...
building vocab with min_freq=0, max_size=None
final vocab size 204 from 204 unique tokens
> /home/username/fastmoe/examples/transformer-xl/train.py(194)<module>()
-> ntokens = len(corpus.vocab)
(Pdb) len(corpus.vocab)
204
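For context, the reported vocabulary is just the number of distinct character tokens seen in the training split (the upstream Transformer-XL setup treats enwik8 as a character-level dataset, where a vocabulary of roughly 200 symbols is plausible). A minimal sketch for independently checking that count is below; the file path is an assumption and should be pointed at your actual enwik8 training split:

```python
# Sketch: count distinct character-level tokens in a text, which is what
# the vocab builder reports as "unique tokens" for enwik8.

def count_unique_tokens(text: str) -> int:
    """Return the number of distinct characters (the char-level vocab size)."""
    return len(set(text))

if __name__ == "__main__":
    # Hypothetical usage: replace the sample string with the real split, e.g.
    #   text = open("data/enwik8/train.txt", encoding="utf-8").read()
    sample = "hello world"
    print(count_unique_tokens(sample))  # 8 distinct characters
```

If this independent count also lands near 204 on your training file, the log line reflects the data rather than a bug in the vocab builder.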
Platform
- Device: NVIDIA Quadro RTX 8000
- OS: Ubuntu 18.04
- CUDA version: 11.4
- PyTorch version: 1.10.0