
Only 204 unique tokens (vocabulary size) in enwik8 (transformer-XL example)

Open • chenwydj opened this issue 1 year ago • 3 comments

**Describe the bug**
When running the transformer-XL example on enwik8, the log reports a final vocabulary size of only 204 unique tokens for the enwik8 training set.

**To Reproduce**
Steps to reproduce the behavior: `bash ./scripts/run_enwik8_base.sh train`

**Expected behavior**
I am not sure what the vocabulary size for enwik8 should be, but I expect it to be much larger than 204.

**Logs**

    Run training...
    Experiment dir : LM-TFM-enwik8/20230706-192048
    Producing dataset enwik8...
    building vocab with min_freq=0, max_size=None
    final vocab size 204 from 204 unique tokens

Checking in pdb confirms the corpus vocabulary really has 204 entries:

    > /home/username/fastmoe/examples/transformer-xl/train.py(194)<module>()
    -> ntokens = len(corpus.vocab)
    (Pdb) len(corpus.vocab)
    204
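For context on what number to expect: enwik8 is a character-level (byte-level) language-modeling benchmark, so the vocabulary is the set of distinct symbols occurring in the text, not a word vocabulary. One way to sanity-check the logged number independently of the training script is to count the distinct byte values in the raw dump. This is a minimal sketch, separate from fastmoe; the file name `enwik8` is an assumption and should point at the downloaded 100 MB dump:

```python
from collections import Counter

# Count the distinct byte values in the raw enwik8 dump (the first 100 MB of
# an English Wikipedia XML dump). At the character/byte level, this is an
# upper bound on the vocabulary the training script can build.
with open("enwik8", "rb") as f:  # path is an assumption; adjust as needed
    counts = Counter(f.read())   # iterating over bytes yields ints 0..255

print(f"{len(counts)} distinct byte values occur in enwik8")
# A result close to 204 would match the logged vocabulary size, since the
# training split contains only a subset of all possible byte values.
```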

**Platform**

  • Device: NVIDIA Quadro RTX 8000
  • OS: Ubuntu 18.04
  • CUDA version: 11.4
  • PyTorch version: 1.10.0


chenwydj • Jul 07 '23 03:07