marian-dev
Sentencepiece options vs Vocab size
Bug description
Possibly this is a feature and not a bug.
Sometimes there's a conflict between --dim-vocabs and --sentencepiece-options "--character_coverage=1.0".
When SentencePiece enforces a character_coverage of 1.0, the resulting vocab can be larger than the size set with --dim-vocabs, and Marian then throws a dimension-mismatch error.
How to reproduce
Find a dataset for which --sentencepiece-options "--character_coverage=1.0" creates more tokens than --dim-vocabs allows, then run:
~/: $HOME/marian/build/marian --model $MODELDIR/model.npz --type transformer \
--train-sets train.en train.ja \
--vocabs vocab.src.spm vocab.trg.spm \
--dim-vocabs 8000 8000 \
--sentencepiece-options "--character_coverage=1.0"
Context
$ ~/marian/build/marian --version
v1.11.0 f00d0621 2022-02-08 08:39:24 -0800
For Marian devs:
Not sure what the right resolution in code is, though. Throw a warning to the user?
Or update dim-vocabs to max(sp_vocab_size, 8000)?
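The two ideas raised here (warn the user, or grow the dimension) could be combined roughly as follows. This is only an illustrative Python sketch, not Marian's C++ code: `reconcile_dim_vocab` and its parameters are hypothetical names, and where such a check would actually live in Marian is left open.

```python
import warnings


def reconcile_dim_vocab(requested_dim: int, sp_vocab_size: int) -> int:
    """Return the vocabulary dimension to use for one side (source or target).

    Hypothetical helper: neither the name nor the behaviour comes from Marian.
    """
    if sp_vocab_size > requested_dim:
        # Tell the user that the requested size is being overridden.
        warnings.warn(
            f"SentencePiece vocabulary has {sp_vocab_size} entries, more than "
            f"--dim-vocabs {requested_dim}; using {sp_vocab_size} instead."
        )
    # Same rule as the max(sp_vocab_size, 8000) suggestion, with 8000
    # generalised to whatever the user requested.
    return max(sp_vocab_size, requested_dim)
```

With this rule a 9500-entry vocab and `--dim-vocabs 8000` would yield 9500 plus a warning, while a smaller vocab would leave the requested 8000 untouched.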
For Marian users, to avoid this feature/bug:
Set --dim-vocabs to a value larger than the vocab size created by SentencePiece.
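One rough way to pick a safe value before training: with full character coverage, every distinct character in the training data should end up with its own piece, so the number of distinct characters gives an approximate lower bound on the vocab size. Below is a minimal Python sketch, assuming plain-text UTF-8 training files. `count_distinct_chars`, `dim_vocab_is_safe` and the `reserved` allowance are hypothetical, and the count SentencePiece actually arrives at may differ (for example because of its text normalization).

```python
from typing import Iterable


def count_distinct_chars(lines: Iterable[str]) -> int:
    """Count distinct non-whitespace characters across the given lines."""
    chars = set()
    for line in lines:
        chars.update(ch for ch in line if not ch.isspace())
    return len(chars)


def dim_vocab_is_safe(lines: Iterable[str], dim_vocab: int, reserved: int = 4) -> bool:
    """Heuristic: True if dim_vocab has room for every character plus reserved pieces.

    `reserved` is an assumed allowance for special pieces (unknown token,
    sentence markers, whitespace marker); it is not taken from Marian or
    SentencePiece.
    """
    return count_distinct_chars(lines) + reserved <= dim_vocab
```

For example, passing an open handle on train.ja together with 8000 returns False when the file contains more distinct characters than an 8000-entry vocab could hold, which suggests raising --dim-vocabs for that side.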