marian-dev

Sentencepiece options vs Vocab size

Open • alvations opened this issue 3 years ago • 1 comment

Bug description

Possibly this is a feature and not a bug.

Sometimes there's a conflict between

  • --dim-vocabs and
  • --sentencepiece-options "--character_coverage=1.0"

When SentencePiece enforces character_coverage=1.0 and the resulting vocab size is larger than the --dim-vocabs value, Marian throws an error about mismatched dimensions.

How to reproduce

Find a dataset for which --sentencepiece-options "--character_coverage=1.0" creates more tokens than --dim-vocabs allows, then train on it:

~/: $HOME/marian/build/marian --model $MODELDIR/model.npz --type transformer \
--train-sets train.en train.ja \
--vocabs vocab.src.spm vocab.trg.spm \
--dim-vocabs 8000 8000 \
--sentencepiece-options "--character_coverage=1.0" 

Context

$ ~/marian/build/marian --version
v1.11.0 f00d0621 2022-02-08 08:39:24 -0800

alvations, May 27 '22 18:05

For marian devs,

I'm not sure what the right resolution in code is, though. Throw a warning to the user? Update dim-vocabs to max(sp_vocab_size, 8000)?
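To make the two suggested options concrete, here is a minimal shell sketch of the "warn and take the max" logic. It is only an illustration and not Marian code: the function name effective_dim_vocab is hypothetical, and sp_vocab_size stands for whatever size SentencePiece actually produced.

```shell
# Hypothetical helper, not part of Marian: warn when the SentencePiece vocab
# is larger than the requested dim-vocabs, then return the larger of the two.
effective_dim_vocab() {
  edv_sp_size="$1"     # vocab size produced by SentencePiece
  edv_requested="$2"   # value passed to --dim-vocabs
  if [ "$edv_sp_size" -gt "$edv_requested" ]; then
    # The warning goes to stderr so stdout carries only the number.
    echo "warning: SentencePiece vocab ($edv_sp_size) exceeds --dim-vocabs ($edv_requested)" >&2
    echo "$edv_sp_size"
  else
    echo "$edv_requested"
  fi
}

# Example: a 9000-piece vocab with --dim-vocabs 8000
effective_dim_vocab 9000 8000
```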


For marian users, to avoid this feature/bug:

Set --dim-vocabs to a value larger than the vocab size created by SentencePiece.
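One way to check this once a .spm file exists, as a sketch: export the vocabulary to a text file with one piece per line and compare its line count with the value you plan to pass. The helper name check_dim_vocab and the file vocab.src.txt are hypothetical, and the sketch assumes the spm_export_vocab tool from SentencePiece is available on your PATH.

```shell
# Hypothetical helper, not part of Marian: compare an exported SentencePiece
# vocabulary (one piece per line) against the intended --dim-vocabs value.
check_dim_vocab() {
  cdv_vocab_txt="$1"   # text file with one piece per line
  cdv_dim_vocab="$2"   # value you plan to pass to --dim-vocabs
  # Reading from stdin makes wc print only the count; tr strips any padding.
  cdv_size=$(wc -l < "$cdv_vocab_txt" | tr -d ' ')
  if [ "$cdv_size" -gt "$cdv_dim_vocab" ]; then
    echo "too small: vocab has $cdv_size pieces, --dim-vocabs is $cdv_dim_vocab"
    return 1
  fi
  echo "ok: vocab has $cdv_size pieces, --dim-vocabs is $cdv_dim_vocab"
}

# Example usage (assumes spm_export_vocab is installed):
#   spm_export_vocab --model=vocab.src.spm --output=vocab.src.txt
#   check_dim_vocab vocab.src.txt 8000
```

The function returns a non-zero status when the vocabulary is larger than the given value, so it can gate a training script.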

alvations, May 27 '22 18:05