Is the loss computation in UnigramTrainer correct?

Open · mbollmann opened this issue · 1 comment

When computing logsum_alt, the frequency of a removed piece is re-assigned to alternatives:

https://github.com/google/sentencepiece/blob/ba7e11a17f606327d0652528d58d2dd8cd265c6f/src/unigram_model_trainer.cc#L389-L394

But the code uses alternatives.size() which, if I'm not mistaken, is always equal to sentencepieces.size(). Don't we want to multiply by the number of alternatives for this particular sentencepiece, i.e., alternatives[i].size()? @taku910?

mbollmann avatar Feb 18 '21 14:02 mbollmann

@taku910 I ran into the issue when training a Transformer-XL model on various configurations of WikiText-103 using sentencepiece:

  1. My configuration for sentencepiece does not treat whitespace as a special character, thus my vocab includes phrases.
  2. I first detokenize WikiText-103, then train the Unigram tokenizer over the training split of the dataset, before running sentencepiece with a 256k vocab size and max piece length of 32.
  3. After tokenizing the training split, the number of tokens is ~26m (compared to 103m tokens for the closed vocabulary version of WikiText-103). This is a 4x reduction in tokens. Yet, the tokenized version of the validation set only has a 1.3x reduction.
  4. Just as a sanity check I re-trained the Unigram tokenizer over both the training and validation splits, but the number of tokens is roughly the same.
  5. A Transformer-XL model trained over this preprocessed dataset severely overfits the training set and generalizes very poorly (validation ppl ~52.8).
  6. When I make the proposed fix (changing alternatives.size() to alternatives[i].size()) and re-train the tokenizer over just the training split, the discrepancy in token counts between the training and validation splits disappears (both show an approximately 1.3x reduction in the number of tokens).
  7. A Transformer-XL model trained over the fixed version of sentencepiece gets a reasonable perplexity (validation ppl 27.9, compared to the 23.9 ppl I get when training on the closed-vocabulary version of WikiText-103).

I looked at the code and @mbollmann seems to be correct that alternatives.size() is always equal to sentencepieces.size().

dojoteef avatar Jan 12 '22 15:01 dojoteef

Sorry for the late response. Yes, the computation was incorrect. It will be fixed in the next release.

taku910 avatar Apr 24 '23 08:04 taku910

Fixed in https://github.com/google/sentencepiece/releases/tag/v0.1.99

taku910 avatar May 02 '23 04:05 taku910