paraphrastic-representations-at-scale

ParaNMT: sentencepiece model and .pt model vocab and embedding size mismatch?

Open sweco opened this issue 3 years ago • 1 comment

First of all, thank you for your contribution to the STS task, we're very excited to get our hands on the models you provided! 😊

Problem

According to the README, the pre-trained ParaNMT model model.para.lc.100.pt should be used together with the sentencepiece model paranmt.model when scoring sentence pairs (we also tested fine-tuning).

# README.md

python -u score_sentence_pairs.py \
  --sentence-pair-file paraphrase-at-scale/example-sentences-pairs.txt \
  --load-file paraphrase-at-scale/model.para.lc.100.pt \
  --sp-model paraphrase-at-scale/paranmt.model \
  --gpu 0

However, we observe a potential mismatch between the model.para.lc.100.pt embedding layer size and the size of vocabulary of the paranmt.model sentencepiece model.

  • model.para.lc.100.pt has an embedding layer of shape Embedding(82983, 1024).
  • paranmt.model sentencepiece model has a vocab size of 50000.
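
For completeness, the two numbers can also be checked without going through the repo code. A minimal sketch, assuming torch and sentencepiece are installed and the released files sit under paraphrase-at-scale/ as in the README; the checkpoint layout is our assumption and may need adjusting:

# Standalone size check (illustrative).
import torch
import sentencepiece as spm

ckpt = torch.load("paraphrase-at-scale/model.para.lc.100.pt", map_location="cpu")
# We assume a dict-style checkpoint; if the file pickles the whole model instead,
# take model.state_dict() after loading it through the repo's load_model.
state = ckpt["state_dict"] if isinstance(ckpt, dict) and "state_dict" in ckpt else ckpt
for name, tensor in state.items():
    if hasattr(tensor, "shape") and "embedding" in name:
        print(name, tuple(tensor.shape))        # we observe (82983, 1024)

sp = spm.SentencePieceProcessor()
sp.Load("paraphrase-at-scale/paranmt.model")
print("sentencepiece vocab size:", sp.vocab_size())  # we observe 50000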

We analyzed the situation further by printing various tokens from the paranmt.model sentencepiece model; they are identical to the tokens in paranmt.vocab. However, model.para.lc.100.pt was most probably trained using paranmt.sim-low=0.4-sim-high=1.0-ovl=0.7.final.vocab, whose size is exactly 82982 tokens (we don't know why the off-by-one is there 😄 ). We thus believe that a different sentencepiece model (one with 82983 tokens) should be used with model.para.lc.100.pt in order to get correct results.
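
The 82982 figure comes from counting the entries in that vocab file, e.g. with something like the following (the path is our assumption about where the file was downloaded):

# Count entries in the released vocab file; we get 82982 lines.
with open("paraphrase-at-scale/paranmt.sim-low=0.4-sim-high=1.0-ovl=0.7.final.vocab",
          encoding="utf-8") as f:
    print(sum(1 for _ in f))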

# score_sentence_pairs.py

model, _ = load_model(None, args)
print(model.embedding)  # Embedding(82983, 1024)
print(model.sp.vocab_size())  # 50000

# All of the following tokens agree with the order in the `paranmt.vocab`
print(model.sp.id_to_piece(0))  # <unk>
print(model.sp.id_to_piece(4))  # ▁the
print(model.sp.id_to_piece(10))  # ▁i
print(model.sp.piece_to_id('▁i'))  # 10

Possible solutions

We see two possible solutions:

  1. Publish the sentencepiece model that was used to train model.para.lc.100.pt.
  2. Publish a model trained on ParaNMT that used the paranmt.model sentencepiece model during training. As a sanity check, such a model should have an embedding layer with 50000 rows (see the sketch below).
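
Either way, a simple guard in score_sentence_pairs.py right after loading would catch this kind of mismatch early. A minimal sketch, reusing the model.embedding and model.sp objects printed above (this check is our suggestion, not an existing feature of the script):

# score_sentence_pairs.py (sketch): fail fast when the checkpoint and the
# sentencepiece model disagree on vocabulary size.
model, _ = load_model(None, args)
assert model.embedding.num_embeddings == model.sp.vocab_size(), (
    f"embedding rows ({model.embedding.num_embeddings}) != "
    f"sentencepiece vocab size ({model.sp.vocab_size()})"
)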

If we missed something, could you please explain how the models were meant to be used? If we're right, would it please be possible to share the remaining resources?

sweco avatar Jul 28 '21 18:07 sweco

I am facing the same issue

tanmaylaud avatar Oct 05 '21 22:10 tanmaylaud