paraphrastic-representations-at-scale
ParaNMT: sentencepiece model and .pt model vocab and embedding size mismatch?
First of all, thank you for your contribution to the STS task; we're very excited to get our hands on the models you provided! 😊
## Problem
According to the README, the pre-trained ParaNMT model `model.para.lc.100.pt` should be used together with the sentencepiece model `paranmt.model` when scoring sentence pairs (we also tested fine-tuning):
```bash
# README.md
python -u score_sentence_pairs.py \
    --sentence-pair-file paraphrase-at-scale/example-sentences-pairs.txt \
    --load-file paraphrase-at-scale/model.para.lc.100.pt \
    --sp-model paraphrase-at-scale/paranmt.model \
    --gpu 0
```
However, we observe a potential mismatch between the embedding layer size of `model.para.lc.100.pt` and the vocabulary size of the `paranmt.model` sentencepiece model:
- `model.para.lc.100.pt` has an embedding layer of shape `Embedding(82983, 1024)`.
- The `paranmt.model` sentencepiece model has a vocab size of 50000.
We analyzed the situation further by printing various tokens from the `paranmt.model` sentencepiece model, and they are identical to the tokens in `paranmt.vocab`. However, `model.para.lc.100.pt` was most probably trained using `paranmt.sim-low=0.4-sim-high=1.0-ovl=0.7.final.vocab`, whose size is exactly 82982 tokens (we don't know where the off-by-one comes from 😄). We thus believe that a different sentencepiece model, one with 82983 tokens, should be used with `model.para.lc.100.pt` in order to get correct results.
```python
# score_sentence_pairs.py
model, _ = load_model(None, args)
print(model.embedding)        # Embedding(82983, 1024)
print(model.sp.vocab_size())  # 50000

# All of the following tokens agree with the order in `paranmt.vocab`
print(model.sp.id_to_piece(0))     # <unk>
print(model.sp.id_to_piece(4))     # ▁the
print(model.sp.id_to_piece(10))    # ▁i
print(model.sp.piece_to_id('▁i'))  # 10
```
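For reference, the same comparison can be reproduced without going through `score_sentence_pairs.py`. The sketch below is only illustrative: how the checkpoint stores its weights is an assumption on our part, so it simply takes the widest 2-D tensor as the token embedding matrix.

```python
import torch
import sentencepiece as spm

# Rough standalone check (the checkpoint layout is an assumption).
ckpt = torch.load("paraphrase-at-scale/model.para.lc.100.pt", map_location="cpu")
state = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt.state_dict()

# Heuristic: the token embedding should be the 2-D tensor with the most rows.
emb_rows = max(t.shape[0] for t in state.values()
               if isinstance(t, torch.Tensor) and t.dim() == 2)

sp = spm.SentencePieceProcessor()
sp.load("paraphrase-at-scale/paranmt.model")

print(emb_rows)         # expected to print 82983
print(sp.vocab_size())  # prints 50000
```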
## Possible solutions
We see two possible solutions:
- Publish the sentencepiece model that was used for training the `model.para.lc.100.pt` model.
- Publish a model trained on ParaNMT that used the `paranmt.model` sentencepiece model during training. A sanity check is that such a model should have an embedding layer of size 50000 (see the sketch below).
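Whichever option is chosen, a quick sanity check along the lines below should then pass. This is a hypothetical snippet that reuses `load_model` from `score_sentence_pairs.py` and assumes `model.embedding` is a standard `torch.nn.Embedding`, as the print above suggests.

```python
# Hypothetical sanity check: the embedding table and the sentencepiece vocab must agree.
model, _ = load_model(None, args)
assert model.embedding.num_embeddings == model.sp.vocab_size(), (
    f"mismatch: {model.embedding.num_embeddings} embedding rows "
    f"vs {model.sp.vocab_size()} sentencepiece tokens"
)
```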
If we missed something, could you please explain how the models were meant to be used? If we're right, would it be possible to share the remaining resources?
I am facing the same issue.