texar-pytorch

Why can't SentencePieceTokenizer save a vocab file?

Open Codle opened this issue 5 years ago • 3 comments

I want to use a vocab file in PairedDataloader, but the save_vocab function of SentencePieceTokenizer only saves the model file.

The model file can't be loaded by the Dataloader because of a decoding error.

In sentencepiece_tokenizer.py, I saw that you delete the vocab file.

Codle avatar Dec 29 '19 04:12 Codle

We delete the sentencepiece vocab file because the sentencepiece model file is fully self-contained, and the vocab file is never used by the tokenizer. To the best of my knowledge, the vocab file itself is not very useful. Here is a sample vocab file:

<unk>	0
<s>	0
</s>	0
,	-3.39764
.	-3.53133
▁the	-3.56031
s	-3.70819
▁	-3.82609
▁I	-3.90308
▁to	-4.04041
▁a	-4.08637
ed	-4.16661
▁and	-4.26836
▁of	-4.27461
t	-4.31782
e	-4.43336
d	-4.44333
ing	-4.46929
a	-4.53839
▁in	-4.64852
o	-4.71318
▁was	-4.77909
▁"	-4.81017
i	-4.86229
...

gpengzhi avatar Dec 30 '19 16:12 gpengzhi
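
For readers who still want a plain vocab file, the listing in the reply above suggests each line has the form `piece<TAB>score`. The following is a minimal sketch, not part of texar-pytorch, that parses such lines and drops the scores. The function names `parse_sentencepiece_vocab` and `to_token_per_line` are hypothetical, and the one-token-per-line output format is an assumption about what a vocab-based data loader expects. The resulting token order is not guaranteed to match the ids used by the sentencepiece model.

```python
from typing import Iterable, List, Tuple


def parse_sentencepiece_vocab(lines: Iterable[str]) -> List[Tuple[str, float]]:
    """Parse `piece<TAB>score` lines into (piece, score) pairs.

    Assumes the layout shown in the sample listing; blank lines are skipped.
    """
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last tab so the piece itself is kept intact.
        piece, sep, score = line.rpartition("\t")
        if not sep:
            raise ValueError(f"expected 'piece<TAB>score', got: {line!r}")
        entries.append((piece, float(score)))
    return entries


def to_token_per_line(entries: List[Tuple[str, float]]) -> str:
    """Render the pieces one per line, discarding the scores (assumed format)."""
    return "\n".join(piece for piece, _ in entries) + "\n"


sample = "<unk>\t0\n<s>\t0\n</s>\t0\n,\t-3.39764\n▁the\t-3.56031\n"
entries = parse_sentencepiece_vocab(sample.splitlines())
vocab_text = to_token_per_line(entries)
```

Splitting on the last tab rather than on arbitrary whitespace keeps pieces such as `▁` intact.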

@gpengzhi But how do I use the model file in PairedTextData? The model file seems to be usable only for restoring a tokenizer, so I created my own "PairedTextData" with two DataSource objects in order to use SentencePieceTokenizer in my project. Is there a simpler way?

Codle avatar Dec 31 '19 08:12 Codle
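
The workaround described above (two data sources, each tokenized directly instead of going through a vocab file) can be illustrated with a library-independent sketch. This is not the poster's actual code and does not use the texar-pytorch API: `paired_examples`, `src_tokenize`, `tgt_tokenize`, and `max_len` are hypothetical names, and the tokenizers are assumed to be any callables that map a string to a list of tokens.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

Tokenize = Callable[[str], List[str]]


def paired_examples(
    src_lines: Iterable[str],
    tgt_lines: Iterable[str],
    src_tokenize: Tokenize,
    tgt_tokenize: Tokenize,
    max_len: Optional[int] = None,
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (source_tokens, target_tokens) pairs from two aligned line streams.

    Pairs in which either side exceeds `max_len` tokens are skipped.
    """
    for src, tgt in zip(src_lines, tgt_lines):
        src_tokens = src_tokenize(src.strip())
        tgt_tokens = tgt_tokenize(tgt.strip())
        if max_len is not None and (
            len(src_tokens) > max_len or len(tgt_tokens) > max_len
        ):
            continue
        yield src_tokens, tgt_tokens


# Whitespace splitting stands in for a real subword tokenizer here.
pairs = list(
    paired_examples(
        ["a b\n", "c d e\n"],
        ["x\n", "y z\n"],
        str.split,
        str.split,
        max_len=2,
    )
)
```

In a real project the two callables would be replaced by the tokenization method of a tokenizer restored from the saved model file.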

Could you write down how you integrated the tokenizer with PairedTextData? There is a related issue, #256. I think we should provide an interface that uses a tokenizer instead of a vocab. Do you think you could contribute this feature enhancement? A pull request is welcome!

gpengzhi avatar Dec 31 '19 19:12 gpengzhi