
Is it possible to unset random seed for BPE-dropout?

Open · skurzhanskyi opened this issue on Sep 17, 2020 · 2 comments

In YouTokenToMe, BPE-dropout always produces the same segmentation for a given input. That contradicts the idea described in the paper:

During segmentation, at each merge step some merges are randomly dropped with the probability p. 
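For context, here is a minimal toy sketch of the mechanism described in the paper. The merge table, function name, and seeding logic are made up for illustration; this is not yttm's actual implementation:

import random

# Toy merge table in priority order (earlier = higher priority).
# Purely illustrative; this is not how YouTokenToMe stores its merges.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]

def bpe_dropout_segment(word, p=0.1, seed=None):
    """Segment `word` with BPE, skipping each candidate merge with probability p."""
    rng = random.Random(seed)  # seed=None -> OS entropy, so each call can differ
    tokens = list(word)
    for left, right in MERGES:  # apply merge rules in priority order
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right and rng.random() >= p:
                tokens[i:i + 2] = [left + right]  # keep this merge
            else:
                i += 1  # merge dropped or pair doesn't match: move on
    return tokens

# Unseeded: repeated calls may give different segmentations of the same word.
for _ in range(3):
    print(bpe_dropout_segment("lower", p=0.5))

# Fixed seed: every call gives the same segmentation, which is the kind of
# determinism this issue is about.
for _ in range(3):
    print(bpe_dropout_segment("lower", p=0.5, seed=42))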

skurzhanskyi · Sep 17 '20 11:09

Could you please provide more information about the issue? I've tested yttm BPE-dropout in the Python REPL and obtained different subword tokenizations across runs:

>>> for _ in range(5):
...     bpe.encode("i do not observe such behavior", dropout_prob=0.3)
...
[4, 52, 1644, 57, 16465, 1423, 78, 63, 1167, 31193, 1104, 19376, 73, 9407, 73, 9670, 52, 1936]
[4, 52, 1644, 57, 16465, 1423, 78, 63, 1167, 31193, 14245, 3730, 9407, 73, 9670, 52, 1936]
[4, 52, 19543, 2242, 57, 59, 1423, 78, 63, 51, 62, 79, 51, 14245, 3730, 4, 78, 51, 73, 9670, 52, 57, 62]
[4, 52, 19543, 16465, 1423, 78, 63, 1167, 79, 51, 14245, 3730, 9407, 73, 9670, 52, 1936]
[4, 52, 19543, 16465, 1423, 78, 63, 1167, 31193, 14245, 3730, 9407, 73, 9670, 52, 1936]
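The same check can also be made with subword output, which is easier to eyeball. A sketch using the Python API; the model path below is just a placeholder:

import youtokentome as yttm

bpe = yttm.BPE(model="model/path")  # placeholder path to a trained model
for _ in range(3):
    # Request subwords instead of ids; the dropout should again vary across calls.
    print(bpe.encode(["i do not observe such behavior"],
                     output_type=yttm.OutputType.SUBWORD,
                     dropout_prob=0.3)[0])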

kefirski · Sep 20 '20 17:09

Sure. When using yttm encode from the command line, I get identical output on every run:

$ for i in 1 2 3 4 5
> do
>    echo "i do observe such behavior" | yttm encode --model model/path --output_type subword --dropout_prob 0.3
> done
n_threads: 4
▁ i ▁do ▁ob s erve ▁s uc h ▁behavior 
bytes processed: 26
n_threads: 4
▁ i ▁do ▁ob s erve ▁s uc h ▁behavior 
bytes processed: 26
n_threads: 4
▁ i ▁do ▁ob s erve ▁s uc h ▁behavior 
bytes processed: 26
n_threads: 4
▁ i ▁do ▁ob s erve ▁s uc h ▁behavior 
bytes processed: 26
n_threads: 4
▁ i ▁do ▁ob s erve ▁s uc h ▁behavior 
bytes processed: 26

My version is 1.0.6.

skurzhanskyi · Sep 20 '20 18:09