YouTokenToMe
Is it possible to unset random seed for BPE-dropout?
In YouTokenToMe, BPE-dropout is always the same for the same input. That contradicts the idea described in the paper:
During segmentation, at each merge step some merges are randomly dropped with the probability p.
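For reference, a minimal sketch of that idea, assuming a hypothetical merge table keyed by priority; this is an illustration of the paper's procedure, not yttm's actual implementation:

import random

def bpe_dropout_encode(word, merges, p=0.1):
    """Segment `word` with BPE, dropping each candidate merge with
    probability p at every step, so repeated calls can differ."""
    symbols = list(word)
    while True:
        # Collect adjacent pairs that have a merge rule (lower priority value = applied earlier).
        candidates = [
            (merges[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in merges
        ]
        # BPE-dropout: each candidate merge is dropped with probability p at this step.
        candidates = [c for c in candidates if random.random() >= p]
        if not candidates:
            break
        _, i = min(candidates)  # apply the highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Toy merge table for illustration only.
merges = {("u", "n"): 0, ("r", "e"): 1, ("un", "re"): 2}
for _ in range(3):
    print(bpe_dropout_encode("unrelated", merges, p=0.3))

Because the dropped merges are resampled at every step, repeated calls on the same word can produce different segmentations.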
Could you please provide more information about the issue? I've tested yttm's BPE-dropout in the Python REPL and obtained a different subword tokenization on each run:
>>> for _ in range(5):
... bpe.encode("i do not observe such behavior", dropout_prob=0.3)
...
[4, 52, 1644, 57, 16465, 1423, 78, 63, 1167, 31193, 1104, 19376, 73, 9407, 73, 9670, 52, 1936]
[4, 52, 1644, 57, 16465, 1423, 78, 63, 1167, 31193, 14245, 3730, 9407, 73, 9670, 52, 1936]
[4, 52, 19543, 2242, 57, 59, 1423, 78, 63, 51, 62, 79, 51, 14245, 3730, 4, 78, 51, 73, 9670, 52, 57, 62]
[4, 52, 19543, 16465, 1423, 78, 63, 1167, 79, 51, 14245, 3730, 9407, 73, 9670, 52, 1936]
[4, 52, 19543, 16465, 1423, 78, 63, 1167, 31193, 14245, 3730, 9407, 73, 9670, 52, 1936]
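To reproduce this more systematically, a small check along the same lines (the model path is a placeholder, substitute your own):

import youtokentome as yttm

bpe = yttm.BPE(model="model/path")  # placeholder path

sentence = "i do not observe such behavior"
# Count distinct tokenizations over repeated calls within one process;
# with working dropout this should be well above 1.
samples = {tuple(bpe.encode([sentence], dropout_prob=0.3)[0]) for _ in range(100)}
print(len(samples))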
Sure. Using yttm encode from the command line, I get identical output on every run:
$ for i in 1 2 3 4 5
> do
>   echo "i do observe such behavior" | yttm encode --model model/path --output_type subword --dropout_prob 0.3
> done
n_threads: 4
▁ i ▁do ▁ob s erve ▁s uc h ▁behavior
bytes processed: 26
n_threads: 4
▁ i ▁do ▁ob s erve ▁s uc h ▁behavior
bytes processed: 26
n_threads: 4
▁ i ▁do ▁ob s erve ▁s uc h ▁behavior
bytes processed: 26
n_threads: 4
▁ i ▁do ▁ob s erve ▁s uc h ▁behavior
bytes processed: 26
n_threads: 4
▁ i ▁do ▁ob s erve ▁s uc h ▁behavior
bytes processed: 26
My version is 1.0.6.
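For what it's worth, this pattern would be consistent with the internal RNG being constructed with a fixed default seed: each yttm encode invocation is a fresh process that replays the same random sequence, while the five REPL calls above share one process whose generator state advances between calls. That is an assumption on my part, not something verified against the code. If it holds, a workaround is to draw all dropout samples from a single long-lived process through the Python API:

import youtokentome as yttm

bpe = yttm.BPE(model="model/path")  # placeholder path

sentence = "i do observe such behavior"
for _ in range(5):
    # All five calls share one process, so the RNG state advances
    # between them and the segmentations can differ.
    print(bpe.encode([sentence],
                     output_type=yttm.OutputType.SUBWORD,
                     dropout_prob=0.3)[0])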