
"▁" character can be separated when using BPE-dropout

Open TIXFeniks opened this issue 6 years ago • 11 comments

When BPE-dropout is enabled, the word-boundary character ('▁') can be separated from the first character of the word. This scenario is untested and could be harmful for model training.

Steps to reproduce:

>>> bpe.encode(["hello"], output_type=yttm.OutputType.SUBWORD, dropout_prob=1.0)
[['▁', 'h', 'e', 'l', 'l', 'o']]

(This also happens with dropout_prob < 1; I used 1 only to make the result deterministic.)
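For background on why this happens: BPE-dropout randomly skips each candidate merge at encoding time, so with dropout_prob=1.0 no merges are applied at all and every symbol, including the '▁' marker, is left as its own token. A simplified pure-Python sketch of the mechanism (the merge table and function name here are illustrative, not YTTM internals):

```python
import random

WORD_BOUNDARY = "\u2581"  # '▁'

def bpe_dropout_encode(word, merges, dropout_prob, rng=random):
    """Simplified sketch of BPE with dropout: apply merge rules in
    priority order, skipping each candidate merge with probability
    dropout_prob."""
    # A word is represented as the boundary marker plus its characters.
    tokens = [WORD_BOUNDARY] + list(word)
    for pair in merges:
        i = 0
        while i < len(tokens) - 1:
            if (tokens[i], tokens[i + 1]) == pair and rng.random() >= dropout_prob:
                tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
            else:
                i += 1
    return tokens

# Hypothetical merge table; a real one is learned from corpus statistics.
merges = [(WORD_BOUNDARY, "h"), ("l", "l"), ("e", "ll"), ("h", "ell")]

# With dropout_prob=1.0 every merge is dropped, so '▁' is stranded alone:
print(bpe_dropout_encode("hello", merges, dropout_prob=1.0))
# ['▁', 'h', 'e', 'l', 'l', 'o']

# With dropout disabled, the boundary marker merges as usual:
print(bpe_dropout_encode("hello", merges, dropout_prob=0.0))
# ['▁h', 'ell', 'o']
```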

Perhaps this behavior could be controlled by a flag to always merge '▁' and the next token?

TIXFeniks avatar Apr 04 '20 10:04 TIXFeniks

second this.

tnq177 avatar Apr 09 '20 07:04 tnq177

Hi. This behaviour isn't related to BPE-dropout.

If the characters '▁' and 'h' were not merged, it means they did not occur together often enough. Instead, the algorithm combined more frequent, and most likely more useful, pairs of characters.

Could you describe in more detail why these symbols should be merged with higher priority?

xbelonogov avatar Apr 09 '20 13:04 xbelonogov

@xbelonogov I think @TIXFeniks is referring to the special token '▁' that marks word boundaries, not the underscore '_'.

tnq177 avatar Apr 09 '20 13:04 tnq177

Yes, I also meant this special token '▁'. (Edited the previous comment)

xbelonogov avatar Apr 10 '20 08:04 xbelonogov

@xbelonogov I think '▁' should not be a token on its own, but should always be attached to another token to indicate that it's a subword, no?

tnq177 avatar Apr 10 '20 11:04 tnq177

It is not obvious to me. In practice, for a reasonably large vocabulary, the special token '▁' is almost always merged with the first symbol.

xbelonogov avatar Apr 11 '20 08:04 xbelonogov

I'm not 100% clear on how BPE is implemented in YTTM, but let's take subword-nmt as an example. In subword-nmt, the word-separator character (usually the space " ") is not treated as part of the training data when learning BPE; only pairs of symbols within a word are considered. If a word is split into BPE subwords, the special suffix "@@" is appended to every subword except the last. For example, "hello" --> "he@@ llo". With BPE-dropout applied, it could become something like "h@@ e@@ ll@@ o". The special token "@@" can never be a token on its own; it is always attached to real tokens to indicate how to merge the subwords back together later.
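The subword-nmt convention described above can be sketched in a few lines of Python (the function names here are mine, not subword-nmt's API):

```python
def to_subword_nmt(pieces):
    """Render one word's subword pieces in subword-nmt notation:
    every piece except the last gets the '@@' continuation suffix."""
    return [p + "@@" for p in pieces[:-1]] + [pieces[-1]]

def merge_subword_nmt(tokens):
    """Undo the segmentation: glue a token to the next one whenever it
    carries the '@@' suffix, keeping real spaces between words."""
    return " ".join(tokens).replace("@@ ", "")

print(to_subword_nmt(["he", "llo"]))                    # ['he@@', 'llo']
print(merge_subword_nmt(["h@@", "e@@", "ll@@", "o"]))   # hello
```

Because "@@" exists only as a suffix glued to a real piece, no amount of dropout can ever produce it as a standalone token.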

In this case, '▁' should behave similarly. If '▁' is a separate token, an NMT model could mistakenly learn to generate '▁' on its own, and even after merging subwords there are stray spaces left between words. For example, my Slovak-English model generates this sentence: When I was 11 , I remember one mor ning w ak ing up the j oy ful sound s in my house .

tnq177 avatar Apr 12 '20 04:04 tnq177

YTTM is very similar to subword-nmt. The difference is the following:

  • In YTTM you can specify the exact number of tokens in the output vocabulary.
  • In Subword-nmt you specify the number of joining operations.

Subword-nmt creates 2 tokens for each character of the alphabet: the original one and one with @@ at the end. That can be a problem if you are working with a language with a big alphabet, like Chinese: the final vocabulary may be too large.

The way word splitting works is equivalent:

  • In subword-nmt there are two types of suffixes: [empty] and @@. The first marks the end of a word, the second marks a continuation.
  • In YTTM there are two types of prefixes: ▁ and [empty]. The first marks the beginning of a word, the second marks a continuation.
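The equivalence of the two conventions can be made concrete by converting subword-nmt's suffix notation into YTTM-style prefix notation (a sketch; the function name is hypothetical):

```python
WORD_BOUNDARY = "\u2581"  # '▁'

def suffix_to_prefix(tokens):
    """Convert subword-nmt tokens ('@@' suffix marks continuation)
    into YTTM-style tokens ('▁' prefix marks a word beginning)."""
    out = []
    word_start = True
    for tok in tokens:
        cont = tok.endswith("@@")
        core = tok[:-2] if cont else tok
        out.append((WORD_BOUNDARY + core) if word_start else core)
        # The next token starts a new word only if this one ended a word.
        word_start = not cont
    return out

print(suffix_to_prefix(["he@@", "llo", "wor@@", "ld"]))
# ['▁he', 'llo', '▁wor', 'ld']
```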

Regarding your problem: can you check how often this special token occurs alone in your training data? I just checked on an English dataset; it occurs on average once in 50 sentences, so I think it should not affect performance.

You can also explore a YTTM model with the following command:

yttm vocab --model model.yttm --verbose

It's easy to see that all tokens like ▁t, ▁h, etc. exist.

xbelonogov avatar Apr 12 '20 08:04 xbelonogov

> Regarding your problem. Can you check how often this special token occurs alone in your train data? I just checked on some english dataset. This occurs on average once in 50 sentences. So I think this should not affect performance.

Yes, being separated from the word is indeed rare without BPE-dropout, but once BPE-dropout is introduced this happens very often: once every couple of sentences in my real case, or on every word in the example I provided above with a dropout probability of 1.

The solution could be the ability to disable dropout for such merges (the ones involving '▁') while keeping it for all the others.
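As a stopgap, this can also be handled in post-processing on the encoder output: merge any standalone '▁' token into the token that follows it. This is a sketch, not part of the YTTM API:

```python
WORD_BOUNDARY = "\u2581"  # '▁'

def reattach_boundary(tokens):
    """Merge every standalone '▁' token into the following token, so the
    boundary marker never appears as a token of its own."""
    out = []
    i = 0
    while i < len(tokens):
        if tokens[i] == WORD_BOUNDARY and i + 1 < len(tokens):
            out.append(WORD_BOUNDARY + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(reattach_boundary(["\u2581", "h", "e", "l", "l", "o"]))
# ['▁h', 'e', 'l', 'l', 'o']
```

Note that this only works on subword strings: if you need token IDs, the merged piece must itself exist in the vocabulary, which is why handling it inside the encoder (before the ID lookup) would be cleaner.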

TIXFeniks avatar Apr 28 '20 11:04 TIXFeniks

@xbelonogov, what do you think about my suggestion from the previous message?

TIXFeniks avatar Jul 07 '20 08:07 TIXFeniks

Hi, @TIXFeniks. Your suggestion looks reasonable, but I don't want to add one more option for disabling this type of split. Every new option makes the interface more cumbersome and decreases usability. I am okay with doing this by default.

I asked Ivan Provilkov, but he isn't sure that this improves performance. If you have experiments that prove its effectiveness, I will change the default behaviour.

xbelonogov avatar Jul 10 '20 15:07 xbelonogov