japanese-pretrained-models

add tokenizer & model

Open • azonti opened this issue 2 years ago • 7 comments

This PR adds tokenizers and modeling files so that your models work without special tips, except for the [MASK] problem.

For rinna/japanese-roberta-base:

Adding tokenization_roberta_japanese.py, modeling_roberta_japanese.py, and modeling_tf_roberta_japanese.py.

The differences between T5Tokenizer and RobertaJapaneseTokenizer

  1. RobertaJapaneseTokenizer will add [CLS] automatically.
  2. ~~In languages without inter-word whitespaces, such as Japanese and Chinese, you should have trained SentencePiece with --add_dummy_prefix set to false. With --add_dummy_prefix set to true, extra whitespace tokens will appear. This is why A) typing [MASK] directly into an input string and B) replacing a token with [MASK] after tokenization yield different token sequences. Therefore, RobertaJapaneseTokenizer has a workaround for this problem.~~
  3. ~~Removed the do_lower_case option, because it was not working in your pretraining code.~~
  4. Enabled the do_lower_case option.

The difference between RobertaModel and RobertaJapaneseModel

  1. position_ids start from 0. Therefore, it is no longer necessary to provide position_ids explicitly.
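
For reference, roughly what the current setup requires with the stock classes (a hedged sketch pieced together from this thread and the public usage notes for rinna/japanese-roberta-base; the manual [CLS], do_lower_case, and position_ids steps are exactly the "special tips" the proposed classes would remove):

```python
import torch
from transformers import T5Tokenizer, RobertaModel

tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")
tokenizer.do_lower_case = True  # the SentencePiece vocab contains no upper-case letters
model = RobertaModel.from_pretrained("rinna/japanese-roberta-base")

text = "[CLS]" + "こんにちは"  # [CLS] currently has to be prepended by hand
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)  # positions must start at 0

outputs = model(input_ids=token_ids, position_ids=position_ids)
```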

For rinna/japanese-gpt2-*:

Adding tokenization_gpt2_japanese.py.

The differences between T5Tokenizer and GPT2JapaneseTokenizer

  1. ~~In languages without inter-word whitespaces, such as Japanese and Chinese, you should have trained SentencePiece with --add_dummy_prefix set to false. With --add_dummy_prefix set to true, extra whitespace tokens will appear. Therefore, GPT2JapaneseTokenizer has a workaround for this problem.~~
  2. ~~Removed the do_lower_case option, because it was not working in your pretraining code.~~
  3. Enabled the do_lower_case option.
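
For context, loading these GPT-2 checkpoints with the stock T5Tokenizer currently looks roughly like this (a hedged sketch; rinna/japanese-gpt2-medium stands in for any of the rinna/japanese-gpt2-* checkpoints, and the manual do_lower_case assignment is what item 3 would make unnecessary):

```python
from transformers import T5Tokenizer, AutoModelForCausalLM

tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-gpt2-medium")
tokenizer.do_lower_case = True  # otherwise upper-case ASCII tends to end up as <unk>
model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt2-medium")
```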

azonti • Aug 10 '22 23:08

I re-enabled the do_lower_case option, because a lot of <unk> tokens appear without it (why?).

azonti • Aug 12 '22 20:08

@azonti Sorry for the late reply, and thanks for the PR.

I noticed that in the workaround code for the whitespace-prefix issue, add_dummy_prefix is explicitly set to False. This can sometimes produce tokenization results that differ from the default T5Tokenizer, for example:

text = "お時間ありがとうございます"
a_t5_tokenizer.tokenize(text)  # ['▁お', '時間', 'ありがとう', 'ご', 'ざい', 'ます']
a_new_tokenizer.tokenize(text)  # ['お', '時間', 'ありがとう', 'ご', 'ざい', 'ます']

It might be okay if one uses this code for pretraining from scratch. But if someone uses it to finetune a pretrained model, this behavioural difference makes the tokenization of the finetuning data (which drops the prefix whitespace) inconsistent with that of the pretraining data (which keeps it).

We can probably find a better way than changing add_dummy_prefix.

ZHAOTING • Aug 23 '22 09:08

As for the do_lower_case problem, it is because the training data of this tokenizer does not contain any upper-case letters. If we do not enable do_lower_case as a preprocessing step, all upper-case letters in the input will end up as <unk>.
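
A quick way to see this behaviour (a minimal sketch; the exact pieces depend on the vocabulary, so the comments only describe the expected tendency):

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")

text = "NLPが好きです"
print(tokenizer.tokenize(text))   # upper-case letters are likely to come out as <unk>

tokenizer.do_lower_case = True    # lowercase the input before SentencePiece segmentation
print(tokenizer.tokenize(text))   # 'nlp' can now be matched against the vocabulary
```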

ZHAOTING • Aug 23 '22 09:08

OK, let's get the problems straight.

add_dummy_prefix Problem

  1. In languages without inter-word whitespaces, such as Chinese or Japanese, you should have pretrained with add_dummy_prefix disabled. This can no longer be fixed.
  2. I think they should disable add_dummy_prefix for inference or finetuning, even though it will be inconsistent with the pretraining configuration, in order to prevent the insertion of unnecessary whitespace during tokenization (a sketch of the mechanism follows below). I think the results will be robust enough, even under the inconsistency.
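
(For reference, flipping add_dummy_prefix on an already-trained SentencePiece model is usually done by editing the serialized model proto. The sketch below only shows the general technique; the file name is illustrative and this is not necessarily the exact code in the PR.)

```python
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2  # requires the protobuf package

# Parse the serialized SentencePiece model and disable the dummy whitespace prefix.
proto = sp_pb2.ModelProto()
with open("spiece.model", "rb") as f:  # path is illustrative
    proto.ParseFromString(f.read())
proto.normalizer_spec.add_dummy_prefix = False

# Load the modified proto and tokenize; no leading '▁' piece is inserted any more.
sp = spm.SentencePieceProcessor()
sp.LoadFromSerializedProto(proto.SerializeToString())
print(sp.encode("こんにちは", out_type=str))
```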

do_lower_case Problem

  1. Why doesn't the training data of the tokenizer contain any upper-case letters? (Was it manually lowercased?)

azonti • Aug 24 '22 16:08

> In languages without inter-word whitespaces, such as Chinese or Japanese, you should have pretrained with add_dummy_prefix disabled. This can no longer be fixed.

I totally agree with it.

> Why doesn't the training data of the tokenizer contain any upper-case letters? (Was it manually lowercased?)

I do not remember the exact reason. But looking back at it, it was just a really bad decision.

Regarding the above two points, I think it is better to provide a better pretrained tokenizer (and possibly corresponding new pretrained models). This is going to happen in the near future.

ZHAOTING • Aug 25 '22 06:08

> I think they should disable add_dummy_prefix for inference or finetuning, even though it will be inconsistent with the pretraining configuration, in order to prevent the insertion of unnecessary whitespace during tokenization.

Sorry, but I think keeping consistency is more important than occasionally being able to skip a few steps when fixing the [MASK] token.

> I think the results will be robust enough, even under the inconsistency.

The downside of using the current tokenizer is completely known (which is the [MASK] problem), and we know how to solve it.

But the downside of enforcing add_dummy_prefix = False cannot be foreseen. It could cause performance degradation in various downstream tasks. If we are saying "it is going to be robust", I believe we need tests to verify that hypothesis.
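
(For readers following along, the known solution mentioned above is, as far as I can tell, to insert [MASK] after tokenization rather than typing it into the raw string, so the pretrained SentencePiece settings stay untouched. Below is a hedged sketch along the lines of the rinna/japanese-roberta-base usage notes; the masked index is purely illustrative, and using tokenizer.mask_token assumes the checkpoint's tokenizer config defines [MASK].)

```python
import torch
from transformers import T5Tokenizer, RobertaForMaskedLM

tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")
tokenizer.do_lower_case = True
model = RobertaForMaskedLM.from_pretrained("rinna/japanese-roberta-base")

text = "[CLS]" + "4年に1度オリンピックは開かれる。"
tokens = tokenizer.tokenize(text)
tokens[5] = tokenizer.mask_token  # mask a piece *after* tokenization (index is illustrative)

input_ids = torch.LongTensor([tokenizer.convert_tokens_to_ids(tokens)])
position_ids = torch.arange(len(tokens)).unsqueeze(0)  # positions start at 0

with torch.no_grad():
    logits = model(input_ids=input_ids, position_ids=position_ids).logits
top5 = logits[0, 5].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top5))  # candidate fillers for the masked piece
```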

ZHAOTING • Aug 25 '22 06:08

OK, I agree with your opinion, so I have removed the workaround for add_dummy_prefix.

azonti • Aug 27 '22 00:08