japanese-pretrained-models

add tokenizer & model

Open • azonti opened this issue 2 years ago • 7 comments

This PR adds tokenizers and modeling files so that your models work without special tips, except for the [MASK] problem.

For rinna/japanese-roberta-base:

Adding tokenization_roberta_japanese.py, modeling_roberta_japanese.py, and modeling_tf_roberta_japanese.py.

The differences between T5Tokenizer and RobertaJapaneseTokenizer

  1. RobertaJapaneseTokenizer will add [CLS] automatically.
  2. ~~In languages without inter-word whitespaces, such as Japanese and Chinese, you should have trained SentencePiece with --add_dummy_prefix set to false. With --add_dummy_prefix set to true, extra whitespace tokens will appear. This is why A) typing [MASK] directly into an input string and B) replacing a token with [MASK] after tokenization yield different token sequences. Therefore, RobertaJapaneseTokenizer has a workaround for this problem.~~
  3. ~~Removed the do_lower_case option, because it was not working in your pretraining code.~~
  4. Enabled the do_lower_case option.

The difference between RobertaModel and RobertaJapaneseModel

  1. position_ids start from 0. Therefore, it is no longer necessary to provide position_ids explicitly.
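
For reference, roughly what the current setup requires with the stock classes (a hedged sketch pieced together from this thread and the public usage notes for rinna/japanese-roberta-base; the manual [CLS], do_lower_case, and position_ids steps are exactly the "special tips" the proposed classes would remove):

```python
import torch
from transformers import T5Tokenizer, RobertaModel

tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")
tokenizer.do_lower_case = True  # the SentencePiece vocab contains no upper-case letters
model = RobertaModel.from_pretrained("rinna/japanese-roberta-base")

text = "[CLS]" + "こんにちは"  # [CLS] currently has to be prepended by hand
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)  # positions must start at 0

outputs = model(input_ids=token_ids, position_ids=position_ids)
```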

For rinna/japanese-gpt2-*:

Adding tokenization_gpt2_japanese.py.

The differences between T5Tokenizer and GPT2JapaneseTokenizer

  1. ~~In languages without inter-word whitespaces, such as Japanese and Chinese, you should have trained SentencePiece with --add_dummy_prefix set to false. With --add_dummy_prefix set to true, extra whitespace tokens will appear. Therefore, GPT2JapaneseTokenizer has a workaround for this problem.~~
  2. ~~Removed the do_lower_case option, because it was not working in your pretraining code.~~
  3. Enabled the do_lower_case option.
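
For context, loading these GPT-2 checkpoints with the stock T5Tokenizer currently looks roughly like this (a hedged sketch; rinna/japanese-gpt2-medium stands in for any of the rinna/japanese-gpt2-* checkpoints, and the manual do_lower_case assignment is what item 3 would make unnecessary):

```python
from transformers import T5Tokenizer, AutoModelForCausalLM

tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-gpt2-medium")
tokenizer.do_lower_case = True  # otherwise upper-case ASCII tends to end up as <unk>
model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt2-medium")
```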

azonti • Aug 10 '22 23:08

I re-enabled the do_lower_case option, because a lot of <unk> tokens appear without it (why?).

azonti • Aug 12 '22 20:08

@azonti Sorry for the late reply, and thanks for the PR.

I noticed that in the workaround code for the whitespace-prefix issue, add_dummy_prefix is explicitly set to False. This can sometimes produce tokenization results that differ from the default T5Tokenizer, for example:

text = "お時間ありがとうございます"
a_t5_tokenizer.tokenize(text)  # ['▁お', '時間', 'ありがとう', 'ご', 'ざい', 'ます']
a_new_tokenizer.tokenize(text)  # ['お', '時間', 'ありがとう', 'ご', 'ざい', 'ます']

It might be okay if one uses this code for pretraining from scratch. But if someone uses it to finetune a pretrained model, this behavioural difference makes the tokenization of the finetuning data (which drops the prefix whitespace) inconsistent with that of the pretraining data (which keeps it).

We can probably find a better way than changing add_dummy_prefix.

ZHAOTING • Aug 23 '22 09:08

As for the do_lower_case problem, it is because the training data of this tokenizer does not contain any upper-case letters. If we do not enable do_lower_case as a preprocessing step, all upper-case letters in the input will end up as <unk>.
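
A quick way to see this behaviour (a minimal sketch; the exact pieces depend on the vocabulary, so the comments only describe the expected tendency):

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")

text = "NLPが好きです"
print(tokenizer.tokenize(text))   # upper-case letters are likely to come out as <unk>

tokenizer.do_lower_case = True    # lowercase the input before SentencePiece segmentation
print(tokenizer.tokenize(text))   # 'nlp' can now be matched against the vocabulary
```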

ZHAOTING • Aug 23 '22 09:08

OK, let's get the problems straight.

add_dummy_prefix Problem

  1. In languages without inter-word whitespaces, such as Chinese or Japanese, you should have pretrained with add_dummy_prefix disabled. This can no longer be fixed.
  2. I think they should disable add_dummy_prefix for inference or finetuning, even though it will be inconsistent with the pretraining configuration, in order to prevent the insertion of unnecessary whitespace during tokenization (a sketch of the mechanism follows below). I think the results will be robust enough, even under the inconsistency.
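
(For reference, flipping add_dummy_prefix on an already-trained SentencePiece model is usually done by editing the serialized model proto. The sketch below only shows the general technique; the file name is illustrative and this is not necessarily the exact code in the PR.)

```python
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2  # requires the protobuf package

# Parse the serialized SentencePiece model and disable the dummy whitespace prefix.
proto = sp_pb2.ModelProto()
with open("spiece.model", "rb") as f:  # path is illustrative
    proto.ParseFromString(f.read())
proto.normalizer_spec.add_dummy_prefix = False

# Load the modified proto and tokenize; no leading '▁' piece is inserted any more.
sp = spm.SentencePieceProcessor()
sp.LoadFromSerializedProto(proto.SerializeToString())
print(sp.encode("こんにちは", out_type=str))
```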

do_lower_case Problem

  1. Why doesn't the training data of the tokenizer contain any upper-case letters? (Was it manually lowercased?)

azonti • Aug 24 '22 16:08

> In languages without inter-word whitespaces, such as Chinese or Japanese, you should have pretrained with add_dummy_prefix disabled. This can no longer be fixed.

I totally agree with it.

> Why doesn't the training data of the tokenizer contain any upper-case letters? (Was it manually lowercased?)

I do not remember the exact reason. But looking back at it, it was just a really bad decision.

Regarding the above two points, I think it is better to provide a better pretrained tokenizer (and possibly corresponding new pretrained models). This is going to happen in the near future.

ZHAOTING • Aug 25 '22 06:08

> I think they should disable add_dummy_prefix for inference or finetuning, even though it will be inconsistent with the pretraining configuration, in order to prevent the insertion of unnecessary whitespace during tokenization.

Sorry, but I think keeping consistency is more important than occasionally being able to skip a few steps when fixing the [MASK] token.

> I think the results will be robust enough, even under the inconsistency.

The downside of using the current tokenizer is completely known (which is the [MASK] problem), and we know how to solve it.

But the downside of enforcing add_dummy_prefix = False cannot be foreseen. It could cause performance degradation in various downstream tasks. If we are saying "it is going to be robust", I believe we need tests to verify that hypothesis.
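
(For readers following along, the known solution mentioned above is, as far as I can tell, to insert [MASK] after tokenization rather than typing it into the raw string, so the pretrained SentencePiece settings stay untouched. Below is a hedged sketch along the lines of the rinna/japanese-roberta-base usage notes; the masked index is purely illustrative, and using tokenizer.mask_token assumes the checkpoint's tokenizer config defines [MASK].)

```python
import torch
from transformers import T5Tokenizer, RobertaForMaskedLM

tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")
tokenizer.do_lower_case = True
model = RobertaForMaskedLM.from_pretrained("rinna/japanese-roberta-base")

text = "[CLS]" + "4年に1度オリンピックは開かれる。"
tokens = tokenizer.tokenize(text)
tokens[5] = tokenizer.mask_token  # mask a piece *after* tokenization (index is illustrative)

input_ids = torch.LongTensor([tokenizer.convert_tokens_to_ids(tokens)])
position_ids = torch.arange(len(tokens)).unsqueeze(0)  # positions start at 0

with torch.no_grad():
    logits = model(input_ids=input_ids, position_ids=position_ids).logits
top5 = logits[0, 5].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top5))  # candidate fillers for the masked piece
```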

ZHAOTING • Aug 25 '22 06:08

OK, I agree with your opinion, so I have removed the workaround for add_dummy_prefix.

azonti • Aug 27 '22 00:08