japanese-pretrained-models
add tokenizer & model
Adding tokenizers and modeling files so that your models work without special tips, except for the [MASK] problem.
For `rinna/japanese-roberta-base`:
Adding `tokenization_roberta_japanese.py`, `modeling_roberta_japanese.py`, and `modeling_tf_roberta_japanese.py`.
The difference between `T5Tokenizer` and `RobertaJapaneseTokenizer`:
- `RobertaJapaneseTokenizer` will add `[CLS]` automatically.
- ~~In languages without inter-word whitespaces, such as Japanese and Chinese, you should have trained SentencePiece with `--add_dummy_prefix` set to false. With `--add_dummy_prefix` set to true, extra whitespace tokens will appear. This is why A) directly typing `[MASK]` in an input string and B) replacing a token with `[MASK]` after tokenization will yield different token sequences. Therefore, `RobertaJapaneseTokenizer` has a workaround for this problem.~~
- ~~Removed the `do_lower_case` option, because the `do_lower_case` option was not working in your pretraining code.~~
- Enabled the `do_lower_case` option.
The difference between `RobertaModel` and `RobertaJapaneseModel`:
- `position_ids` starts with 0, so it is no longer necessary to explicitly provide `position_ids`.
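For reference, a minimal usage sketch of the new classes. It assumes they follow the standard transformers interface; the import paths are illustrative and the checkpoint name is the one already published on the Hub.

```python
# Illustrative sketch only: class names come from this PR; the import paths
# and from_pretrained behaviour are assumptions.
import torch
from tokenization_roberta_japanese import RobertaJapaneseTokenizer
from modeling_roberta_japanese import RobertaJapaneseModel

tokenizer = RobertaJapaneseTokenizer.from_pretrained("rinna/japanese-roberta-base")
model = RobertaJapaneseModel.from_pretrained("rinna/japanese-roberta-base")

# [CLS] is prepended automatically and position_ids start at 0 inside the
# model, so neither needs to be supplied by hand.
inputs = tokenizer("こんにちは、世界。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```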
For `rinna/japanese-gpt2-*`:
Adding `tokenization_gpt2_japanese.py`.
The difference between `T5Tokenizer` and `GPT2JapaneseTokenizer`:
- ~~In languages without inter-word whitespaces, such as Japanese and Chinese, you should have trained SentencePiece with `--add_dummy_prefix` set to false. With `--add_dummy_prefix` set to true, extra whitespace tokens will appear. Therefore, `GPT2JapaneseTokenizer` has a workaround for this problem.~~
- ~~Removed the `do_lower_case` option, because the `do_lower_case` option was not working in your pretraining code.~~
- Enabled the `do_lower_case` option.
I re-enabled the `do_lower_case` option, because a lot of `<unk>` tokens appear without it (why?).
@azonti Sorry for this late reply and thanks for the PR.
I notice that in the workaround code for the whitespace prefix issue, `add_dummy_prefix` is explicitly set to `False`.
It can sometimes result in different tokenization results when compared with the default `T5Tokenizer`, for example:

```python
text = "お時間ありがとうございます"
a_t5_tokenizer.tokenize(text)   # ['▁お', '時間', 'ありがとう', 'ご', 'ざい', 'ます']
a_new_tokenizer.tokenize(text)  # ['お', '時間', 'ありがとう', 'ご', 'ざい', 'ます']
```
It might be okay if one is using this code for pretraining from scratch. But if someone is using this code to finetune a pretrained model, this behaviour difference will make the tokenization of finetuning data (which eliminates prefix whitespaces) inconsistent with pretraining data (which keeps prefix whitespaces).
Probably we can find a better way than changing `add_dummy_prefix`.
As for the `do_lower_case` problem, it is because the training data of this tokenizer does not contain any upper-cased letters. If we do not enable `do_lower_case` as a preprocessing step, all upper-cased letters in the inputs will result in `<unk>`.
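A small sketch of what happens with the currently published tokenizer; the exact pieces depend on the actual vocabulary and on whether your transformers version applies the tokenizer config, so the outputs in the comments are indicative only.

```python
# Sketch: why do_lower_case matters when the vocabulary was trained on
# lower-cased text only. Outputs in comments are indicative, not verified.
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")

tokenizer.do_lower_case = False
print(tokenizer.tokenize("AI研究"))   # upper-case 'AI' has no matching pieces -> '<unk>'

tokenizer.do_lower_case = True        # lowercase non-special tokens before encoding
print(tokenizer.tokenize("AI研究"))   # 'ai' can be matched by lower-cased pieces
```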
OK, let's get the problems straight.
`add_dummy_prefix` Problem
- In languages without inter-word whitespaces, such as Chinese or Japanese, you should have pretrained with `add_dummy_prefix` disabled. This can no longer be fixed.
- I think they should set `add_dummy_prefix` to disabled for inference or finetuning, even though it will be inconsistent with the pretraining configuration, in order to prevent the insertion of unnecessary whitespaces during tokenization (see the sketch after this list). I think the results will be robust enough, even under the inconsistency.
`do_lower_case` Problem
- Why does the training data of the tokenizer not contain any upper-cased letters? (Was it manually lowercased?)
> In languages without inter-word whitespaces, such as Chinese or Japanese, you should have pretrained with `add_dummy_prefix` disabled. This can no longer be fixed.
I totally agree with it.
> Why does the training data of the tokenizer not contain any upper-cased letters? (Was it manually lowercased?)
I do not remember the exact reason. But looking back at it, it was just a really bad decision.
Regarding the above two points, I think it is better to provide a better pretrained tokenizer (and possibly corresponding new pretrained models). This is going to happen in the near future.
> I think they should set `add_dummy_prefix` to disabled for inference or finetuning, even though it will be inconsistent with the pretraining configuration, in order to prevent the insertion of unnecessary whitespaces during tokenization.
Sorry, but I think keeping consistency is more important than occasionally saving a few steps to handle the `[MASK]` token.
> I think the results will be robust enough, even under the inconsistency.
The downside of using the current tokenizer is completely known (it is the `[MASK]` problem), and we know how to solve it.
But the downside of enforcing `add_dummy_prefix = False` cannot be foreseen. It could cause performance degradation in various downstream tasks. If we are saying "it is going to be robust", I believe we need tests to verify that hypothesis.
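For reference, a minimal sketch of that known workaround, following the style of the current model card usage: keep the published `T5Tokenizer`, place `[MASK]` by replacing a token after tokenization, and pass explicit `position_ids`. The text and masked index here are only illustrative.

```python
# Sketch of the known [MASK] workaround: insert the mask after tokenization
# instead of typing "[MASK]" into the raw string. Masked index is illustrative.
import torch
from transformers import T5Tokenizer, RobertaForMaskedLM

tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")
tokenizer.do_lower_case = True
model = RobertaForMaskedLM.from_pretrained("rinna/japanese-roberta-base")

text = "[CLS]" + "4年に1度オリンピックは開かれる。"
tokens = tokenizer.tokenize(text)
masked_idx = 5
tokens[masked_idx] = tokenizer.mask_token        # replace a token with [MASK]
token_ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.LongTensor([token_ids])
position_ids = torch.arange(len(token_ids)).unsqueeze(0)  # explicit position_ids from 0
with torch.no_grad():
    logits = model(input_ids=input_ids, position_ids=position_ids).logits
print(tokenizer.convert_ids_to_tokens(logits[0, masked_idx].topk(5).indices.tolist()))
```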
OK, I agree with your opinion, so I have removed the workaround for `add_dummy_prefix`.