pytorch_bert_japanese

Add is_tokenized param to be able to optionally skip the tokenizing process

Open Lyuji282 opened this issue 5 years ago • 2 comments

A developer may want to separate the tokenizing process from the step that computes embeddings, so I implemented an is_tokenized flag.
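For illustration, a minimal sketch of what such a flag could look like; the class and method names below are simplified, hypothetical stand-ins rather than the repository's actual implementation:

```python
from typing import List, Union


class SentenceEncoder:
    """Hypothetical, simplified stand-in for the repository's BERT wrapper."""

    def __init__(self, tokenizer, model):
        self.tokenizer = tokenizer  # e.g. a Juman++-based tokenizer
        self.model = model          # e.g. a BERT model mapping tokens to a vector

    def get_sentence_embedding(self, text: Union[str, List[str]], is_tokenized: bool = False):
        # With is_tokenized=True the caller passes a list of tokens produced
        # elsewhere (for example on a dedicated tokenization server), so the
        # local tokenization step is skipped.
        tokens = text if is_tokenized else self.tokenizer.tokenize(text)
        return self.model.embed(tokens)
```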

Lyuji282 · Jun 10, 2019

Thank you for your PR. This code is optimized for the "BERT日本語Pretrainedモデル" (a Japanese pretrained BERT model), which is trained with Juman++ and is not supposed to be used with other tokenizers. I also think the is_tokenized param is not a good idea, because the text argument is originally typed as a string but would need to be a list when is_tokenized is True.
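To make the type concern concrete, a hedged sketch of the two signatures being contrasted (hypothetical names, not taken from the repository):

```python
from typing import List, Union


# Current style of interface (assumed): the input is always a raw string
# that gets tokenized internally with Juman++.
def get_sentence_embedding(text: str):
    ...


# With the proposed flag, the same parameter must accept either a string
# or a list of tokens, which is the type inconsistency described above.
def get_sentence_embedding_with_flag(text: Union[str, List[str]], is_tokenized: bool = False):
    ...
```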

yagays · Jun 10, 2019

Thank you for the reply. I understand that the tokenizer is fixed for a given BERT pretrained model. I just want to be able to separate the tokenization server from the server that applies the model. Certainly, accepting two different argument types is not a good idea; however, a list argument would be better, as in bert-as-service.
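A rough sketch of the list-based alternative (hypothetical names, loosely modeled on the spirit of bert-as-service's client rather than on this repository's code):

```python
from typing import List, Sequence


class RemoteStyleEncoder:
    """Hypothetical sketch of a list-only interface, in the spirit of bert-as-service."""

    def __init__(self, model):
        self.model = model  # e.g. a BERT model mapping a token list to a vector

    def encode(self, batch: Sequence[List[str]]) -> List[object]:
        # The caller always sends pre-tokenized sentences (a list of token lists),
        # so tokenization can run on a separate server and the embedding server
        # never has to branch on an is_tokenized flag or guess the argument type.
        return [self.model.embed(tokens) for tokens in batch]
```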

Lyuji282 · Jun 11, 2019