
How to tokenize a Chinese sentence with PyTorch

Open weinixuehao opened this issue 5 years ago • 1 comments

I think the output in the screenshots, which contains the UNK symbol, is not correct. Does cdQA support Chinese tokenization?

I have seen that run_squad.py in BERT supports Chinese tokenization using whitespace and the vocab.

weinixuehao, Dec 11 '19 02:12

Hi @weinixuehao

Which pre-trained BERT model are you using? Ideally you want to use bert-base-chinese, which includes a Chinese tokenizer. See https://github.com/huggingface/transformers/blob/94c99db34cf9074a212c36554fb925c513d70ab1/transformers/tokenization_bert.py#L40
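
For illustration only, here is a rough sketch of why the choice of vocabulary matters. It assumes the UNK output comes from a vocab that lacks Chinese characters, which is one plausible cause and is not confirmed in this thread. The function names and the two toy vocabularies are hypothetical, and the WordPiece sub-word step of the real BERT tokenizer is left out; this is not the actual transformers or cdQA code.

```python
# Simplified sketch of BERT-style handling of Chinese text.
# Assumption: names and vocabularies below are made up for illustration.


def is_cjk_char(ch):
    """Return True if ch is in the main CJK Unified Ideographs block."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


def split_cjk(text):
    """Pad every CJK character with spaces, then split on whitespace."""
    padded = "".join(f" {ch} " if is_cjk_char(ch) else ch for ch in text)
    return padded.split()


def tokenize(text, vocab, unk_token="[UNK]"):
    """Keep each token that is in vocab; replace the rest with unk_token."""
    return [tok if tok in vocab else unk_token for tok in split_cjk(text)]


# Hypothetical toy vocabularies: one without Chinese characters, one with.
english_vocab = {"hello", "world"}
chinese_vocab = {"你", "好", "世", "界"}

print(tokenize("你好 world", english_vocab))
print(tokenize("你好世界", chinese_vocab))
```

With the toy English-only vocab, every Chinese character falls back to the unknown token, while the vocab that contains the characters returns them one by one.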

fmikaelian, Dec 22 '19 15:12