cdQA
How to tokenize a Chinese sentence with PyTorch
I feel this output full of UNK symbols is not correct.
Does cdQA support Chinese tokenization?
I have seen that run_squad.py in BERT supports Chinese tokenization with whitespace splitting and a vocab file.
Hi @weinixuehao
Which pre-trained BERT model are you using? Ideally you want to use bert-base-chinese,
which has a Chinese tokenizer included. See https://github.com/huggingface/transformers/blob/94c99db34cf9074a212c36554fb925c513d70ab1/transformers/tokenization_bert.py#L40
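For example, a minimal sketch (not from this thread, assuming the transformers library is installed and using an arbitrary example sentence) of loading the bert-base-chinese tokenizer and checking that it no longer produces UNK tokens:

```python
# Minimal sketch: tokenize a Chinese sentence with the bert-base-chinese vocab.
# Assumes the `transformers` library is installed; the sentence is just an example.
from transformers import BertTokenizer

# The pre-trained tokenizer ships with a Chinese vocabulary, so characters
# should be recognized instead of being mapped to [UNK].
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

sentence = "今天天气很好"  # example: "The weather is nice today"
tokens = tokenizer.tokenize(sentence)
print(tokens)  # Chinese text is split character by character, e.g. ['今', '天', ...]
```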