Riccardo Orlando
The multilingual model `xx_sent_ud_sm` does not tokenize Chinese sentences correctly, while the Chinese model `zh_core_web_sm` does. For example:

```python
import spacy

nlp_ml = spacy.load("xx_sent_ud_sm")
nlp_ml.tokenizer("包括联合国机构和机制提出的有关建议以及现有的外部资料对有关国家进行筹备性研究。")
# ['包括联合国机构和机制提出的有关建议以及现有的外部资料对有关国家进行筹备性研究', '。']

nlp_zh = spacy.load("zh_core_web_sm")
...
```
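To see why a whitespace-based tokenizer behaves this way on Chinese, here is an illustrative sketch (not spaCy's actual implementation): splitting on whitespace and peeling off trailing sentence punctuation leaves a Chinese sentence, which contains no spaces, as one single token plus the final `。`. The helper name `naive_whitespace_tokenize` is hypothetical, chosen for this illustration only.

```python
import re

def naive_whitespace_tokenize(text):
    # Split on whitespace, then separate trailing CJK sentence punctuation.
    # This roughly mimics what a whitespace-based tokenizer produces on
    # Chinese text: the whole sentence survives as a single "token".
    tokens = []
    for chunk in text.split():
        m = re.match(r"^(.*?)([。！？]?)$", chunk)
        body, punct = m.group(1), m.group(2)
        if body:
            tokens.append(body)
        if punct:
            tokens.append(punct)
    return tokens

sent = "包括联合国机构和机制提出的有关建议以及现有的外部资料对有关国家进行筹备性研究。"
print(naive_whitespace_tokenize(sent))
# → ['包括联合国机构和机制提出的有关建议以及现有的外部资料对有关国家进行筹备性研究', '。']
```

A language-specific pipeline like `zh_core_web_sm` instead uses a word segmenter, which is why it splits the sentence into proper words.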
> @Riccorl : This is the expected behavior for the base `xx` tokenizer used in that model, which just doesn't work for languages without whitespace between tokens. It was a...
It seems it doesn't work with version 1.13, though it works with 1.12.
Do you have the CoNLL-2012 dataset?
I have the same problem. Did you solve it?
@sshleifer It seems the problem is not `num_candidates=1`. The model sees a binary classification task when there is only one candidate in the utterance sample.