NLP_related_projects

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 144: invalid continuation byte

Open NoahZhao opened this issue 2 years ago • 1 comments

How can I fix this?

Traceback (most recent call last):
  File "D:/down/NLP_related_projects-master/BERT/Bert_sim/run_similarity.py", line 716, in <module>
    sim = BertSim()
  File "D:/down/NLP_related_projects-master/BERT/Bert_sim/run_similarity.py", line 141, in __init__
    self.tokenizer = tokenization.FullTokenizer(vocab_file=cf.vocab_file, do_lower_case=True)
  File "D:\down\NLP_related_projects-master\BERT\Bert_sim\bert_model\tokenization.py", line 165, in __init__
    self.vocab = load_vocab(vocab_file)
  File "D:\down\NLP_related_projects-master\BERT\Bert_sim\bert_model\tokenization.py", line 127, in load_vocab
    token = convert_to_unicode(reader.readline())
  File "D:\Anaconda3\envs\tf2\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 169, in readline
    self._preread_check()
  File "D:\Anaconda3\envs\tf2\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 79, in _preread_check
    self.__name, 1024 * 512)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 144: invalid continuation byte

Process finished with exit code 1

NoahZhao • Oct 06 '22 12:10

In tokenization.py, change how the vocab file is opened:
with open(vocab_file) as reader:
# with tf.gfile.GFile(vocab_file, 'r') as reader:
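
For reference, a minimal sketch of what load_vocab could look like with that change applied. This is not the repository's code: the `encoding` parameter and the explicit path check are assumptions added for illustration. The suggestion above calls `open(vocab_file)` with no encoding, which uses the platform default; passing the encoding explicitly makes the behaviour the same on every machine. The traceback fails inside `_preread_check`, before any line is read, so it may also be worth confirming that the vocab path exists.

```python
import collections
import os


def load_vocab(vocab_file, encoding="utf-8"):
    """Load a vocabulary file into an ordered token -> index mapping.

    Uses the built-in open() rather than tf.gfile.GFile. The `encoding`
    argument is an assumption for this sketch; if the vocab file is not
    UTF-8, pass its actual encoding instead.
    """
    # Fail early with a readable error if the path is wrong.
    if not os.path.isfile(vocab_file):
        raise FileNotFoundError("vocab file not found: %s" % vocab_file)

    vocab = collections.OrderedDict()
    with open(vocab_file, "r", encoding=encoding) as reader:
        # One token per line; the line number is the token id.
        for index, line in enumerate(reader):
            vocab[line.strip()] = index
    return vocab
```

Calling `load_vocab(cf.vocab_file)` would then either return the mapping or raise a FileNotFoundError that names the path, which is easier to act on than a decode error.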

geway • Nov 01 '22 14:11

Thanks

NoahZhao • Jan 20 '23 14:01