Huggingface | Cannot Load Vocabulary due to Decoding Problem
I converted an EVA2.0 checkpoint to the Hugging Face format and tried to run src/example.py with PATH_TO_EVA_CHECKPOINT = "/home/.../pytorch_model.bin".
Then I received something like this:
Traceback (most recent call last):
File "example.py", line 23, in <module>
main()
File "example.py", line 9, in main
tokenizer = EVATokenizer.from_pretrained(PATH_CHECKPOINT)
File "/data/g/.conda/envs/eva/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1775, in from_pretrained
return cls._from_pretrained(
File "/data/g/.conda/envs/eva/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1930, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/home/g/EVA/EVA/src/model/tokenization_eva.py", line 118, in __init__
self.encoder = load_vocab(vocab_file)
File "/home/g/EVA/EVA/src/model/tokenization_eva.py", line 44, in load_vocab
line = reader.readline()
File "/data/g/.conda/envs/eva/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte
GBK encoding does not work either.
The converted pytorch_model.bin file looks like this:
$ head pytorch_model.bin
shared.weightqctorch._utilsZZZZZ�}q(X
_rebuild_tensor_v2
q((Xstorageqctorch
HalfStorage
qX0qXcpuq���tqQKM0u�K�q �ccollections
OrderedDict
q
)Rq
tq
Xlm_head.weightqh((hhhh���tqQJ��M0u�qK�q�h
)RqtqRqXncoder.embed_tokens.weightqh((hhhh���tqQKM0u�qK�q�h
)RqtqRqtqQK�qK�q �h
The tokenizer should load from the vocab.txt file, not from the pytorch_model.bin file. You can try downloading the files from the HuggingFace repo and setting PATH_CHECKPOINT to the directory that contains the downloaded files.
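For example, a minimal sketch of what that looks like, assuming the converted files sit together in one directory (the directory name is a placeholder, and the import path is only guessed from the traceback above):

# Assumed directory layout (placeholder path):
#   /home/.../eva2.0-hf/
#       config.json
#       vocab.txt           # tokenizer vocabulary
#       pytorch_model.bin   # converted weights

from model.tokenization_eva import EVATokenizer  # module path taken from the traceback

# Point at the directory, not at the .bin file itself.
PATH_CHECKPOINT = "/home/.../eva2.0-hf"

# from_pretrained then picks up vocab.txt from the directory instead of
# trying to decode the binary pickle in pytorch_model.bin as UTF-8 text,
# which is what raised the UnicodeDecodeError.
tokenizer = EVATokenizer.from_pretrained(PATH_CHECKPOINT)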