Huggingface | Cannot Load Vocabulary due to Decoding Problem
I converted an EVA2.0 checkpoint to the Hugging Face format and tried to run src/example.py with PATH_TO_EVA_CHECKPOINT = "/home/.../pytorch_model.bin".
Then I received something like this:
Traceback (most recent call last):
File "example.py", line 23, in <module>
main()
File "example.py", line 9, in main
tokenizer = EVATokenizer.from_pretrained(PATH_CHECKPOINT)
File "/data/g/.conda/envs/eva/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1775, in from_pretrained
return cls._from_pretrained(
File "/data/g/.conda/envs/eva/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1930, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/home/g/EVA/EVA/src/model/tokenization_eva.py", line 118, in __init__
self.encoder = load_vocab(vocab_file)
File "/home/g/EVA/EVA/src/model/tokenization_eva.py", line 44, in load_vocab
line = reader.readline()
File "/data/g/.conda/envs/eva/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte
GBK encoding does not work either.
The converted pytorch_model.bin file looks like this:
$ head pytorch_model.bin
shared.weightqctorch._utilsZZZZZ�}q(X
_rebuild_tensor_v2
q((Xstorageqctorch
HalfStorage
qX0qXcpuq���tqQKM0u�K�q �ccollections
OrderedDict
q
)Rq
tq
Xlm_head.weightqh((hhhh���tqQJ��M0u�qK�q�h
)RqtqRqXncoder.embed_tokens.weightqh((hhhh���tqQKM0u�qK�q�h
)RqtqRqtqQK�qK�q �h
The tokenizer should load from the vocab.txt file, not from the pytorch_model.bin file. You can try downloading the files from the HuggingFace repo and setting PATH_CHECKPOINT to the directory that contains the downloaded files.
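For example, a minimal sketch of what that looks like, assuming the converted files sit together in one directory (the directory name is a placeholder, and the import path is only guessed from the traceback above):

# Assumed directory layout (placeholder path):
#   /home/.../eva2.0-hf/
#       config.json
#       vocab.txt           # tokenizer vocabulary
#       pytorch_model.bin   # converted weights

from model.tokenization_eva import EVATokenizer  # module path taken from the traceback

# Point at the directory, not at the .bin file itself.
PATH_CHECKPOINT = "/home/.../eva2.0-hf"

# from_pretrained then picks up vocab.txt from the directory instead of
# trying to decode the binary pickle in pytorch_model.bin as UTF-8 text,
# which is what raised the UnicodeDecodeError.
tokenizer = EVATokenizer.from_pretrained(PATH_CHECKPOINT)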