ChatGLM-6B
[BUG/Help] 扩充词表出现AttributeError: 'ChatGLMTokenizer' object has no attribute 'vocab_file'
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
Cell In[12], line 8
6 f.write(chatglm2_spm.SerializeToString())
7 tokenizer = tokenization_chatglm.ChatGLMTokenizer(vocab_file=output_sp_dir+'/tokenizer.model')
----> 8 tokenizer.save_pretrained(output_hf_dir, vocab_file=output_sp_dir+'/tokenizer.model')
9 print(f"Chinese-chatglm2 tokenizer has been saved to {output_hf_dir}")
12 # Test
File c:\ProgramData\Anaconda3\envs\tokenizer\lib\site-packages\transformers\tokenization_utils_base.py:2205, in PreTrainedTokenizerBase.save_pretrained(self, save_directory, legacy_format, filename_prefix, push_to_hub, **kwargs)
   2201 logger.info(f"Special tokens file saved in {special_tokens_map_file}")
   2203 file_names = (tokenizer_config_file, special_tokens_map_file)
-> 2205 save_files = self._save_pretrained(
   2206     save_directory=save_directory,
   2207     file_names=file_names,
   2208     legacy_format=legacy_format,
   2209     filename_prefix=filename_prefix,
   2210 )
   2212 if push_to_hub:
   2213     self._upload_modified_files(
   2214         save_directory,
   2215         repo_id,
        (...)
   2218         token=kwargs.get("use_auth_token"),
   2219     )
...
--> 137 with open(self.vocab_file, 'rb') as fin:
    138     proto_str = fin.read()
    140 with open(vocab_file, "wb") as writer:
Expected Behavior
The tokenizer saves successfully without errors.
Steps To Reproduce
Load a txt dictionary (one word per line) into Python, insert the words into the ChatGLM tokenizer, and then save it; the error occurs on saving.
Environment
- OS: Windows 10 (conda)
- Python: 3.10
- Transformers: 4.30.2
- PyTorch: 2.0.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`): CPU only
Anything else?
No response
After extending the vocabulary, you still need to add the new tokens to the tokenizer (add_tokens) and update the model's embedding layer accordingly; see the Hugging Face tokenizer documentation for details.
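As a rough illustration of that advice, here is a stdlib-only sketch: a toy vocab and embedding table stand in for the real tokenizer and model (with transformers you would instead call `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`):

```python
import random

# Toy vocab and embedding table standing in for the real tokenizer/model.
vocab = {"hello": 0, "world": 1}
dim = 4
embeddings = [[0.0] * dim for _ in vocab]  # one row per token id

def add_tokens(new_tokens):
    """Mimic tokenizer.add_tokens(): register unseen tokens, return how many were added."""
    added = 0
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
            added += 1
    return added

def resize_token_embeddings(new_size):
    """Mimic model.resize_token_embeddings(): grow the table with freshly initialized rows."""
    while len(embeddings) < new_size:
        embeddings.append([random.gauss(0.0, 0.02) for _ in range(dim)])

added = add_tokens(["你好", "世界"])
resize_token_embeddings(len(vocab))  # embedding table must grow to match the new vocab
```

If you skip the resize step, token ids for the new words point past the end of the embedding table, which is why both steps are required.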
When I run cli_demo.py I get the following error: AttributeError: 'ChatGLMTokenizer' object has no attribute 'sp_tokenizer'. Did you mean: '_tokenize'? I reloaded the model but the error persists. What is going on?
Resolved! The problem occurs because self.sp_tokenizer is set after the call to super().__init__(). Specifically, the traceback shows that super().__init__() calls the _add_tokens method in the parent class, which in turn calls self.get_vocab. The get_vocab method is overridden in the subclass ChatGLMTokenizer and uses self.sp_tokenizer, but at that point self.sp_tokenizer has not yet been defined. The solution is to set self.sp_tokenizer before calling super().__init__().
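The initialization-order pitfall can be reproduced with a minimal stdlib example; here Base is a hypothetical stand-in for PreTrainedTokenizer, whose __init__ calls an overridable method:

```python
class Base:
    """Stand-in for PreTrainedTokenizer: its __init__ calls an overridable method."""
    def __init__(self):
        self.get_vocab()

    def get_vocab(self):
        return {}

class BadTokenizer(Base):
    def __init__(self):
        super().__init__()             # get_vocab() runs here, before sp_tokenizer exists
        self.sp_tokenizer = {"<s>": 0}

    def get_vocab(self):
        return self.sp_tokenizer       # AttributeError when called from Base.__init__

class GoodTokenizer(Base):
    def __init__(self):
        self.sp_tokenizer = {"<s>": 0}  # set the attribute first
        super().__init__()

    def get_vocab(self):
        return self.sp_tokenizer

try:
    BadTokenizer()
    bad_raised = False
except AttributeError:
    bad_raised = True
```

Python dispatches overridden methods dynamically even when they are called from the parent constructor, which is why the subclass attribute must exist before super().__init__() runs.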
before

class ChatGLMTokenizer(PreTrainedTokenizer):
    ...
    def __init__(...) -> None:
        super().__init__(...)
        ...
        self.sp_tokenizer = SPTokenizer(vocab_file, num_image_tokens=num_image_tokens)
after

class ChatGLMTokenizer(PreTrainedTokenizer):
    ...
    def __init__(...) -> None:
        self.sp_tokenizer = SPTokenizer(vocab_file, num_image_tokens=num_image_tokens)
        super().__init__(...)
        ...
        # self.sp_tokenizer = SPTokenizer(vocab_file, num_image_tokens=num_image_tokens)
This error does not appear in transformers==4.33.0, but it is raised in the latest version, 4.40.2; the change is related to an update of the PreTrainedTokenizer class.
details in https://zhuanlan.zhihu.com/p/697342575