ChatGLM-6B
[BUG/Help] 扩充词表出现AttributeError: 'ChatGLMTokenizer' object has no attribute 'vocab_file'
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
Cell In[12], line 8
6 f.write(chatglm2_spm.SerializeToString())
7 tokenizer = tokenization_chatglm.ChatGLMTokenizer(vocab_file=output_sp_dir+'/tokenizer.model')
----> 8 tokenizer.save_pretrained(output_hf_dir, vocab_file=output_sp_dir+'/tokenizer.model')
9 print(f"Chinese-chatglm2 tokenizer has been saved to {output_hf_dir}")
12 # Test
File c:\ProgramData\Anaconda3\envs\tokenizer\lib\site-packages\transformers\tokenization_utils_base.py:2205, in PreTrainedTokenizerBase.save_pretrained(self, save_directory, legacy_format, filename_prefix, push_to_hub, **kwargs)
   2201 logger.info(f"Special tokens file saved in {special_tokens_map_file}")
   2203 file_names = (tokenizer_config_file, special_tokens_map_file)
-> 2205 save_files = self._save_pretrained(
   2206     save_directory=save_directory,
   2207     file_names=file_names,
   2208     legacy_format=legacy_format,
   2209     filename_prefix=filename_prefix,
   2210 )
   2212 if push_to_hub:
   2213     self._upload_modified_files(
   2214         save_directory,
   2215         repo_id,
        (...)
   2218         token=kwargs.get("use_auth_token"),
   2219     )
...
--> 137 with open(self.vocab_file, 'rb') as fin:
    138     proto_str = fin.read()
    140 with open(vocab_file, "wb") as writer:
Expected Behavior
The tokenizer saves successfully without errors.
Steps To Reproduce
Load a txt dictionary (one word per line) into Python, insert the words into the ChatGLM tokenizer, and then save it; the error occurs on saving.
Environment
- OS: Windows 10 (conda)
- Python: 3.10
- Transformers: 4.30.2
- PyTorch: 2.0.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`): CPU only
Anything else?
No response
After extending the vocabulary, you still need to add the new tokens to the tokenizer (add_tokens) and update the model's embedding layer accordingly; see the Hugging Face tokenizer documentation for details.
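As a rough illustration of that advice, here is a stdlib-only sketch: a toy vocab and embedding table stand in for the real tokenizer and model (with transformers you would instead call `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`):

```python
import random

# Toy vocab and embedding table standing in for the real tokenizer/model.
vocab = {"hello": 0, "world": 1}
dim = 4
embeddings = [[0.0] * dim for _ in vocab]  # one row per token id

def add_tokens(new_tokens):
    """Mimic tokenizer.add_tokens(): register unseen tokens, return how many were added."""
    added = 0
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
            added += 1
    return added

def resize_token_embeddings(new_size):
    """Mimic model.resize_token_embeddings(): grow the table with freshly initialized rows."""
    while len(embeddings) < new_size:
        embeddings.append([random.gauss(0.0, 0.02) for _ in range(dim)])

added = add_tokens(["你好", "世界"])
resize_token_embeddings(len(vocab))  # embedding table must grow to match the new vocab
```

If you skip the resize step, token ids for the new words point past the end of the embedding table, which is why both steps are required.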
When I run cli_demo.py I get the following error: AttributeError: 'ChatGLMTokenizer' object has no attribute 'sp_tokenizer'. Did you mean: '_tokenize'? I reloaded the model but the error persists. What is going on?
Resolved! The problem occurs because self.sp_tokenizer is set after the call to super().__init__(). Specifically, the traceback shows that super().__init__() calls the _add_tokens method in the parent class, which in turn calls self.get_vocab. The get_vocab method is overridden in the subclass ChatGLMTokenizer and uses self.sp_tokenizer, but at that point self.sp_tokenizer has not yet been defined. The solution is to set self.sp_tokenizer before calling super().__init__().
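The initialization-order pitfall can be reproduced with a minimal stdlib example; here Base is a hypothetical stand-in for PreTrainedTokenizer, whose __init__ calls an overridable method:

```python
class Base:
    """Stand-in for PreTrainedTokenizer: its __init__ calls an overridable method."""
    def __init__(self):
        self.get_vocab()

    def get_vocab(self):
        return {}

class BadTokenizer(Base):
    def __init__(self):
        super().__init__()             # get_vocab() runs here, before sp_tokenizer exists
        self.sp_tokenizer = {"<s>": 0}

    def get_vocab(self):
        return self.sp_tokenizer       # AttributeError when called from Base.__init__

class GoodTokenizer(Base):
    def __init__(self):
        self.sp_tokenizer = {"<s>": 0}  # set the attribute first
        super().__init__()

    def get_vocab(self):
        return self.sp_tokenizer

try:
    BadTokenizer()
    bad_raised = False
except AttributeError:
    bad_raised = True
```

Python dispatches overridden methods dynamically even when they are called from the parent constructor, which is why the subclass attribute must exist before super().__init__() runs.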
before

class ChatGLMTokenizer(PreTrainedTokenizer):
    ...
    def __init__(...) -> None:
        super().__init__(...)
        ...
        self.sp_tokenizer = SPTokenizer(vocab_file, num_image_tokens=num_image_tokens)
after

class ChatGLMTokenizer(PreTrainedTokenizer):
    ...
    def __init__(...) -> None:
        self.sp_tokenizer = SPTokenizer(vocab_file, num_image_tokens=num_image_tokens)
        super().__init__(...)
        ...
        # self.sp_tokenizer = SPTokenizer(vocab_file, num_image_tokens=num_image_tokens)
This error does not appear in transformers==4.33.0, but it is raised in the latest version, 4.40.2; the change is related to an update of the PreTrainedTokenizer class.
details in https://zhuanlan.zhihu.com/p/697342575