Has anyone managed to train successfully?
System Info / 系統信息
I have tried Transformers 4.43, 4.44 and 4.33 and replaced modeling_chatglm.py, but running the final .sh script still fails with an error similar to what others have reported. I would suggest that the team document the training procedure in more detail.
Who can help? / 谁可以帮助到您?
。
Information / 问题信息
- [X] The official example scripts / 官方的示例脚本
- [ ] My own modified scripts / 我自己修改的脚本和任务
Reproduction / 复现过程
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.735379934310913 seconds
Traceback (most recent call last):
File "/root/AI4E/ljc/LongWriter/train/main.py", line 130, in
Expected behavior / 期待表现
。
How about setting stage3_prefetch_bucket_size to 15099494 in the DeepSpeed config?
> How about setting stage3_prefetch_bucket_size to 15099494 in the DeepSpeed config?

That gets past it, but then a new error appears: RuntimeError: shape '[32768, -1, 1, 32, 2]' is invalid for input of size 524288
Could the maintainers please take a look at the tokenizer?
I have already tried everything in the official instructions; my current setup is:
- transformers==4.33.0
- pytorch==2.2.0
- /patch/modeling_chatglm.py has replaced /root/AI4E/share/glm-4-9b/modeling_chatglm.py

At runtime I still get a KeyError: '<|endoftext|>', so I believe the problem is in the tokenizer.
Loading the tokenizer by itself is enough to reproduce it:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

path = "/root/AI4E/share/glm-4-9b"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
```
```
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[6], line 2
      1 path = "/root/AI4E/share/glm-4-9b"
----> 2 tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

File ~/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py:723, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    721 if os.path.isdir(pretrained_model_name_or_path):
    722     tokenizer_class.register_for_auto_class()
--> 723 return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    724 elif config_tokenizer_class is not None:
    725     tokenizer_class = None

File ~/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1854, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, *init_inputs, **kwargs)
   1851 else:
   1852     logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 1854 return cls._from_pretrained(
   1855     resolved_vocab_files,
   1856     pretrained_model_name_or_path,
   1857     init_configuration,
   1858     *init_inputs,
   1859     token=token,
   1860     cache_dir=cache_dir,
   1861     local_files_only=local_files_only,
   1862     _commit_hash=commit_hash,
   1863     _is_local=is_local,
   1864     **kwargs,
   1865 )

File ~/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2090, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, *init_inputs, **kwargs)
   2087 tokenizer.add_tokens(tokens, special_tokens=is_last_special)
   2089 # Check all our special tokens are registered as "no split" token (we don't cut them) and are in the vocab
-> 2090 added_tokens = tokenizer.sanitize_special_tokens()
   2091 if added_tokens:
   2092     logger.warning_advice(
   2093         "Special tokens have been added in the vocabulary, make sure the associated word embeddings are"
   2094         " fine-tuned or trained."
   2095     )

File ~/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:861, in SpecialTokensMixin.sanitize_special_tokens(self)
    851 def sanitize_special_tokens(self) -> int:
    852     """
    853     Make sure that all the special tokens attributes of the tokenizer (tokenizer.mask_token,
    854     tokenizer.cls_token, etc.) are in the vocabulary.
    (...)
    859         int: The number of tokens added in the vocabulary during the operation.
    860     """
--> 861     return self.add_tokens(self.all_special_tokens_extended, special_tokens=True)

File ~/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1004, in SpecialTokensMixin.add_tokens(self, new_tokens, special_tokens)
   1001 if not isinstance(new_tokens, (list, tuple)):
   1002     new_tokens = [new_tokens]
-> 1004 return self._add_tokens(new_tokens, special_tokens=special_tokens)

File ~/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils.py:421, in PreTrainedTokenizer._add_tokens(self, new_tokens, special_tokens)
    417 if not special_tokens and hasattr(self, "do_lower_case") and self.do_lower_case:
    418     token = token.lower()
    419 if (
    420     token != self.unk_token
--> 421     and self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token)
    422     and token not in tokens_to_add
    423 ):
    424     tokens_to_add.append(token)
    425     if self.verbose:

File ~/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils.py:582, in PreTrainedTokenizer.convert_tokens_to_ids(self, tokens)
    579     return None
    581 if isinstance(tokens, str):
--> 582     return self._convert_token_to_id_with_added_voc(tokens)
    584 ids = []
    585 for token in tokens:

File ~/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils.py:595, in PreTrainedTokenizer._convert_token_to_id_with_added_voc(self, token)
    593 if token in self.added_tokens_encoder:
    594     return self.added_tokens_encoder[token]
--> 595 return self._convert_token_to_id(token)

File ~/.cache/huggingface/modules/transformers_modules/glm-4-9b/tokenization_chatglm.py:96, in ChatGLM4Tokenizer._convert_token_to_id(self, token)
     94 def _convert_token_to_id(self, token):
     95     """ Converts a token (str) in an id using the vocab. """
---> 96     return self.mergeable_ranks[token]

KeyError: '<|endoftext|>'
```
Attaching the error output from running `./scripts/glm4_longwriter.sh`:
KeyError: '<|endoftext|>'
Using unk_token, but it is not set yet.
Traceback (most recent call last):
File "/root/AI4E/ljc/LongWriter/train/main.py", line 139, in
> How about setting stage3_prefetch_bucket_size to 15099494 in the DeepSpeed config?
>
> That gets past it, but then a new error appears: RuntimeError: shape '[32768, -1, 1, 32, 2]' is invalid for input of size 524288

I ran into exactly the same error as you:

```
Traceback of TorchScript (most recent call last):
  File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b/modeling_chatglm.py", line 145, in apply_rotary_pos_emb
    rope_cache = rope_cache[:sq]
    xshaped = x.reshape(sq, -1, np, rot_dim // 2, 2)
    rope_cache = rope_cache.view(sq, -1, 1, xshaped.size(3), 2)
    x_out2 = torch.stack(
        [
RuntimeError: shape '[32768, -1, 1, 32, 2]' is invalid for input of size 524288
```
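A back-of-the-envelope reading of those numbers (my own interpretation, not stated anywhere in this thread): a view of shape [32768, -1, 1, 32, 2] needs a multiple of 32768 * 32 * 2 elements, but 524288 elements correspond to only 8192 positions, i.e. the rotary cache being sliced covers far fewer positions than the 32768-token packed sequence.

```python
# Sanity-check the numbers in the RuntimeError above.
sq = 32768                 # packed sequence length used for training
per_position = 32 * 2      # last two dims of the target view [sq, -1, 1, 32, 2]

print(524288 // per_position)   # 8192    -> positions actually present in rope_cache
print(sq * per_position)        # 2097152 -> minimum element count needed for 32768 positions
```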
Attaching the error output from running `./scripts/glm4_longwriter.sh`:

```
KeyError: '<|endoftext|>'
Using unk_token, but it is not set yet.
Traceback (most recent call last):
  File "/root/AI4E/ljc/LongWriter/train/main.py", line 139, in <module>
    train()
  File "/root/AI4E/ljc/LongWriter/train/main.py", line 121, in train
    tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 723, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 1854, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2090, in _from_pretrained
    added_tokens = tokenizer.sanitize_special_tokens()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 861, in sanitize_special_tokens
    return self.add_tokens(self.all_special_tokens_extended, special_tokens=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 1004, in add_tokens
    return self._add_tokens(new_tokens, special_tokens=special_tokens)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 421, in _add_tokens
    and self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 582, in convert_tokens_to_ids
    return self._convert_token_to_id_with_added_voc(tokens)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 595, in _convert_token_to_id_with_added_voc
    return self._convert_token_to_id(token)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b/tokenization_chatglm.py", line 96, in _convert_token_to_id
    return self.mergeable_ranks[token]
           ~~~~~~~~~~~~~~~~~~~~^^^^^^^
KeyError: '<|endoftext|>'
[2024-08-29 07:53:56,997] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 528556
[2024-08-29 07:53:56,997] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 528557
[2024-08-29 07:53:57,347] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 528558
[2024-08-29 07:53:58,671] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 528559
[2024-08-29 07:53:58,689] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 528560
[2024-08-29 07:53:58,698] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 528561
[2024-08-29 07:53:58,706] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 528562
[2024-08-29 07:53:58,720] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 528563
[2024-08-29 07:53:58,732] [ERROR] [launch.py:325:sigkill_handler] ['/root/anaconda3/envs/glm-4-copy/bin/python', '-u', 'main.py', '--local_rank=7', '--model_name_or_path', '/root/AI4E/share/glm-4-9b', '--train_file', './data/glm4/longwriter', '--output_dir', './output/glm4/longwriter', '--num_train_epochs', '4', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--save_strategy', 'steps', '--save_steps', '400', '--save_total_limit', '10', '--preprocessing_num_workers', '64', '--learning_rate', '1e-5', '--weight_decay', '0.1', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_dir', './logs/', '--deepspeed', 'ds_config/stage3.json', '--bf16', '--gradient_checkpointing', '1', '--adam_beta1', '0.9', '--adam_beta2', '0.95', '--report_to', 'wandb', '--run_name', 'glm4_longwriter', '--logging_steps', '1', '--batch_method', 'pack', '--pack_loss'] exits with return code = 1
```
Hi, please use the tokenizer code from LongWriter-glm4-9b; the current training code does not support the tokenizer shipped with the latest GLM-4-9b.
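A minimal sketch of that suggestion, assuming the tokenizer is pulled from the THUDM/LongWriter-glm4-9b repo; the repo id in the snippet and the final check are assumptions, not part of the reply above:

```python
from transformers import AutoTokenizer

# Assumption: load the tokenizer from the LongWriter-glm4-9b repo (or from a local
# copy of its tokenizer files placed next to the glm-4-9b weights), rather than
# from the plain glm-4-9b checkpoint whose tokenization_chatglm.py raises the KeyError.
tokenizer = AutoTokenizer.from_pretrained(
    "THUDM/LongWriter-glm4-9b",  # assumed repo id; substitute a local path if needed
    trust_remote_code=True,
)

# If the right tokenizer code is picked up, the special token should resolve cleanly.
print(tokenizer.convert_tokens_to_ids("<|endoftext|>"))
```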
Is `RuntimeError: shape '[32768, -1, 1, 32, 2]' is invalid for input of size 524288` the same issue you are referring to?
> RuntimeError: shape '[32768, -1, 1, 32, 2]' is invalid for input of size 524288
I am now getting this error too.
There are now two types of errors.

System environment:
- python==3.11.9
- transformers==4.33.0
- pytorch==2.2.0
- modeling_chatglm.py and tokenization_chatglm.py under the /glm-4-9b directory have both been replaced

Error 1: RuntimeError: shape '[32768, -1, 1, 32, 2]' is invalid for input of size 524288

With `"stage3_prefetch_bucket_size": 15099494` set in /ds_config/stage3.json, the run gets all the way to the wandb screen, but the following error is raised as soon as training starts:
```
        ^^^^^^^^
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/trainer.py", line 2679, in training_step
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b/modeling_chatglm.py", line 146, in apply_rotary_pos_emb
    rope_cache = rope_cache[:sq]
    xshaped = x.reshape(sq, -1, np, rot_dim // 2, 2)
    rope_cache = rope_cache.view(sq, -1, 1, xshaped.size(3), 2)
                 ~~~~~~~~~~~~~~~ <--- HERE
    x_out2 = torch.stack(
        [
RuntimeError: shape '[32768, -1, 1, 32, 2]' is invalid for input of size 524288
```
Error 2: Input should be a valid integer, got a number with a fractional part

With `"stage3_prefetch_bucket_size": "auto"` set in /ds_config/stage3.json instead, the following error is raised before wandb even comes up:
File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/deepspeed/runtime/config.py", line 817, in _initialize_params
self.zero_config = get_zero_config(param_dict)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/deepspeed/runtime/zero/config.py", line 71, in get_zero_config
return DeepSpeedZeroConfig(**zero_config_dict)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/deepspeed/runtime/config_utils.py", line 57, in __init__
super().__init__(**data)
File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/pydantic/main.py", line 193, in __init__
self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for DeepSpeedZeroConfig
stage3_prefetch_bucket_size
Input should be a valid integer, got a number with a fractional part [type=int_from_float, input_value=15099494.4, input_type=float]
For further information visit https://errors.pydantic.dev/2.8/v/int_from_floa
> Hi, please use the tokenizer code from LongWriter-glm4-9b; the current training code does not support the tokenizer shipped with the latest GLM-4-9b.
Hi, is there a solution to this yet? I would still like to run the training.
Hi, judging from the error message, the code is still running the original glm-4-9b tokenization_chatglm.py, not the tokenization_chatglm.py from LongWriter-glm4-9b. Please check whether trust_remote_code=True is passed when the model and tokenizer are loaded in main.py.
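A small sketch of that check, using the checkpoint path already mentioned in this thread; once the load succeeds it prints which tokenization_chatglm.py the class was actually imported from (if the load still fails, the last frame of the traceback shows the same file path):

```python
import inspect
from transformers import AutoTokenizer

path = "/root/AI4E/share/glm-4-9b"  # local checkpoint used by the training script

# trust_remote_code=True is what allows the checkpoint's own
# tokenization_chatglm.py to be used instead of a built-in tokenizer class.
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

print(type(tokenizer).__name__)          # e.g. ChatGLM4Tokenizer
print(inspect.getfile(type(tokenizer)))  # which tokenization_chatglm.py was actually loaded
```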
> Error 2: Input should be a valid integer, got a number with a fractional part
For error 2, please change `"stage3_prefetch_bucket_size": "auto"` to 15099494.
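For context on where that number comes from: with `"auto"`, the Hugging Face Trainer's DeepSpeed integration resolves this field to 0.9 * hidden_size * hidden_size, which for GLM-4-9b (hidden_size 4096) is 15099494.4, matching the fractional value rejected in the error above. A minimal sketch of the relevant part of ds_config/stage3.json with the value pinned to an integer; the surrounding keys are illustrative, not the repo's exact config:

```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_prefetch_bucket_size": 15099494
  }
}
```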
Judging from https://github.com/hiyouga/LLaMA-Factory/issues/5252, the error with `"stage3_prefetch_bucket_size": "auto"` can also be resolved by downgrading DeepSpeed; try `pip install deepspeed==0.14.4`.
@LYCnight @badarrrr Please check whether the FAQ in our README resolves the problems you ran into. Sorry for the long wait.
Thank you very much! I have trained successfully now; I'm sharing some notes here: https://github.com/THUDM/LongWriter/issues/25