LongWriter

Has anyone managed to train this successfully?

Open LYCnight opened this issue 1 year ago • 15 comments

System Info

I tried Transformers 4.43, 4.44, and 4.33, and also replaced modeling_chatglm.py, but running the final .sh script still fails with an error similar to what others have reported. I'd suggest the maintainers document the training procedure in more detail.

Who can help?

Information

  • [X] The official example scripts
  • [ ] My own modified scripts and tasks

Reproduction

Loading extension module cpu_adam...
Time to load cpu_adam op: 2.735379934310913 seconds
Traceback (most recent call last):
  File "/root/AI4E/ljc/LongWriter/train/main.py", line 130, in <module>
    train()
  File "/root/AI4E/ljc/LongWriter/train/main.py", line 126, in train
    trainer.train(resume_from_checkpoint=False)
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.10/site-packages/transformers/trainer.py", line 2095, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.10/site-packages/accelerate/accelerator.py", line 1303, in prepare
    result = self._prepare_deepspeed(*args)
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.10/site-packages/accelerate/accelerator.py", line 1779, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.10/site-packages/deepspeed/__init__.py", line 179, in initialize
    config_class = DeepSpeedConfig(config, mpu, mesh_device=mesh_device)
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 797, in __init__
    self._initialize_params(copy.copy(self._param_dict))
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 817, in _initialize_params
    self.zero_config = get_zero_config(param_dict)
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.10/site-packages/deepspeed/runtime/zero/config.py", line 71, in get_zero_config
    return DeepSpeedZeroConfig(**zero_config_dict)
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.10/site-packages/deepspeed/runtime/config_utils.py", line 57, in __init__
    super().__init__(**data)
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.10/site-packages/pydantic/main.py", line 193, in __init__
    self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for DeepSpeedZeroConfig
stage3_prefetch_bucket_size
  Input should be a valid integer, got a number with a fractional part [type=int_from_float, input_value=15099494.4, input_type=float]
    For further information visit https://errors.pydantic.dev/2.8/v/int_from_float
[2024-08-28 12:38:44,068] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 282936
[2024-08-28 12:38:44,901] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 282937
[2024-08-28 12:38:46,425] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 282938
[2024-08-28 12:38:46,443] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 282939
[2024-08-28 12:38:46,452] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 282940
[2024-08-28 12:38:46,460] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 282941
[2024-08-28 12:38:46,460] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 282942
[2024-08-28 12:38:46,469] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 282943
[2024-08-28 12:38:46,478] [ERROR] [launch.py:325:sigkill_handler] ['/root/anaconda3/envs/glm-4-copy/bin/python', '-u', 'main.py', '--local_rank=7', '--model_name_or_path', '/root/AI4E/share/glm-4-9b', '--train_file', './data/glm4/longwriter', '--output_dir', './output/glm4/longwriter', '--num_train_epochs', '4', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--save_strategy', 'steps', '--save_steps', '400', '--save_total_limit', '10', '--preprocessing_num_workers', '64', '--learning_rate', '1e-5', '--weight_decay', '0.1', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_dir', './logs/', '--deepspeed', 'ds_config/stage3.json', '--bf16', '--gradient_checkpointing', '1', '--adam_beta1', '0.9', '--adam_beta2', '0.95', '--report_to', 'wandb', '--run_name', 'glm4_longwriter', '--logging_steps', '1', '--batch_method', 'pack', '--pack_loss'] exits with return code = 1

Expected behavior

LYCnight · Aug 28 '24 12:08

Could you try setting stage3_prefetch_bucket_size to 15099494 in the DeepSpeed config?
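For reference, a minimal sketch of that change (assuming the usual zero_optimization nesting in ds_config/stage3.json; the fractional 15099494.4 in the error is 0.9 × 4096², which appears to be the value Transformers substitutes for "auto" given GLM-4-9b's hidden size, and pydantic-validated DeepSpeed configs only accept an integer here):

```python
# Sketch only: pin stage3_prefetch_bucket_size to an integer in stage3.json.
import json

path = "ds_config/stage3.json"
with open(path) as f:
    cfg = json.load(f)

# 0.9 * 4096**2 = 15099494.4 -> round down to an int that pydantic accepts.
cfg["zero_optimization"]["stage3_prefetch_bucket_size"] = 15099494
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
```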

bys0318 · Aug 28 '24 12:08

Could you try setting stage3_prefetch_bucket_size to 15099494 in the DeepSpeed config?

That works, but then a new error appears: RuntimeError: shape '[32768, -1, 1, 32, 2]' is invalid for input of size 524288

LYCnight · Aug 29 '24 07:08

Could the maintainers take a look at the tokenizer?

I've tried everything in the official instructions. My current situation:

  • transformers==4.33.0
  • pytorch==2.2.0
  • /patch/modeling_chatglm.py has replaced /root/AI4E/share/glm-4-9b/modeling_chatglm.py, but running still raises KeyError: '<|endoftext|>', so I believe this is a tokenizer problem.

Could the maintainers take a look at the tokenizer? Minimal reproduction:

    from transformers import AutoTokenizer, AutoModelForCausalLM
    import torch

    path = "/root/AI4E/share/glm-4-9b"
    tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[6], line 2
      1 path = "/root/AI4E/share/glm-4-9b"
----> 2 tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

File ~/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py:723, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    721 if os.path.isdir(pretrained_model_name_or_path):
    722     tokenizer_class.register_for_auto_class()
--> 723 return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    724 elif config_tokenizer_class is not None:
    725     tokenizer_class = None

File ~/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1854, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, *init_inputs, **kwargs)
   1851 else:
   1852     logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 1854 return cls._from_pretrained(
   1855     resolved_vocab_files,
   1856     pretrained_model_name_or_path,
   1857     init_configuration,
   1858     *init_inputs,
   1859     token=token,
   1860     cache_dir=cache_dir,
   1861     local_files_only=local_files_only,
   1862     _commit_hash=commit_hash,
   1863     _is_local=is_local,
   1864     **kwargs,
   1865 )

File ~/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2090, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, *init_inputs, **kwargs)
   2087 tokenizer.add_tokens(tokens, special_tokens=is_last_special)
   2089 # Check all our special tokens are registered as "no split" token (we don't cut them) and are in the vocab
-> 2090 added_tokens = tokenizer.sanitize_special_tokens()
   2091 if added_tokens:
   2092     logger.warning_advice(
   2093         "Special tokens have been added in the vocabulary, make sure the associated word embeddings are"
   2094         " fine-tuned or trained."
   2095     )

File ~/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:861, in SpecialTokensMixin.sanitize_special_tokens(self)
    851 def sanitize_special_tokens(self) -> int:
    852     """
    853     Make sure that all the special tokens attributes of the tokenizer (tokenizer.mask_token,
    854     tokenizer.cls_token, etc.) are in the vocabulary.
    (...)
    859         int: The number of tokens added in the vocabulary during the operation.
    860     """
--> 861 return self.add_tokens(self.all_special_tokens_extended, special_tokens=True)

File ~/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1004, in SpecialTokensMixin.add_tokens(self, new_tokens, special_tokens)
   1001 if not isinstance(new_tokens, (list, tuple)):
   1002     new_tokens = [new_tokens]
-> 1004 return self._add_tokens(new_tokens, special_tokens=special_tokens)

File ~/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils.py:421, in PreTrainedTokenizer._add_tokens(self, new_tokens, special_tokens)
    417 if not special_tokens and hasattr(self, "do_lower_case") and self.do_lower_case:
    418     token = token.lower()
    419 if (
    420     token != self.unk_token
--> 421     and self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token)
    422     and token not in tokens_to_add
    423 ):
    424     tokens_to_add.append(token)
    425     if self.verbose:

File ~/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils.py:582, in PreTrainedTokenizer.convert_tokens_to_ids(self, tokens)
    579     return None
    581 if isinstance(tokens, str):
--> 582     return self._convert_token_to_id_with_added_voc(tokens)
    584 ids = []
    585 for token in tokens:

File ~/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils.py:595, in PreTrainedTokenizer._convert_token_to_id_with_added_voc(self, token)
    593 if token in self.added_tokens_encoder:
    594     return self.added_tokens_encoder[token]
--> 595 return self._convert_token_to_id(token)

File ~/.cache/huggingface/modules/transformers_modules/glm-4-9b/tokenization_chatglm.py:96, in ChatGLM4Tokenizer._convert_token_to_id(self, token)
     94 def _convert_token_to_id(self, token):
     95     """ Converts a token (str) in an id using the vocab. """
---> 96     return self.mergeable_ranks[token]

KeyError: '<|endoftext|>'

LYCnight · Aug 29 '24 07:08

Attaching the error output from running ./scripts/glm4_longwriter.sh:

KeyError: '<|endoftext|>'
Using unk_token, but it is not set yet.
Traceback (most recent call last):
  File "/root/AI4E/ljc/LongWriter/train/main.py", line 139, in <module>
    train()
  File "/root/AI4E/ljc/LongWriter/train/main.py", line 121, in train
    tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 723, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 1854, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2090, in _from_pretrained
    added_tokens = tokenizer.sanitize_special_tokens()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 861, in sanitize_special_tokens
    return self.add_tokens(self.all_special_tokens_extended, special_tokens=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 1004, in add_tokens
    return self._add_tokens(new_tokens, special_tokens=special_tokens)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 421, in _add_tokens
    and self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 582, in convert_tokens_to_ids
    return self._convert_token_to_id_with_added_voc(tokens)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 595, in _convert_token_to_id_with_added_voc
    return self._convert_token_to_id(token)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b/tokenization_chatglm.py", line 96, in _convert_token_to_id
    return self.mergeable_ranks[token]
           ~~~~~~~~~~~~~~~~~~~~^^^^^^^
KeyError: '<|endoftext|>'
[2024-08-29 07:53:56,997] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 528556
[2024-08-29 07:53:56,997] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 528557
[2024-08-29 07:53:57,347] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 528558
[2024-08-29 07:53:58,671] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 528559
[2024-08-29 07:53:58,689] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 528560
[2024-08-29 07:53:58,698] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 528561
[2024-08-29 07:53:58,706] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 528562
[2024-08-29 07:53:58,720] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 528563
[2024-08-29 07:53:58,732] [ERROR] [launch.py:325:sigkill_handler] ['/root/anaconda3/envs/glm-4-copy/bin/python', '-u', 'main.py', '--local_rank=7', '--model_name_or_path', '/root/AI4E/share/glm-4-9b', '--train_file', './data/glm4/longwriter', '--output_dir', './output/glm4/longwriter', '--num_train_epochs', '4', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--save_strategy', 'steps', '--save_steps', '400', '--save_total_limit', '10', '--preprocessing_num_workers', '64', '--learning_rate', '1e-5', '--weight_decay', '0.1', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_dir', './logs/', '--deepspeed', 'ds_config/stage3.json', '--bf16', '--gradient_checkpointing', '1', '--adam_beta1', '0.9', '--adam_beta2', '0.95', '--report_to', 'wandb', '--run_name', 'glm4_longwriter', '--logging_steps', '1', '--batch_method', 'pack', '--pack_loss'] exits with return code = 1
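One illustrative reading of this trace (a toy sketch, not the real GLM-4 tokenizer code): transformers 4.33's sanitize_special_tokens() probes each special token through _convert_token_to_id, which in this tokenizer is a bare lookup into the BPE merge table, so a special token stored outside mergeable_ranks fails exactly like this. All names and values below are stand-ins:

```python
# Toy stand-ins (hypothetical values) for the two tables a tiktoken-style
# tokenizer keeps: ordinary BPE merges vs. separately-handled special tokens.
mergeable_ranks = {b"hello": 0, b"world": 1}
special_tokens = {"<|endoftext|>": 151329}  # illustrative id

def _convert_token_to_id(token):
    # Mirrors the failing line: a plain dict lookup, no special-token fallback.
    return mergeable_ranks[token]

_convert_token_to_id("<|endoftext|>")  # KeyError: '<|endoftext|>'
```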

LYCnight · Aug 29 '24 07:08

Could you try setting stage3_prefetch_bucket_size to 15099494 in the DeepSpeed config?

That works, but then a new error appears: RuntimeError: shape '[32768, -1, 1, 32, 2]' is invalid for input of size 524288

I hit exactly the same error:

Traceback of TorchScript (most recent call last):
  File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b/modeling_chatglm.py", line 145, in apply_rotary_pos_emb
    rope_cache = rope_cache[:sq]
    xshaped = x.reshape(sq, -1, np, rot_dim // 2, 2)
    rope_cache = rope_cache.view(sq, -1, 1, xshaped.size(3), 2)
    x_out2 = torch.stack(
        [
RuntimeError: shape '[32768, -1, 1, 32, 2]' is invalid for input of size 524288

badarrrr · Aug 29 '24 08:08

(Quoting the ./scripts/glm4_longwriter.sh error report above: KeyError: '<|endoftext|>')

Hi, please use the tokenizer code from LongWriter-glm4-9b; the current training code does not support the latest GLM-4-9b tokenizer.
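A sketch of one way to apply that suggestion (assuming the Hub repo id THUDM/LongWriter-glm4-9b and an installed huggingface_hub; the local path is the one from this thread). You may also need to clear the cached copy under ~/.cache/huggingface/modules/transformers_modules/glm-4-9b, which is where the tracebacks show the old file being loaded from:

```python
# Sketch: overwrite the local glm-4-9b tokenizer code with the version
# shipped in the LongWriter-glm4-9b repository.
import shutil
from huggingface_hub import hf_hub_download

patched = hf_hub_download("THUDM/LongWriter-glm4-9b", "tokenization_chatglm.py")
shutil.copy(patched, "/root/AI4E/share/glm-4-9b/tokenization_chatglm.py")
```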

bys0318 · Aug 29 '24 08:08

(Quoting the ./scripts/glm4_longwriter.sh error report above, and the reply: "please use the tokenizer code from LongWriter-glm4-9b; the current training code does not support the latest GLM-4-9b tokenizer.")

Is RuntimeError: shape '[32768, -1, 1, 32, 2]' is invalid for input of size 524288 the same problem you're referring to?

badarrrr · Aug 29 '24 08:08

(Quoting the exchange above: setting stage3_prefetch_bucket_size to 15099494, the RuntimeError: shape '[32768, -1, 1, 32, 2]' is invalid for input of size 524288 reply, and badarrrr's identical TorchScript traceback.)

Now I'm getting this error too.

LYCnight · Aug 30 '24 02:08

Two kinds of errors come up now.

  • Environment:
    • python==3.11.9
    • transformers==4.33.0
    • pytorch==2.2.0
    • Both modeling_chatglm.py and tokenization_chatglm.py under the /glm-4-9b directory have been replaced

Error 1: RuntimeError: shape '[32768, -1, 1, 32, 2]' is invalid for input of size 524288

With "stage3_prefetch_bucket_size": 15099494 set in /ds_config/stage3.json,

the run gets all the way to the wandb screen, but errors out as soon as training starts:

  File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/trainer.py", line 2679, in training_step
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b/modeling_chatglm.py", line 146, in apply_rotary_pos_emb
    rope_cache = rope_cache[:sq]
    xshaped = x.reshape(sq, -1, np, rot_dim // 2, 2)
    rope_cache = rope_cache.view(sq, -1, 1, xshaped.size(3), 2)
                 ~~~~~~~~~~~~~~~ <--- HERE
    x_out2 = torch.stack(
        [
RuntimeError: shape '[32768, -1, 1, 32, 2]' is invalid for input of size 524288
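The numbers in the message already show what went wrong: the flattened rope_cache holds 524288 = 32768 × 16 values, i.e. 16 per position, while view(32768, -1, 1, 32, 2) needs a multiple of 32 × 2 = 64 per position, which suggests the rope cache was built by a modeling_chatglm.py whose rotary setup does not match the packed 32768-token input. The shape error alone reproduces in isolation:

```python
# Reproduces just the tensor-shape failure, independent of training.
import torch

rope_cache = torch.zeros(32768 * 16)  # 524288 values, 16 per position
try:
    rope_cache.view(32768, -1, 1, 32, 2)  # needs a multiple of 64 per position
except RuntimeError as e:
    print(e)  # shape '[32768, -1, 1, 32, 2]' is invalid for input of size 524288
```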

Error 2: Input should be a valid integer, got a number with a fractional part

With "stage3_prefetch_bucket_size": "auto" set in /ds_config/stage3.json,

the run fails before wandb even comes up:

  File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/deepspeed/runtime/config.py", line 817, in _initialize_params
    self.zero_config = get_zero_config(param_dict)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/deepspeed/runtime/zero/config.py", line 71, in get_zero_config
    return DeepSpeedZeroConfig(**zero_config_dict)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/deepspeed/runtime/config_utils.py", line 57, in __init__
    super().__init__(**data)
  File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/pydantic/main.py", line 193, in __init__
    self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for DeepSpeedZeroConfig
stage3_prefetch_bucket_size
  Input should be a valid integer, got a number with a fractional part [type=int_from_float, input_value=15099494.4, input_type=float]
    For further information visit https://errors.pydantic.dev/2.8/v/int_from_float

LYCnight · Aug 30 '24 02:08

(Quoting the ./scripts/glm4_longwriter.sh error report above: KeyError: '<|endoftext|>', and the reply: "please use the tokenizer code from LongWriter-glm4-9b".)

Hi, is there a solution yet? I'd still like to get the training running.

LYCnight · Sep 02 '24 01:09

(Quoting the ./scripts/glm4_longwriter.sh error report and the exchange above: KeyError: '<|endoftext|>'; "please use the tokenizer code from LongWriter-glm4-9b"; "is there a solution yet?")

Hi, judging from the error output, the code is still running with the original glm-4-9b tokenization_chatglm.py rather than the tokenization_chatglm.py from LongWriter-glm4-9b. Please check that trust_remote_code=True is passed when loading the model and tokenizer in main.py.
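For anyone debugging this, a quick check of which tokenizer implementation actually got picked up once loading succeeds (illustrative snippet; it just prints the defining module of the loaded class):

```python
import inspect
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/root/AI4E/share/glm-4-9b", trust_remote_code=True)
print(type(tok).__name__)          # expect the custom ChatGLM4Tokenizer
print(inspect.getfile(type(tok)))  # which tokenization_chatglm.py is in use
```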

bys0318 · Sep 03 '24 11:09

(Quoting the full two-error report above: Error 1, the RoPE shape RuntimeError; Error 2, the stage3_prefetch_bucket_size ValidationError.)

For Error 2, please change "stage3_prefetch_bucket_size": "auto" to 15099494.

bys0318 · Sep 03 '24 11:09

Judging from https://github.com/hiyouga/LLaMA-Factory/issues/5252, the "stage3_prefetch_bucket_size": "auto" error can also be resolved by downgrading DeepSpeed; try pip install deepspeed==0.14.4.
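A tiny sanity check (illustrative) before and after downgrading, since the strict integer validation that rejects the float only appears with newer DeepSpeed releases:

```python
# Print the versions involved in the ValidationError above.
import deepspeed
import pydantic

print(deepspeed.__version__)  # 0.14.4 reportedly accepts "auto" here
print(pydantic.__version__)
```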

bys0318 · Sep 03 '24 11:09

@LYCnight @badarrrr Please see whether the FAQ in our README resolves the problems you ran into. Sorry to keep you waiting.

bys0318 · Sep 03 '24 15:09

@LYCnight @badarrrr Please see whether the FAQ in our README resolves the problems you ran into. Sorry to keep you waiting.

Thank you very much! I've now trained it successfully. Sharing some notes here: https://github.com/THUDM/LongWriter/issues/25

LYCnight · Sep 04 '24 02:09