
72B-lora-zero3 script error

Open · zeroleavebaoyang opened this issue 2 years ago · 4 comments

Environment: transformers 4.34.0, torch 2.0.1+cu118, deepspeed 0.12.4, flash-attn 2.3.2

Script: finetune_lora_ds.sh (ZeRO-3)

Code version: latest code

Error log (frames from multiple processes are interleaved):

ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.5232646465301514 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.6317603588104248 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.5552959442138672 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.5726652145385742 seconds
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.546497106552124 seconds
Time to load cpu_adam op: 0.5796217918395996 seconds
Parameter Offload: Total persistent parameters: 3284992 in 243 params
Traceback (most recent call last):
  File "/home/xiaoi/pan/ssh/Qwen/finetune.py", line 360, in train()
  File "/home/xiaoi/pan/ssh/Qwen/finetune.py", line 353, in train trainer.train()
  File "/home/xiaoi/anaconda3/envs/torch_p/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train return inner_training_loop(
  File "/home/xiaoi/anaconda3/envs/torch_p/lib/python3.10/site-packages/transformers/trainer.py", line 1892, in _inner_training_loop tr_loss_step = self.training_step(model, inputs)
  File "/home/xiaoi/anaconda3/envs/torch_p/lib/python3.10/site-packages/transformers/trainer.py", line 2776, in training_step loss = self.compute_loss(model, inputs)
  File "/home/xiaoi/anaconda3/envs/torch_p/lib/python3.10/site-packages/transformers/trainer.py", line 2801, in compute_loss outputs = model(**inputs)
  File "/home/xiaoi/anaconda3/envs/torch_p/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs)
  File "/home/xiaoi/anaconda3/envs/torch_p/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs)
  File "/home/xiaoi/anaconda3/envs/torch_p/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1822, in forward loss = self.module(*inputs, **kwargs)
  File "/home/xiaoi/anaconda3/envs/torch_p/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
Traceback (most recent call last):
  File "/home/xiaoi/pan/ssh/Qwen/finetune.py", line 360, in result = forward_call(*args, **kwargs)
  File "/home/xiaoi/anaconda3/envs/torch_p/lib/python3.10/site-packages/peft/peft_model.py", line 918, in forward train()
  File "/home/xiaoi/pan/ssh/Qwen/finetune.py", line 353, in train trainer.train()
  File "/home/xiaoi/anaconda3/envs/torch_p/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train return self.base_model(
  File "/home/xiaoi/anaconda3/envs/torch_p/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl result = forward_call(*args, **kwargs)
  File "/home/xiaoi/anaconda3/envs/torch_p/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 94, in forward return self.model.forward(*args, **kwargs)
  File "/home/xiaoi/.cache/huggingface/modules/transformers_modules/Qwen-72B/modeling_qwen.py", line 1045, in forward return inner_training_loop(
  File "/home/xiaoi/anaconda3/envs/torch_p/lib/python3.10/site-packages/transformers/trainer.py", line 1892, in _inner_training_loop transformer_outputs = self.transformer(
  File "/home/xiaoi/anaconda3/envs/torch_p/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl tr_loss_step = self.training_step(model, inputs)
  File "/home/xiaoi/anaconda3/envs/torch_p/lib/python3.10/site-packages/transformers/trainer.py", line 2776, in training_step result = forward_call(*args, **kwargs)
  File "/home/xiaoi/.cache/huggingface/modules/transformers_modules/Qwen-72B/modeling_qwen.py", line 824, in forward inputs_embeds = self.wte(input_ids)
  File "/home/xiaoi/anaconda3/envs/torch_p/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl result = forward_call(*args, **kwargs)
  File "/home/xiaoi/anaconda3/envs/torch_p/lib/python3.10/site-packages/peft/utils/other.py", line 186, in forward loss = self.compute_loss(model, inputs)
  File "/home/xiaoi/anaconda3/envs/torch_p/lib/python3.10/site-packages/transformers/trainer.py", line 2801, in compute_loss return self.modules_to_save[self.active_adapter](*args, **kwargs)
  File "/home/xiaoi/anaconda3/envs/torch_p/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl result = hook(self, args)
  File "/home/xiaoi/anaconda3/envs/torch_p/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs)
  File "/home/xiaoi/anaconda3/envs/torch_p/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook self.pre_sub_module_forward_function(module)
  File "/home/xiaoi/anaconda3/envs/torch_p/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)
  File "/home/xiaoi/anaconda3/envs/torch_p/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs)
  File "/home/xiaoi/anaconda3/envs/torch_p/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context outputs = model(**inputs)
  File "/home/xiaoi/anaconda3/envs/torch_p/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return func(*args, **kwargs)
  File "/home/xiaoi/anaconda3/envs/torch_p/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 310, in fetch_sub_module assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 643, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 0, 'shape': (0,), 'ds_shape': (0,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {7}, 'ds_tensor.shape': torch.Size([0])}
  return forward_call(*args, **kwargs)

zeroleavebaoyang · Dec 03 '23 06:12

Hi, are you fine-tuning the 72B Base model (could you share the name of the model you fine-tuned)? As mentioned in the documentation, when a Base model is fine-tuned (i.e. a model whose name does not contain "chat"), the embedding is added to the fine-tuned parameters. ZeRO 3's support for this setup still has the problem described in this issue, so we suggest modifying the finetune.py code to explicitly exclude the embedding from the fine-tuned parameters:

if lora_args.q_lora or 'chat' in model_args.model_name_or_path.lower():
    modules_to_save = None
else:
    # modules_to_save = ["wte", "lm_head"]
    modules_to_save = None   # change to this line

fyabc · Dec 04 '23 09:12
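
For context, here is a minimal sketch of where the modules_to_save value computed above typically ends up in a peft-based LoRA setup such as finetune.py. It is an illustration added here, not code from the repository: model is assumed to be the already loaded base model, and the r/lora_alpha/lora_dropout values and Qwen target-module names are assumptions.

# Hedged sketch: modules_to_save feeds into the peft LoRA configuration.
# Hyperparameters and target module names below are illustrative assumptions.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn", "c_proj", "w1", "w2"],  # assumed Qwen projection layers
    task_type="CAUSAL_LM",
    modules_to_save=modules_to_save,  # None here, so wte/lm_head stay frozen under ZeRO 3
)
model = get_peft_model(model, lora_config)

With q_lora or a chat model, the first branch of the snippet above already yields None, so the suggested change only affects base-model LoRA runs under ZeRO 3.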

> Hi, are you fine-tuning the 72B Base model (could you share the name of the model you fine-tuned)? As mentioned in the documentation, when a Base model is fine-tuned (i.e. a model whose name does not contain "chat"), the embedding is added to the fine-tuned parameters. ZeRO 3's support for this setup still has the problem described in this issue, so we suggest modifying the finetune.py code to explicitly exclude the embedding from the fine-tuned parameters:
>
> if lora_args.q_lora or 'chat' in model_args.model_name_or_path.lower():
>     modules_to_save = None
> else:
>     # modules_to_save = ["wte", "lm_head"]
>     modules_to_save = None   # change to this line

If the base model is fine-tuned without saving wte and lm_head, what should we expect from the outputs? Does it just mean the model cannot learn the two tokens <|im_start|> and <|im_end|>? In other words, what kind of performance loss does this cause for the fine-tuned model?

Luobots · Dec 04 '23 12:12

> > Hi, are you fine-tuning the 72B Base model (could you share the name of the model you fine-tuned)? As mentioned in the documentation, when a Base model is fine-tuned (i.e. a model whose name does not contain "chat"), the embedding is added to the fine-tuned parameters. ZeRO 3's support for this setup still has the problem described in this issue, so we suggest modifying the finetune.py code to explicitly exclude the embedding from the fine-tuned parameters:
> >
> > if lora_args.q_lora or 'chat' in model_args.model_name_or_path.lower():
> >     modules_to_save = None
> > else:
> >     # modules_to_save = ["wte", "lm_head"]
> >     modules_to_save = None   # change to this line
>
> If the base model is fine-tuned without saving wte and lm_head, what should we expect from the outputs? Does it just mean the model cannot learn the two tokens <|im_start|> and <|im_end|>? In other words, what kind of performance loss does this cause for the fine-tuned model?

Same question here.

chenyzh28 · Dec 06 '23 03:12

@Luobots @chenyzh28 When the base model is fine-tuned without fine-tuning the embedding, it cannot learn the two special tokens, which may affect performance to some extent; we do not yet have detailed data on how large the impact is. We are working on code to fix the issue that the base model's embedding cannot be fine-tuned.

fyabc · Dec 08 '23 02:12
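
To make the explanation above concrete, here is a small check added for illustration (not code from the thread): with modules_to_save = None, only the LoRA adapter weights are trainable, while wte and lm_head, and therefore the embedding rows for <|im_start|> and <|im_end|>, stay frozen. model is assumed to be the peft-wrapped model produced in finetune.py.

# Hedged check: with modules_to_save=None, peft marks only LoRA weights as trainable.
trainable = [name for name, param in model.named_parameters() if param.requires_grad]
print(f"{len(trainable)} trainable tensors")
assert all("lora_" in name for name in trainable)  # only adapter weights are trained
assert not any("wte" in name or "lm_head" in name for name in trainable)  # embeddings stay frozen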

> > Hi, are you fine-tuning the 72B Base model (could you share the name of the model you fine-tuned)? As mentioned in the documentation, when a Base model is fine-tuned (i.e. a model whose name does not contain "chat"), the embedding is added to the fine-tuned parameters. ZeRO 3's support for this setup still has the problem described in this issue, so we suggest modifying the finetune.py code to explicitly exclude the embedding from the fine-tuned parameters:
> >
> > if lora_args.q_lora or 'chat' in model_args.model_name_or_path.lower():
> >     modules_to_save = None
> > else:
> >     # modules_to_save = ["wte", "lm_head"]
> >     modules_to_save = None   # change to this line
>
> If the base model is fine-tuned without saving wte and lm_head, what should we expect from the outputs? Does it just mean the model cannot learn the two tokens <|im_start|> and <|im_end|>? In other words, what kind of performance loss does this cause for the fine-tuned model?

I tried it. After fine-tuning, generation may fail to stop: once the normal answer has been produced, the model keeps generating (apparently) random tokens, and it may also raise an "unknown ids" error.

Akira4ever · Jan 30 '24 07:01
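
For readers who hit the same runaway generation, below is a possible mitigation sketched for illustration only (it is not from this thread): stop decoding explicitly at <|im_end|>, since the fine-tuned base model may never have learned to emit it reliably. "output_qwen" is a hypothetical checkpoint path, tokenizer.im_end_id is assumed to expose the id of <|im_end|>, and in practice the prompt should follow the same ChatML format used during fine-tuning.

# Hedged sketch: force generation to stop at <|im_end|>.
# Assumptions: "output_qwen" is a hypothetical fine-tuned checkpoint path and the
# Qwen tokenizer exposes the <|im_end|> token id as tokenizer.im_end_id.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("output_qwen", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "output_qwen", device_map="auto", trust_remote_code=True
).eval()

inputs = tokenizer("你好", return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    eos_token_id=tokenizer.im_end_id,  # assumed attribute; stops decoding at <|im_end|>
)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))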