Baichuan-7B

[Question] Single-node, single-GPU training fails with an error: gradients cannot be initialized.

Open xkjcf opened this issue 1 year ago • 8 comments

Required prerequisites

Questions

I downloaded the model, created the data_dir directory, and wrote a new script, script/train2.sh:

    #!/bin/bash
    deepspeed train.py \
        --deepspeed \
        --deepspeed_config config/deepspeed.json

Running this script produces the following error:

    Traceback (most recent call last):
      File "/root/code/Baichuan-7B/train.py", line 138, in <module>
        model_engine = prepare_model()
      File "/root/code/Baichuan-7B/train.py", line 117, in prepare_model
        model_engine, _, _, _ = deepspeed.initialize(args=args,
      File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/__init__.py", line 165, in initialize
        engine = DeepSpeedEngine(args=args,
      File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
        self._configure_optimizer(optimizer, model_parameters)
      File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1173, in _configure_optimizer
        self.optimizer = self._configure_zero_optimizer(basic_optimizer)
      File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1409, in _configure_zero_optimizer
        optimizer = DeepSpeedZeroOptimizer(
      File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 468, in __init__
        self.initialize_gradient_partitioning_data_structures()
      File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 691, in initialize_gradient_partitioning_data_structures
        self.first_param_index_in_partition[i][partition_id] = self.get_first_param_index(
      File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 666, in get_first_param_index
        if partition_id in self.param_to_partition_ids[group_id][param_id]:
    KeyError: 0

The training documents in data_dir are ordinary multi-line text files.

Checklist

  • [X] I have provided all relevant and necessary information above.
  • [X] I have chosen a suitable title for this issue.

xkjcf, Jul 11 '23 07:07

Same question. It does not work with Python 3.9 either, and switching to a different machine did not help.

DoliteMatheo, Jul 17 '23 10:07

Same question; I hit the same problem. A separate problem is that the pinned versions in the requirements conflict:

    The conflict is caused by:
        The user requested torch==2.0.0
        deepspeed 0.9.2 depends on torch
        xformers 0.0.20 depends on torch==2.0.1

LiManshiang, Jul 20 '23 02:07
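
When reporting a dependency conflict like the one above, it can help to state which versions are actually installed in the environment. The following is a minimal sketch, not part of the repository: the helper name `installed_versions` is hypothetical, and the package list is taken only from the pip message quoted above. It relies on the standard-library `importlib.metadata` module (Python 3.8+).

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent.

    Hypothetical helper for illustration; it only reads package metadata
    and does not import the packages themselves.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Package names taken from the conflict message in this thread.
    for name, version in installed_versions(["torch", "deepspeed", "xformers"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Pasting this output next to the traceback makes it easier for others to tell whether they are seeing the same combination of versions.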

+1

Aurora-slz, Jul 24 '23 07:07

Same question.

kztao, Jul 27 '23 02:07

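Quoting LiManshiang: "Same question; I hit the same problem. A separate problem is that the pinned versions in the requirements conflict: The user requested torch==2.0.0; deepspeed 0.9.2 depends on torch; xformers 0.0.20 depends on torch==2.0.1."

I saw this in other issues too, and I also installed torch==2.0.1, but the problem above still occurs. How did everyone solve it?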

hingkan, Aug 03 '23 02:08

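I ran into the same problem and found a related explanation in the DeepSpeed issues: https://github.com/microsoft/DeepSpeed/issues/3234. ZeRO stage 3 supports zero.Init, while stages 1 and 2 do not. Changing stage to 3 in deepspeed.json solved the problem for me.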

xinruozhang575, Nov 10 '23 08:11
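
The workaround reported above (setting the ZeRO stage to 3) could be applied by hand or with a small script. The sketch below is only an illustration under stated assumptions: it assumes the config uses DeepSpeed's standard `zero_optimization.stage` key, the contents of this repository's config/deepspeed.json are not shown in the thread, and the function names `set_zero_stage` and `rewrite_config` are hypothetical. Whether stage 3 suits a given setup is not established here; a later comment reports a new error after making this change.

```python
import json
from pathlib import Path


def set_zero_stage(config, stage=3):
    """Return a copy of a DeepSpeed config dict with zero_optimization.stage set.

    Other keys, including other zero_optimization options, are preserved.
    The input dict is not modified.
    """
    updated = dict(config)
    zero = dict(updated.get("zero_optimization", {}))
    zero["stage"] = stage
    updated["zero_optimization"] = zero
    return updated


def rewrite_config(path, stage=3):
    """Load a JSON config file, set the ZeRO stage, and write it back in place."""
    path = Path(path)
    config = json.loads(path.read_text(encoding="utf-8"))
    updated = set_zero_stage(config, stage)
    path.write_text(json.dumps(updated, indent=2) + "\n", encoding="utf-8")


# Example call (path taken from the launch script in this issue):
# rewrite_config("config/deepspeed.json", stage=3)
```

Keeping a backup of the original file before rewriting it makes it easy to revert if stage 3 introduces other errors.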

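Quoting xinruozhang575: "I ran into the same problem and found a related explanation in the DeepSpeed issues: https://github.com/microsoft/DeepSpeed/issues/3234. ZeRO stage 3 supports zero.Init, while stages 1 and 2 do not. Changing stage to 3 in deepspeed.json solved the problem for me."

[image] After making the change you described, I get a new error. Did you run into it too?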

Silentssss, Jan 08 '24 10:01

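Quoting xinruozhang575: "I ran into the same problem and found a related explanation in the DeepSpeed issues: https://github.com/microsoft/DeepSpeed/issues/3234. ZeRO stage 3 supports zero.Init, while stages 1 and 2 do not. Changing stage to 3 in deepspeed.json solved the problem for me."

Quoting Silentssss: "[image] After making the change you described, I get a new error. Did you run into it too?"

I ran into this problem too. Is there a solution?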

ucaslei, Aug 06 '24 06:08