Error when training yi-34b with zero3_offload + sequence parallelism
I'm only using the config file yi_34b_200k_full_alpaca_enzh_32k_sp8 provided on GitHub, with the zero3_offload DeepSpeed option at runtime, but I get the error below. Is sequence parallelism currently incompatible with offload, or is there some other cause? Thanks.
Traceback (most recent call last):
  File "/opt/ml/job/xtuner/tools/train.py", line 342, in <module>
    main()
  File "/opt/ml/job/xtuner/tools/train.py", line 338, in main
    runner.train()
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/_flexible_runner.py", line 1200, in train
    model = self.train_loop.run()  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 286, in run
    self.run_iter(data_batch)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 309, in run_iter
    outputs = self.runner.model.train_step(
  File "/usr/local/lib/python3.10/dist-packages/mmengine/_strategy/deepspeed.py", line 135, in train_step
    optim_wrapper.update_params(parsed_loss)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/_strategy/deepspeed.py", line 83, in update_params
    self.step()
  File "/usr/local/lib/python3.10/dist-packages/mmengine/optim/scheduler/param_scheduler.py", line 115, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/_strategy/deepspeed.py", line 95, in step
    self.model.step()
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 2169, in step
    self._take_model_step(lr_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 2075, in _take_model_step
    self.optimizer.step()
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 2047, in step
    self.unscale_and_clip_grads(sub_group_id, scaled_global_grad_norm)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 2117, in unscale_and_clip_grads
    self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Every rank fails with the same traceback; the final RuntimeError differs only in the device index (cuda:0 through cuda:7).

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1336) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
Sorry for the inconvenience!
I just tested full-parameter fine-tuning of Yi-200K-34B with CPU offload and could not reproduce your problem. Here are my config with sequence parallel degree 2 and the training config & log: 34b_8k_sp2_config 34b_8k_sp2_log.log
In theory, sequence parallelism and CPU offload do not interfere with each other, so please first disable sequence parallelism (by setting sequence_parallel_size = 1; see the minimal sketch below) and test CPU offload training again to see whether the error persists. If it still fails, you may need to check that your environment is installed correctly.
Feel free to get back to us once you have further results!
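For clarity, the suggested test only changes one variable in the copied config; a minimal sketch (variable names as in the public xtuner config files):

```python
# Disable sequence parallelism for the CPU-offload test; everything else in
# yi_34b_200k_full_alpaca_enzh_32k_sp8 stays the same.
sequence_parallel_size = 1
# Assumption, not from the thread: the sp configs usually tie accumulative_counts
# to sequence_parallel_size to keep the effective global batch size constant,
# so that value may be worth double-checking as well.
```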
Hi, thanks for the reply. I tried 8k with sp2 on my side, but I still hit the same problem. Could you share your runtime environment? My current config is: yi_34b_200k_full_alpaca_zh_32k_sp8.log My environment: deepspeed 0.14.1, transformers 4.40.0, xtuner 0.1.18.dev0, torch 2.0.0+cu118
Please first try turning sequence parallelism off (by setting sequence_parallel_size = 1) and then test CPU offload training to see whether it still errors. I'm a bit worried the bug is not actually introduced by sequence parallelism.
Hi, I've confirmed the problem on my side. No matter how I changed the sequence parallel settings, I got the same error. I then downgraded deepspeed from 0.14.0 to 0.12.3 and the problem went away. Thanks for the patient help!
I have another question. Training runs now, but the number of training steps looks wrong. My settings are: sequence_parallel_size = 8, batch_size = 1, accumulative_counts = 8, max_epochs = 3, using the alpaca_zh dataset. The total number of training steps is only 32, which doesn't seem right: alpaca-data-gpt4-chinese has more than 50k samples, so 3 epochs should not come out to only 32 steps. Could you help take a look? Thanks!
What did you set the sequence length (max_length) to?
It's 4096. Does this concatenate multiple samples into one by default? I just changed it to 8192 and the total number of steps is still 32... although GPU memory usage did go up.
Could you try changing this line https://github.com/InternLM/xtuner/blob/main/xtuner/dataset/huggingface.py#L96 to num_proc=1 and running again?
I changed it but nothing changed. Could you briefly explain how the 32 is calculated? Thanks~
During data preprocessing we implement sample packing through the map_fn interface of Hugging Face datasets, using 32 processes in parallel by default. Each process takes in 1000 samples at a time by default, concatenates them into several long sequences, and discards whatever is left over at the end.
The larger max_length is, the more data gets discarded. That said, 8192 is not especially long, and it should not shrink the dataset this drastically.
My suggestion is to first clear the Hugging Face datasets cache (by default under ~/.cache/huggingface/datasets/), then change https://github.com/InternLM/xtuner/blob/main/xtuner/dataset/huggingface.py#L96 to num_proc=1 to avoid losing data, and reprocess the dataset to see whether it still comes out to 32 samples.
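To make the packing behaviour above concrete, here is a rough, hypothetical illustration using the Hugging Face datasets map interface. It is not xtuner's actual implementation (see the huggingface.py line linked above for that); the toy dataset and function names are made up for the example:

```python
import random
from datasets import Dataset

def pack(batch, max_length=64):
    """Concatenate one map batch and cut it into fixed-length chunks;
    the tail shorter than max_length is dropped."""
    flat = [tok for ids in batch["input_ids"] for tok in ids]
    n_chunks = len(flat) // max_length
    return {"input_ids": [flat[i * max_length:(i + 1) * max_length]
                          for i in range(n_chunks)]}

# Toy stand-in for a tokenized SFT dataset: 5000 samples of 10-200 tokens each.
toy = Dataset.from_dict(
    {"input_ids": [[1] * random.randint(10, 200) for _ in range(5000)]}
)

# Each map batch (1000 samples here) is packed independently, so every batch can
# lose up to (max_length - 1) trailing tokens; more worker processes and smaller
# batches mean more batch boundaries and therefore more dropped tokens.
packed = toy.map(pack, batched=True, batch_size=1000, num_proc=1,
                 remove_columns=toy.column_names)
print(len(packed), "packed samples of length 64")
```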
OK, I'll try that. By the way, the log you posted also seems to show 32 steps, so it doesn't look like it's specific to my setup.
I just remembered: my log shows 32 steps because I set
train_cfg = dict(type=TrainLoop, max_iters=32)
I was only running the first 32 iters to benchmark speed.
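For reference, the epoch-based form of that setting, as it appears in the stock xtuner configs (the import path may differ slightly between xtuner versions):

```python
from xtuner.engine.runner import TrainLoop  # adjust the import to your xtuner version

max_epochs = 3
# Train for a fixed number of epochs; max_iters is mainly useful for quick speed tests.
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
# train_cfg = dict(type=TrainLoop, max_iters=32)  # cap the run at 32 iterations instead
```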
Here are my updated config and log: 34b_32k_sp2_config.txt yi_34b_32k_sp2_log.txt
To track down why your training only has 32 iters, could you please:
- First check the train_cfg setting in your config and see whether the max_iters parameter is set.
- Go into the ~/.cache/huggingface/datasets directory; there should be two folders, silk-road___alpaca-data-gpt4-chinese and tatsu-lab___alpaca, corresponding to alpaca_zh and alpaca respectively. Delete all the cache-* files inside them (intermediate products of the map_fn), i.e. rm tatsu-lab___alpaca/default/0.0.0/dce01c9b08f87459cf36a430d809084718273017/cache-* and rm silk-road___alpaca-data-gpt4-chinese/default/0.0.0/81a6dfd72f416aff605e7d189bfbbc46a2511fee/cache-* (replace the hash values with your own). Then rerun. (A programmatic alternative is sketched below.)
If the problem persists, feel free to keep the discussion going!
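A programmatic alternative to deleting the cache-* files by hand, sketched with the Hugging Face datasets API (assumes the two datasets are loaded from the Hub under the names below):

```python
from datasets import load_dataset

# cleanup_cache_files() removes the cached map/filter products (the cache-*.arrow
# files) of a dataset, which forces the packing map_fn to run again on next use.
for name in ("silk-road/alpaca-data-gpt4-chinese", "tatsu-lab/alpaca"):
    ds = load_dataset(name, split="train")
    removed = ds.cleanup_cache_files()
    print(f"{name}: removed {removed} cached files")
```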
max_iters was indeed set to 32. After I changed it to the max_epochs value, training now runs normally. Thanks a lot for the patient answers!
@HIT-cwh Hi, I've run into a new problem. After an epoch finishes, training OOMs, and it happens reliably: always at the 7th step after the epoch ends, with my sequence parallel size set to 8. At that point no checkpoint should be saving; it's just a regular iteration. Could something be going wrong when samples from different epochs get packed back to back?
04/24 17:10:53 - mmengine - INFO - Iter(train) [31/96] lr: 1.6222e-05 eta: 0:11:18 time: 6.3558 data_time: 0.0073 memory: 7805 loss: 1.5815 tflops: 31.7220 tokens_per_sec: 109.9789
04/24 17:11:38 - mmengine - INFO - Exp name: qwen1.5_32b_full_alpaca_zh_32k_sp8_20240424_165414
04/24 17:11:38 - mmengine - INFO - Iter(train) [32/96] lr: 1.5989e-05 eta: 0:12:16 time: 44.8117 data_time: 0.0075 memory: 7970 loss: 1.4993 tflops: 5.5519 tokens_per_sec: 19.0352
04/24 17:11:38 - mmengine - WARNING - Reach the end of the dataloader, it will be restarted and continue to iterate. It is recommended to use mmengine.dataset.InfiniteSampler to enable the dataloader to iterate infinitely.
04/24 17:11:48 - mmengine - INFO - Iter(train) [33/96] lr: 1.5750e-05 eta: 0:12:02 time: 9.9696 data_time: 3.2989 memory: 7895 loss: 1.2497 tflops: 29.2799 tokens_per_sec: 99.4019
04/24 17:11:54 - mmengine - INFO - Iter(train) [34/96] lr: 1.5507e-05 eta: 0:11:41 time: 6.3881 data_time: 0.0065 memory: 7805 loss: 1.3924 tflops: 31.5616 tokens_per_sec: 109.4228
04/24 17:12:01 - mmengine - INFO - Iter(train) [35/96] lr: 1.5259e-05 eta: 0:11:21 time: 6.3998 data_time: 0.0062 memory: 7870 loss: 1.3473 tflops: 38.8749 tokens_per_sec: 133.2853
04/24 17:12:07 - mmengine - INFO - Iter(train) [36/96] lr: 1.5008e-05 eta: 0:11:02 time: 6.3611 data_time: 0.0078 memory: 7833 loss: 1.1118 tflops: 34.9025 tokens_per_sec: 120.4190
04/24 17:12:13 - mmengine - INFO - Iter(train) [37/96] lr: 1.4753e-05 eta: 0:10:44 time: 6.5525 data_time: 0.0073 memory: 7914 loss: 1.2245 tflops: 42.9649 tokens_per_sec: 146.2044
04/24 17:12:20 - mmengine - INFO - Iter(train) [38/96] lr: 1.4494e-05 eta: 0:10:26 time: 6.3583 data_time: 0.0055 memory: 7810 loss: 1.2365 tflops: 32.3295 tokens_per_sec: 111.9793
04/24 17:12:26 - mmengine - INFO - Iter(train) [39/96] lr: 1.4232e-05 eta: 0:10:08 time: 6.3835 data_time: 0.0062 memory: 7827 loss: 1.1699 tflops: 33.9666 tokens_per_sec: 117.3344
It OOMs right here. I set 3 epochs, 32 steps per epoch, and the dataset is alpaca_zh.
What gradient accumulation value did you set? Also, how much memory do your GPUs have? The log shows less than 8 GB of GPU memory in use.
accumulative_counts is set to the same value as sequence_parallel_size; I've tried both 8 and 4, and it always OOMs at the (accumulative_counts - 1)-th step after an epoch ends. Also, this OOM is in host memory, not GPU memory: I have 1 TB of RAM and 40 GB GPUs, and there is no GPU memory overflow. I also tried shrinking the dataset, and the same problem still appears.
Let me confirm: your setup is Yi34B + 32k seq length + sequence parallel size 4 (8) + deepspeed zero3 offload, right?
I'll try to reproduce your problem on my side.
Yes. My setup is Yi34B + 24k seq length (I also tried 12k) + sequence parallel size 4 (8) + deepspeed zero3 offload, and it reproduces even with a very small dataset. Thanks~
I've reproduced your problem. I ran two experiments, both with Yi 34B + deepspeed zero3 offload + 8 * A100 80G (1 TB host memory):
- 8k seq len + sequence parallel 4 + grad acc 4
- 2k seq len + sequence parallel 1 + grad acc 1
Without CPU offload these two setups should in theory have very similar GPU memory footprints (8k seq len / sequence parallel 4 = 2k).
Both runs hit a host-memory OOM at the (accumulative_counts - 1)-th step of the 2nd epoch, which suggests the OOM is not caused by sequence parallelism.
I also watched host memory during the first epoch: because of CPU offload, memory usage exceeded 90% in both runs.
8k seq len + sequence parallel 4 + grad acc 4: (memory-usage screenshot)
2k seq len + sequence parallel 1 + grad acc 1: (memory-usage screenshot)
My guess is that when going from epoch 1 to epoch 2, DeepSpeed ZeRO-3 offload does not release some of the memory it holds in time, so the first parameter update of epoch 2 runs out of host memory under CPU offload. I'm still debugging exactly why that memory is not released.
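For anyone who wants to confirm this pattern locally, a small helper for logging host-memory usage per iteration (assumes psutil is installed; not part of xtuner):

```python
import psutil

def log_host_memory(tag: str = "") -> None:
    """Print the current process's resident set size and overall system memory usage."""
    rss_gib = psutil.Process().memory_info().rss / 1024 ** 3
    vm = psutil.virtual_memory()
    print(f"[{tag}] rss={rss_gib:.1f} GiB, system used={vm.percent:.0f}%")

# e.g. call log_host_memory(f"iter {i}") once per training iteration and watch
# whether host memory keeps climbing across the epoch boundary.
```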
BTW, I tried a 16-GPU run and CPU offload training works normally there; if it's convenient, you could try that first.
OK, thanks for helping dig into this~ I don't have 16 GPUs available right now, so I'll wait for your progress.
This should be a bug in deepspeed 0.14.1; 0.14.0 does not have the problem with offload. It's being discussed in a deepspeed issue.
The deepspeed version I'm using now is 0.12.3. After switching from full-parameter fine-tuning to LoRA, memory usage dropped and everything works now. I did try a newer deepspeed earlier, but it raised the error at the top of this thread. Since I no longer have much need for full-parameter fine-tuning, I'll close this issue. Thanks everyone for the help~