ChatGLM2-6B [BUG/Help] <Running bash ds_train_finetune.sh on two A100 80G cards with batchsize=1 and num_gpus=2 reports OOM; the first card is not used at all, only the second card is used, and it OOMs once it fills up>

Open LittleXu1998 opened this issue 1 year ago • 3 comments

Is there an existing issue for this?

[X] I have searched the existing issues

Current Behavior

Expected Behavior

No response

Steps To Reproduce

Environment

- OS:Ubuntu 18.04
- Python:3.9
- Transformers:4.29.2
- PyTorch:2.0.2+cu18
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :True

Anything else?

No response

Jul 07 '23 06:07 LittleXu1998

+1. I used 16 A10 cards and memory still ran out on all of them.

Jul 07 '23 06:07 traveler-vee

It looks like DeepSpeed is not actually distributing this model. Every card loads a full copy of it.
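For a rough sense of why a fully replicated copy may not fit, the sketch below estimates per-GPU memory for model states only. It uses the commonly cited accounting for mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state. The 6e9 parameter count is a nominal assumption for a "6B" model, and activations, buffers, and fragmentation are ignored, so real usage is higher. Which ZeRO stage `ds_train_finetune.sh` actually configures is not stated in this thread.

```python
# Rough per-GPU memory for model states under mixed-precision Adam.
# Assumptions (not from this thread): 6e9 parameters, fp16 weights and
# gradients, fp32 optimizer state; activations and buffers are ignored.

WEIGHT_BYTES = 2      # fp16 parameters
GRAD_BYTES = 2        # fp16 gradients
OPTIMIZER_BYTES = 12  # fp32 master weights + Adam momentum + Adam variance


def model_state_gb(num_params: float, num_gpus: int, zero_stage: int) -> float:
    """Estimate model-state memory per GPU in GB (1 GB = 1e9 bytes)."""
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2 or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    weights, grads, optim = WEIGHT_BYTES, GRAD_BYTES, OPTIMIZER_BYTES
    if zero_stage >= 1:  # stage 1 shards optimizer state
        optim /= num_gpus
    if zero_stage >= 2:  # stage 2 also shards gradients
        grads /= num_gpus
    if zero_stage >= 3:  # stage 3 also shards the parameters
        weights /= num_gpus
    return num_params * (weights + grads + optim) / 1e9


if __name__ == "__main__":
    for gpus in (2, 16):
        for stage in (0, 1, 2, 3):
            gb = model_state_gb(6e9, gpus, stage)
            print(f"{gpus} GPUs, ZeRO-{stage}: {gb:.1f} GB per GPU")
```

Under these assumptions an unsharded replica needs about 96 GB per GPU for model states alone, more than an 80 GB A100 offers, and with two GPUs even ZeRO-3 leaves about 48 GB per GPU before activations are counted.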

Jul 07 '23 08:07 liuzhipengchd

You could try changing the trainer and see whether that helps. In `trainer_seq2seq.py`, import: from transformers.trainer import Trainer
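The comment gives only the import line, not the surrounding file. As a sketch of how one might confirm the swap took effect, the hypothetical helper `trainer_lineage` below lists the classes a trainer inherits from. After the change, the lineage of the trainer class in `trainer_seq2seq.py` would be expected to include `transformers.trainer.Trainer` and no trainer class from a local module. The helper name and the stand-in classes are illustrative and are not part of the repository.

```python
def trainer_lineage(cls: type) -> list[str]:
    """Return 'module.ClassName' for each class in the MRO, excluding object."""
    return [
        f"{c.__module__}.{c.__qualname__}"
        for c in cls.__mro__
        if c is not object
    ]


# Stand-in classes so the sketch runs without transformers installed.
class Trainer:  # placeholder for the class imported from transformers.trainer
    pass


class Seq2SeqTrainer(Trainer):  # placeholder for the class in trainer_seq2seq.py
    pass


# In the real script, call this on the actual trainer class instead.
print(trainer_lineage(Seq2SeqTrainer))
```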

Jul 07 '23 09:07 lilongxian

Did you manage to solve this in the end?

Jul 12 '23 02:07 bigbigwatermalon

Has anyone managed to solve this?

Jul 14 '23 08:07 xyannly

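Sorry, I was busy for the past few days and did not log in. I ran into this problem once on a personal machine with two GPUs: training started out exactly like this. I then spent a long time on the parallel-training setup without fixing it. Finally it occurred to me that GLM1-6B originally also used the hf tf Trainer (later versions replaced it), so I switched back to the original one and the problem went away.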

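One more thing: when loading the model, first load the model configuration file config.json, then pass it as an argument to the model-loading function. Otherwise memory (when training on CPU) or GPU memory fills up very quickly.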

I don't know whether this will also solve your problem, but you can give it a try first.
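The config-first loading order described above can be sketched as follows. The comment does not name the functions involved, so the use of `AutoConfig` and `AutoModel` from `transformers` is an assumption, as is the helper name `load_model_with_config`. Whether this ordering reduces memory use in a given setup is not verified here.

```python
def load_model_with_config(model_path: str):
    """Load the config first, then pass it to the model loader.

    Hypothetical helper: the thread describes the order of operations
    but does not show the actual code.
    """
    # Imported lazily so this file can be imported without transformers.
    from transformers import AutoConfig, AutoModel

    # trust_remote_code=True is required for checkpoints that ship their
    # own modeling code, as ChatGLM2-6B does.
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)

    # Hand the already-loaded config to the model loader explicitly.
    model = AutoModel.from_pretrained(
        model_path, config=config, trust_remote_code=True
    )
    return config, model
```

A call such as `config, model = load_model_with_config("THUDM/chatglm2-6b")` would download or read the full checkpoint, so it is left out of the block itself.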

Jul 14 '23 09:07 lilongxian

Could you explain in more detail how you solved it? What does "hf tf Trainer" mean here? I didn't quite follow.

Jul 18 '23 02:07 1332231041
