[BUG]: The colossalai model parallel strategy in the code under ColossalAI/applications/ChatGPT/examples reports an error.
🐛 Describe the bug
Take train_prompts.py under that path as an example.
On a single GPU, python train_prompts.py --model opt --pretrain opt-125m runs fine, and larger models can also be trained.
But launching with the provided shell script, torchrun --standalone --nproc_per_node=2 train_prompts.py prompts.csv --model opt --pretrain opt-125m/ --strategy colossalai_gemini --train_batch_size 1, fails with an error.
The error message is as follows:
Do any parameters need to be changed when running the script? How can I locate and fix the problem?
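For context, the example scripts of that era dispatched the --strategy flag to a strategy object roughly as sketched below. This is a hedged sketch: the import path chatgpt.trainer.strategies and the exact class names are assumptions based on the ChatGPT example code at the time and may differ in newer releases.

```python
# Hedged sketch of how --strategy is typically dispatched in the
# ChatGPT examples; class names/imports are assumptions, not the
# authoritative current API.
from chatgpt.trainer.strategies import (
    ColossalAIStrategy, DDPStrategy, NaiveStrategy)

def build_strategy(name: str):
    if name == 'naive':
        return NaiveStrategy()
    if name == 'ddp':
        return DDPStrategy()
    if name == 'colossalai_gemini':
        # ZeRO stage 3 with Gemini's dynamic placement of parameters
        return ColossalAIStrategy(stage=3, placement_policy='cuda')
    if name == 'colossalai_zero2':
        return ColossalAIStrategy(stage=2)
    raise ValueError(f'Unsupported strategy "{name}"')
```

Under this dispatch, the single-GPU run and the torchrun colossalai_gemini run exercise very different code paths, which is why only the distributed launch fails.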
Environment
CUDA = 11.2
Python = 3.7.3
PyTorch = 1.13.1
Hi @Qian0733, thank you for your feedback, but we can't reproduce your bug. It seems there's something wrong with your environment. Can you give us more information about it? Thank you. (We also suggest you update to our newest code and try again.)
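When reporting back, a quick environment dump helps maintainers compare setups. Note that PyTorch 1.13.1 ships wheels built against CUDA 11.6/11.7, so a system toolkit of 11.2 is worth double-checking, since ColossalAI compiles CUDA extensions against it. A minimal report using only standard Python/PyTorch APIs:

```python
# Minimal environment report using only standard APIs; paste the
# output into the issue so maintainers can compare setups.
import platform
import torch

print('python     :', platform.python_version())
print('torch      :', torch.__version__)
print('torch cuda :', torch.version.cuda)        # CUDA the wheel was built with
print('cuda ok    :', torch.cuda.is_available())
print('gpu count  :', torch.cuda.device_count())
if torch.cuda.is_available():
    print('gpu name   :', torch.cuda.get_device_name(0))
```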
I ran into the same problem. Have you solved it? Also, with strategy=ddp the script runs normally, but the wall-clock time and GPU memory usage are the same whether I run on multiple GPUs or a single GPU! Have you encountered this?
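For what it's worth, identical per-GPU memory under DDP is expected: DDP keeps a full model replica on every rank, and only sharded strategies such as colossalai_gemini (ZeRO) reduce per-GPU memory. Identical wall-clock time usually means every rank is iterating the full dataset; sharding the data with a DistributedSampler lets an epoch finish proportionally faster. A minimal sketch using only standard PyTorch APIs (the toy dataset and sizes are illustrative):

```python
# Why DDP can look "no faster" than one GPU: DDP replicates the model,
# so per-GPU memory matching the single-GPU run is expected. Speedup
# comes from sharding the *data*; without a DistributedSampler every
# rank processes the full dataset and an epoch takes single-GPU time.
# Run with: torchrun --standalone --nproc_per_node=2 this_file.py
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group(backend='nccl')

dataset = TensorDataset(torch.randn(1024, 16))   # toy data, illustrative
sampler = DistributedSampler(dataset)            # each rank gets 1/world_size
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)                     # reshuffle shards per epoch
    for (batch,) in loader:
        pass                                     # training step goes here
```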
I hit the same problem when running train_sft.py.
We have updated the code a lot; please check the latest version. This issue was closed due to inactivity. Thanks.