
[BUG]: Failed to run train_prompts.py prompts.csv --strategy naive

Open exceedzhang opened this issue 1 year ago • 9 comments

🐛 Describe the bug

I downloaded prompts.csv and ran:

python train_prompts.py prompts.csv --strategy naive --lora_rank 16

Traceback (most recent call last):
  File "train_prompts.py", line 122, in <module>
    main(args)
  File "train_prompts.py", line 32, in main
    actor = GPTActor().cuda()
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
root@autodl-container-98b3119d3c-f07453a6:~/autodl-tmp/ColossalAI/applications/ChatGPT/examples#


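RuntimeError: CUDA error: out of memory. I am using an A5000 GPU with 24 GB of memory. How much GPU memory does training require? Is there something wrong with my run parameters? Any help would be appreciated!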
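
One rough way to reason about the memory question is a back-of-the-envelope estimate of the persistent training state. The sketch below assumes full fp32 fine-tuning with an Adam-style optimizer, which keeps about 16 bytes per trainable parameter (4 for the weight, 4 for the gradient, 8 for the two optimizer moments). It ignores activations, the CUDA context, and any additional models the trainer may hold, so it is a lower bound rather than a prediction. The function name and the 124M parameter count (GPT-2 small) are illustrative assumptions; the thread does not state which model size `GPTActor()` loads.

```python
def estimate_training_bytes(
    n_params: int,
    bytes_per_weight: int = 4,       # fp32 weight (assumption)
    bytes_per_grad: int = 4,         # fp32 gradient (assumption)
    bytes_optimizer_state: int = 8,  # two fp32 Adam moments (assumption)
) -> int:
    """Lower-bound estimate of weights + gradients + optimizer state, in bytes."""
    per_param = bytes_per_weight + bytes_per_grad + bytes_optimizer_state
    return n_params * per_param


if __name__ == "__main__":
    # Hypothetical model size: GPT-2 small has roughly 124M parameters.
    n_params = 124_000_000
    gib = estimate_training_bytes(n_params) / 2**30
    print(f"~{gib:.2f} GiB before activations and other models")
```

If the estimate for the chosen model is far below the card's capacity, it may be worth checking with `nvidia-smi` whether another process is already occupying the GPU.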

Environment

Using ChatGPT version 0.1.0.

exceedzhang avatar Mar 02 '23 07:03 exceedzhang

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


Title: [BUG]: Failed to run train_prompts.py prompts.csv --strategy naive

Issues-translate-bot avatar Mar 02 '23 07:03 Issues-translate-bot

image

exceedzhang avatar Mar 02 '23 07:03 exceedzhang


Thanks for your feedback. We suggest training with the colossalai_zero2 strategy instead of naive, which may save GPU memory. You can also use train_prompt.sh as a training demo.

ht-zhou avatar Mar 02 '23 10:03 ht-zhou

Thank you! I'll try again!

exceedzhang avatar Mar 03 '23 04:03 exceedzhang

@ht-zhou Could not find 'RANK' in the torch environment. How much GPU memory does this strategy need? I tried it and got this error:

Traceback (most recent call last):
  File "train_prompts.py", line 122, in <module>
    main(args)
  File "train_prompts.py", line 25, in main
    strategy = ColossalAIStrategy(stage=2, placement_policy='cuda')
  File "/opt/conda/lib/python3.8/site-packages/chatgpt/trainer/strategies/colossalai.py", line 77, in __init__
    super().__init__(seed)
  File "/opt/conda/lib/python3.8/site-packages/chatgpt/trainer/strategies/ddp.py", line 25, in __init__
    super().__init__()
  File "/opt/conda/lib/python3.8/site-packages/chatgpt/trainer/strategies/base.py", line 23, in __init__
    self.setup_distributed()
  File "/opt/conda/lib/python3.8/site-packages/chatgpt/trainer/strategies/colossalai.py", line 110, in setup_distributed
    colossalai.launch_from_torch({}, seed=self.seed)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/initialize.py", line 215, in launch_from_torch
    raise RuntimeError(
RuntimeError: Could not find 'RANK' in the torch environment, visit https://www.colossalai.org/ for more information on launching with torch
root@8b69ccfeec2f:/gpt/applications/ChatGPT/examples#

chingfeng2021 avatar Mar 03 '23 09:03 chingfeng2021

Hi @JThh, could you help answer this question?

ht-zhou avatar Mar 07 '23 02:03 ht-zhou

The RANK error may be fixed by launching the script with torchrun rather than running it directly with python.

torchrun --standalone --nproc_per_node=2 train_prompts.py <Insert args here>
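
A minimal sketch of why the launcher matters: torchrun exports environment variables such as RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT to each worker process, while a plain `python` invocation does not. The traceback only confirms that `colossalai.launch_from_torch` looks for RANK; treating the other four as required is an assumption here, and the helper name `missing_launch_vars` is hypothetical. A check like this could be run at the top of a script to fail early with a clearer message.

```python
import os
from typing import List, Mapping

# Variables torchrun sets for each worker. Only RANK is confirmed by the
# traceback in this thread; the rest are listed as an assumption.
REQUIRED_VARS = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")


def missing_launch_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the launcher variables that are absent from the given environment."""
    return [name for name in REQUIRED_VARS if name not in env]


if __name__ == "__main__":
    missing = missing_launch_vars()
    if missing:
        print("Not launched via torchrun? Missing:", ", ".join(missing))
    else:
        print("Distributed environment variables are present.")
```

On a single-GPU machine, `--nproc_per_node=1` would presumably be the matching value for the command above.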

ac0ra avatar Apr 11 '23 04:04 ac0ra

@chingfeng2021, has the issue been resolved?

JThh avatar Apr 17 '23 09:04 JThh

We have updated a lot. Please check the latest code. This issue was closed due to inactivity. Thanks.

binmakeswell avatar Apr 27 '23 08:04 binmakeswell