ColossalAI [BUG]: A100 24G, out of memory when loading opt6.7b


oomkilled

Default script:

```
set -x
export BS=${BS:-16}
export MEMCAP=${MEMCAP:-0}
export GPUNUM=${GPUNUM:-1}

export MODLE_PATH="facebook/opt-${MODEL}"
model_name_or_path=./opt6.7b

# HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1
torchrun \
  --nproc_per_node ${GPUNUM} \
  --master_port 19198 \
  train_gemini_opt.py \
  --mem_cap ${MEMCAP} \
  --model_name_or_path ${model_name_or_path} \
  --batch_size ${BS}
```

Environment

Versions: torch1.12+cu113, deepspeed: 0.7.7. RAM: 80G

Feb 16 '23 12:02 iMountTai

Found a very similar issue #2758. Please try using a smaller batch size (e.g. 1).

Feb 17 '23 13:02 JThh

After installing ColossalAI from source, the run stays stuck at this point.

Feb 20 '23 12:02 iMountTai

Please install apex before trying again. Follow the instructions here.

Feb 21 '23 05:02 JThh

@JThh this is not about apex. I can take this over.

Feb 21 '23 05:02 FrankLeeeee

@iMountTai Most likely your CUDA environment does not match the one PyTorch requires. You can check this as follows.

Check your system CUDA version:

nvcc --version

Check your PyTorch's CUDA version:

import torch
print(torch.version.cuda)

Compilation will only succeed if the two versions match.
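The two checks above can be combined into one script. This is only a minimal sketch: the helper names (`parse_nvcc_release`, `system_cuda_version`, `versions_match`) are made up for illustration, and treating "match" as equal major.minor versions is an assumption that may be stricter than a given build actually needs.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract 'major.minor' from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def system_cuda_version() -> Optional[str]:
    """Return the toolkit version reported by nvcc, or None if nvcc is not on PATH."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


def versions_match(system_cuda: Optional[str], torch_cuda: Optional[str]) -> bool:
    """True only when both versions are known and share the same major.minor.

    Assumption: equality of major.minor is what "match" means here.
    """
    if system_cuda is None or torch_cuda is None:
        return False
    return system_cuda.split(".")[:2] == torch_cuda.split(".")[:2]


if __name__ == "__main__":
    try:
        import torch
        torch_cuda = torch.version.cuda  # None on CPU-only builds
    except ImportError:
        torch_cuda = None
    system_cuda = system_cuda_version()
    print(f"system CUDA: {system_cuda}, PyTorch CUDA: {torch_cuda}")
    print("match" if versions_match(system_cuda, torch_cuda) else "mismatch or unknown")
```

With the setup reported in this thread, a cu113 PyTorch build on a machine whose toolkit is a different release would be expected to print a mismatch.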

Feb 21 '23 06:02 FrankLeeeee

After upgrading the environment to torch1.13cu117, the problem was solved, although I don't know why 😂

Feb 21 '23 08:02 iMountTai

> After upgrading the environment to torch1.13cu117, the problem was solved, although I don't know why 😂

Maybe your system CUDA is 11.7 :)

Feb 21 '23 08:02 FrankLeeeee

Why do the operators need to be rebuilt on every run? Can't the previously cached build be loaded directly?

Feb 21 '23 08:02 iMountTai

> Why do the operators need to be rebuilt on every run? Can't the previously cached build be loaded directly?

After the first compilation, every later run just reloads the build.

Feb 21 '23 08:02 FrankLeeeee

Thanks for the prompt reply. The problem still exists with torch1.12cu113.

> After upgrading the environment to torch1.13cu117, the problem was solved, although I don't know why 😂
>
> Maybe your system CUDA is 11.7 :)

Feb 21 '23 09:02 iMountTai

Glad to hear it was resolved. Thanks.

Apr 18 '23 11:04 binmakeswell
