ColossalAI
[BUG]: A100 24G, out of memory when loading opt6.7b
oomkilled

The default script:
`set -x
export BS=${BS:-16}
export MEMCAP=${MEMCAP:-0}
export GPUNUM=${GPUNUM:-1}
export MODLE_PATH="facebook/opt-${MODEL}"
model_name_or_path=./opt6.7b
# HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1
torchrun \
  --nproc_per_node ${GPUNUM} \
  --master_port 19198 \
  train_gemini_opt.py \
  --mem_cap ${MEMCAP} \
  --model_name_or_path ${model_name_or_path} \
  --batch_size ${BS}`
Environment
Versions: torch1.12+cu113, deepspeed: 0.7.7. Memory: 80G
Found a very similar issue #2758. Please try using a smaller batch size (e.g. 1).
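For context on the suggestion above, here is a rough back-of-the-envelope estimate of the memory taken by model states alone, which does not shrink with batch size. It is a sketch, not a measurement of the reporter's run: it assumes mixed-precision training with an Adam-style optimizer at 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights and two optimizer moments) and takes the parameter count as 6.7 billion from the model name. The function name `model_state_gb` is hypothetical.

```python
# Rough estimate of model-state memory for a 6.7B-parameter model.
# Assumption: mixed-precision Adam, 16 bytes per parameter in total.
# Activations and framework overhead are NOT included.

NUM_PARAMS = 6_700_000_000  # taken from the "opt6.7b" model name


def model_state_gb(num_params: int, bytes_per_param: int) -> float:
    """Return memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


fp16_weights_gb = model_state_gb(NUM_PARAMS, 2)    # weights only, fp16
model_states_gb = model_state_gb(NUM_PARAMS, 16)   # weights + grads + optimizer

if __name__ == "__main__":
    print(f"fp16 weights only:       {fp16_weights_gb:.1f} GB")
    print(f"weights+grads+optimizer: {model_states_gb:.1f} GB")
```

Under these assumptions the fp16 weights alone fit on a 24G device, but the full training state does not, so part of it has to be offloaded to host memory. Reducing the batch size only lowers activation memory, which this estimate leaves out.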
After installing ColossalAI from source, it always hangs here when run:
Please install apex before making the trial again. Follow the instruction here.
@JThh this is not about apex. I can take this over.
@iMountTai most likely your CUDA environment does not match the one required by PyTorch. You can check this as follows:
- check your system CUDA version
nvcc --version
- check your PyTorch's CUDA version
import torch
print(torch.version.cuda)
Compilation will only run successfully if they match.
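The two checks above can be combined into one script. This is a minimal sketch, assuming `nvcc` is on `PATH` and prints a line of the form `release 11.7` in its version output; the helper names (`parse_nvcc_release`, `parse_torch_cuda`, `versions_match`) are made up for illustration and are not part of ColossalAI or PyTorch.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(output):
    """Extract (major, minor) from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def parse_torch_cuda(version):
    """Turn a string such as '11.3' (torch.version.cuda) into (11, 3)."""
    if not version:
        return None  # CPU-only PyTorch builds report None
    major, minor = version.split(".")[:2]
    return (int(major), int(minor))


def system_cuda_version():
    """Run nvcc if it is available and parse its release number."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


def versions_match(system, torch_cuda):
    """True only if both versions are known and identical."""
    return system is not None and system == torch_cuda


if __name__ == "__main__":
    try:
        import torch
        torch_cuda = parse_torch_cuda(torch.version.cuda)
    except ImportError:
        torch_cuda = None
    system = system_cuda_version()
    print("system CUDA :", system)
    print("PyTorch CUDA:", torch_cuda)
    print("match       :", versions_match(system, torch_cuda))
```

In the situation reported here, a `cu113` PyTorch build on a machine whose toolkit is 11.7 would print `match: False`.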
After upgrading the environment to torch1.13cu117, the problem was solved, although I don't know why 😂
Maybe your system CUDA is 11.7 :)
Why do the operators need to be rebuilt on every run? Can't the previous cache be loaded directly?
After the first compilation, later runs just reload the cached build.
Thanks for the prompt reply. The problem still occurs with torch1.12cu113.
Glad to hear it was resolved. Thanks.