[BUG]: A100 24G, out of memory when loading opt6.7b

Open iMountTai opened this issue 2 years ago • 15 comments

oomkilled image

Default script:

`set -x
export BS=${BS:-16}
export MEMCAP=${MEMCAP:-0}
export GPUNUM=${GPUNUM:-1}

export MODLE_PATH="facebook/opt-${MODEL}"
model_name_or_path=./opt6.7b

# HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1
torchrun \
  --nproc_per_node ${GPUNUM} \
  --master_port 19198 \
  train_gemini_opt.py \
  --mem_cap ${MEMCAP} \
  --model_name_or_path ${model_name_or_path} \
  --batch_size ${BS}`

Environment

Versions: torch1.12+cu113, deepspeed 0.7.7. Memory: 80G

iMountTai avatar Feb 16 '23 12:02 iMountTai

Found a very similar issue #2758. Please try using a smaller batch size (e.g. 1).

JThh avatar Feb 17 '23 13:02 JThh

image After installing ColossalAI from source, the run hangs at this point every time.

iMountTai avatar Feb 20 '23 12:02 iMountTai

Please install apex before trying again. Follow the instructions here.

JThh avatar Feb 21 '23 05:02 JThh

@JThh this is not about apex. I can take this over.

FrankLeeeee avatar Feb 21 '23 05:02 FrankLeeeee

@iMountTai most likely your CUDA environment does not match the one required by PyTorch. You can check this as follows:

  1. Check your system CUDA version:
nvcc -V
  2. Check your PyTorch's CUDA version:
import torch
print(torch.version.cuda)

Compilation will only run successfully if they match.
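
The two checks above can be combined into one script. What follows is a minimal sketch, not part of ColossalAI: the helper names `parse_nvcc_release`, `system_cuda_version`, and `cuda_versions_match` are hypothetical, and the strict major.minor comparison is an assumption based on the statement that the versions must match.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract 'major.minor' from the 'release X.Y' text printed by nvcc."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def system_cuda_version() -> Optional[str]:
    """Return the CUDA version reported by nvcc, or None if nvcc is not on PATH."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run(
        [nvcc, "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


def cuda_versions_match(system_version: Optional[str],
                        torch_version: Optional[str]) -> bool:
    """True only when both versions are known and share the same major.minor.

    Assumption: a strict major.minor match is what is required.
    """
    if system_version is None or torch_version is None:
        return False
    return system_version.split(".")[:2] == torch_version.split(".")[:2]


if __name__ == "__main__":
    try:
        import torch
        torch_cuda = torch.version.cuda  # None for CPU-only builds
    except ImportError:
        torch_cuda = None
    sys_cuda = system_cuda_version()
    print(f"system CUDA (nvcc): {sys_cuda}")
    print(f"PyTorch CUDA:       {torch_cuda}")
    print("match" if cuda_versions_match(sys_cuda, torch_cuda)
          else "mismatch or unknown")
```

If the script reports a mismatch, for example a cu113 PyTorch build on a system whose nvcc reports 11.7, installing a PyTorch build that targets the system CUDA version is one way to bring them in line.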

FrankLeeeee avatar Feb 21 '23 06:02 FrankLeeeee

After upgrading the environment to torch1.13cu117, the problem was solved, although I don't know why :joy:

iMountTai avatar Feb 21 '23 08:02 iMountTai

After upgrading the environment to torch1.13cu117, the problem was solved, although I don't know why 😂

Maybe your system CUDA is 11.7 :)

FrankLeeeee avatar Feb 21 '23 08:02 FrankLeeeee

May I ask why the operators have to be rebuilt on every run? Can't the previously cached build be loaded directly?

iMountTai avatar Feb 21 '23 08:02 iMountTai

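May I ask why the operators have to be rebuilt on every run? Can't the previously cached build be loaded directly?

After the first compilation, every later run just reloads the compiled operators.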

FrankLeeeee avatar Feb 21 '23 08:02 FrankLeeeee

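Thanks for the prompt reply. The problem still exists with torch1.12cu113. image

After upgrading the environment to torch1.13cu117, the problem was solved, although I don't know why 😂

Maybe your system CUDA is 11.7 :)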

iMountTai avatar Feb 21 '23 09:02 iMountTai

Glad to hear it was resolved. Thanks.

binmakeswell avatar Apr 18 '23 11:04 binmakeswell