ColossalAI
[BUG]: failed to run /ColossalAI/examples/language/gpt/gemini
🐛 Describe the bug
I installed the required packages according to the README.md, but the example still fails to run. The error is as follows:
- export DISTPLAN=CAI_Gemini
- DISTPLAN=CAI_Gemini
- export GPUNUM=1
- GPUNUM=1
- export TPDEGREE=1
- TPDEGREE=1
- export PLACEMENT=cpu
- PLACEMENT=cpu
- export USE_SHARD_INIT=False
- USE_SHARD_INIT=False
- export BATCH_SIZE=16
- BATCH_SIZE=16
- export MODEL_TYPE=gpt2_medium
- MODEL_TYPE=gpt2_medium
- export TRAIN_STEP=10
- TRAIN_STEP=10
- '[' False = True ']'
- USE_SHARD_INIT=
- mkdir -p gemini_logs
- torchrun --standalone --nproc_per_node=1 ./train_gpt_demo.py --tp_degree=1 --model_type=gpt2_medium --batch_size=16 --placement=cpu --distplan=CAI_Gemini --train_step=10
- tee ./gemini_logs/gpt2_medium_CAI_Gemini_gpu_1_bs_16_tp_1_cpu.log
environmental variable OMP_NUM_THREADS is set to 80.
[03/04/23 03:03:48] INFO colossalai - colossalai - INFO: /home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[03/04/23 03:03:50] INFO colossalai - colossalai - INFO: /home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024, the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/initialize.py:116 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
INFO colossalai - colossalai - INFO: ./train_gpt_demo.py:211 main
INFO colossalai - colossalai - INFO: gpt2_medium, CAI_Gemini, batch size 16
=========================================================================================
No pre-built kernel is found, build and load the cpu_adam kernel during runtime now
=========================================================================================
Traceback (most recent call last):
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 135, in load
    op_module = self.import_op()
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 118, in import_op
    return importlib.import_module(self.prebuilt_import_path)
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'colossalai._C.cpu_adam'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./train_gpt_demo.py", line 353, in <module>
    main()
  File "./train_gpt_demo.py", line 255, in main
    optimizer = HybridAdam(model.parameters(), lr=1e-3)
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/nn/optimizer/hybrid_adam.py", line 82, in __init__
    cpu_optim = CPUAdamBuilder().load()
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 159, in load
    extra_include_paths=self.strip_empty_entries(self.include_dirs()),
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/cpu_adam.py", line 25, in include_dirs
    self.get_cuda_home_include()
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 71, in get_cuda_home_include
    raise RuntimeError("CUDA_HOME is None, please set CUDA_HOME to compile C++/CUDA kernels in ColossalAI.")
RuntimeError: CUDA_HOME is None, please set CUDA_HOME to compile C++/CUDA kernels in ColossalAI.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 59229) of binary: /home/panz/anaconda3/envs/gpt/bin/python
Traceback (most recent call last):
  File "/home/panz/anaconda3/envs/gpt/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./train_gpt_demo.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time       : 2023-03-04_03:03:56
  host       : zztf-gpu02
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 59229)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
The output of colossalai check -i is as follows:
Installation Report
------------ Environment ------------
Colossal-AI version: 0.2.5
PyTorch version: 1.12.0
CUDA version: N/A
CUDA version required by PyTorch: 11.3
Note:
- The table above checks the versions of the libraries/tools in the current environment
- If the CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it
------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: x
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A
Note:
- AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment variable CUDA_EXT=1 is set
- If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime
------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: N/A
System and Colossal-AI CUDA version match: N/A
Note:
- The table above checks the version compatibility of the libraries/tools in the current environment
- PyTorch version mismatch: whether the PyTorch version in the current environment is compatible with the PyTorch version used for AOT compilation
- System and PyTorch CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version required by PyTorch
- System and Colossal-AI CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version used for AOT compilation
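Regarding the CUDA_HOME note above: the CUDA version showing N/A usually means CUDA_HOME is simply unset, not necessarily that no toolkit is installed. A minimal way to check what is actually available (the /usr/local paths below are an assumption; adjust them to the machine):

# Look for an nvcc binary and any CUDA toolkit directories (assumed default locations)
which nvcc
ls -d /usr/local/cuda* 2>/dev/null
/usr/local/cuda/bin/nvcc --version   # prints the toolkit release if one is installed there

Either way of building the cpu_adam kernel, ahead of time with CUDA_EXT=1 at install time or at runtime as in the log above, needs a local CUDA toolkit that CUDA_HOME points to.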
Environment
The packages in my virtual environment are listed below, and the gcc version is 7.5.0.
Package Version
apex 0.1
bcrypt 4.0.1
certifi 2022.12.7
cffi 1.15.1
cfgv 3.3.1
charset-normalizer 3.0.1
click 8.1.3
colossalai 0.2.5
contexttimer 0.3.3
cryptography 39.0.2
distlib 0.3.6
fabric 3.0.0
filelock 3.9.0
huggingface-hub 0.12.1
identify 2.5.18
idna 3.4
invoke 2.0.0
markdown-it-py 2.2.0
mdurl 0.1.2
ninja 1.11.1
nodeenv 1.7.0
numpy 1.24.2
packaging 23.0
paramiko 3.0.0
Pillow 9.4.0
pip 22.3.1
platformdirs 3.0.0
pre-commit 3.1.1
psutil 5.9.4
pycparser 2.21
Pygments 2.14.0
PyNaCl 1.5.0
PyYAML 6.0
regex 2022.10.31
requests 2.28.2
rich 13.3.1
setuptools 65.6.3
tokenizers 0.13.2
torch 1.12.0+cu113
torchaudio 0.12.0+cu113
torchvision 0.13.0+cu113
tqdm 4.64.1
transformers 4.26.1
typing_extensions 4.5.0
urllib3 1.26.14
virtualenv 20.20.0
wheel 0.38.4
Can you try reinstalling PyTorch to 1.13?
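One way to do that reinstall, assuming the CUDA 11.7 wheels are the right target here (the exact command for a given setup is listed at https://pytorch.org/get-started/previous-versions/):

# Assumed package versions: torch 1.13.1 cu117 wheels plus the matching torchvision/torchaudio
pip uninstall -y torch torchvision torchaudio
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 \
    --extra-index-url https://download.pytorch.org/whl/cu117
python -c "import torch; print(torch.__version__, torch.version.cuda)"   # expect 1.13.1+cu117 and 11.7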
Still failed, and when I run colossalai check -i, the CUDA version is still N/A.
Configure the CUDA_HOME environment variable.
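A minimal sketch of that, assuming the toolkit is installed under /usr/local/cuda-11.7 (substitute whatever path which nvcc or the listing above reports):

# Point CUDA_HOME at the toolkit root (assumed path), not at the nvcc binary itself
export CUDA_HOME=/usr/local/cuda-11.7
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
colossalai check -i   # the System CUDA version row should now show a real version instead of N/A

Adding these exports to ~/.bashrc or to run_gemini.sh keeps them in effect for later runs.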
Configure the CUDA_HOME environment variable.
Does CUDA 12.1 need to be rolled back?
Configure the CUDA_HOME environment variable.
After setting it, colossalai check -i shows:
Installation Report
------------ Environment ------------
Colossal-AI version: 0.2.5
PyTorch version: 1.13.1
System CUDA version: 9.1
CUDA version required by PyTorch: 11.7
Note:
- The table above checks the versions of the libraries/tools in the current environment
- If the System CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it
- If the CUDA version required by PyTorch is N/A, you probably did not install a CUDA-compatible PyTorch. This value is given by torch.version.cuda and you can go to https://pytorch.org/get-started/locally/ to download the correct version.
------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: x
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A
Note:
- AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment variable CUDA_EXT=1 is set
- If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime
------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: x
System and Colossal-AI CUDA version match: N/A
Note:
- The table above checks the version compatibility of the libraries/tools in the current environment
- PyTorch version mismatch: whether the PyTorch version in the current environment is compatible with the PyTorch version used for AOT compilation
- System and PyTorch CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version required by PyTorch
- System and Colossal-AI CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version used for AOT compilation
A new error: colossalai._C.cpu_adam is still missing. Output of bash run_gemini.sh:
- export DISTPLAN=CAI_Gemini
- DISTPLAN=CAI_Gemini
- export GPUNUM=1
- GPUNUM=1
- export TPDEGREE=1
- TPDEGREE=1
- export PLACEMENT=cpu
- PLACEMENT=cpu
- export USE_SHARD_INIT=False
- USE_SHARD_INIT=False
- export BATCH_SIZE=16
- BATCH_SIZE=16
- export MODEL_TYPE=gpt2_medium
- MODEL_TYPE=gpt2_medium
- export TRAIN_STEP=10
- TRAIN_STEP=10
- '[' False = True ']'
- USE_SHARD_INIT=
- mkdir -p gemini_logs
- torchrun --standalone --nproc_per_node=1 ./train_gpt_demo.py --tp_degree=1 --model_type=gpt2_medium --batch_size=16 --placement=cpu --distplan=CAI_Gemini --train_step=10
- tee ./gemini_logs/gpt2_medium_CAI_Gemini_gpu_1_bs_16_tp_1_cpu.log
/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/library.py:130: UserWarning: Overriding a previously registered kernel for the same operator and the same dispatch key
operator: aten::index.Tensor(Tensor self, Tensor?[] indices) -> Tensor
registered at aten/src/ATen/RegisterSchema.cpp:6
dispatch key: Meta
previous kernel: registered at ../aten/src/ATen/functorch/BatchRulesScatterOps.cpp:1053
new kernel: registered at /dev/null:219 (Triggered internally at ../aten/src/ATen/core/dispatch/OperatorEntry.cpp:150.)
self.m.impl(name, dispatch_key, fn)
environmental variable OMP_NUM_THREADS is set to 80.
/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/cuda/init.py:497: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
[03/07/23 06:11:07] INFO colossalai - colossalai - INFO: /home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[03/07/23 06:11:09] INFO colossalai - colossalai - INFO: /home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024, the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/initialize.py:116 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
INFO colossalai - colossalai - INFO: ./train_gpt_demo.py:211 main
INFO colossalai - colossalai - INFO: gpt2_medium, CAI_Gemini, batch size 16
Traceback (most recent call last):
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 159, in load
    op_module = self.import_op()
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 110, in import_op
    return importlib.import_module(self.prebuilt_import_path)
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'colossalai._C.cpu_adam'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./train_gpt_demo.py", line 353, in <module>
    main()
  File "./train_gpt_demo.py", line 255, in main
    optimizer = HybridAdam(model.parameters(), lr=1e-3)
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/nn/optimizer/hybrid_adam.py", line 82, in __init__
    cpu_optim = CPUAdamBuilder().load()
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 165, in load
    self.check_runtime_build_environment()
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 139, in check_runtime_build_environment
    check_system_pytorch_cuda_match(CUDA_HOME)
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/utils.py", line 87, in check_system_pytorch_cuda_match
    raise Exception(
Exception: [extension] Failed to build PyTorch extension because the detected CUDA version (9.1) mismatches the version that was used to compile PyTorch (11.7). Please make sure you have set the CUDA_HOME correctly and installed the correct PyTorch in https://pytorch.org/get-started/locally/ .
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 49444) of binary: /home/panz/anaconda3/envs/gpt/bin/python
Traceback (most recent call last):
  File "/home/panz/anaconda3/envs/gpt/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./train_gpt_demo.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time       : 2023-03-07_06:11:16
  host       : zztf-gpu02
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 49444)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
This problem seems to be solved now; CUDA needs to be upgraded to 11.7.
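A rough way to confirm the versions line up after the upgrade, assuming CUDA 11.7 ends up under /usr/local/cuda-11.7:

export CUDA_HOME=/usr/local/cuda-11.7                  # assumed install path for the new toolkit
$CUDA_HOME/bin/nvcc --version                          # should report release 11.7
python -c "import torch; print(torch.version.cuda)"    # should also print 11.7
colossalai check -i                                    # "System and PyTorch CUDA version match" should now pass
bash run_gemini.sh                                     # the cpu_adam kernel should now build at runtime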
Glad to hear it was resolved. Thanks.