ColossalAI
[BUG]: failed to run /ColossalAI/examples/language/gpt/gemini
🐛 Describe the bug
I installed the required packages according to the README.md, but the example still fails to run. The error is as follows:
- export DISTPLAN=CAI_Gemini
- DISTPLAN=CAI_Gemini
- export GPUNUM=1
- GPUNUM=1
- export TPDEGREE=1
- TPDEGREE=1
- export PLACEMENT=cpu
- PLACEMENT=cpu
- export USE_SHARD_INIT=False
- USE_SHARD_INIT=False
- export BATCH_SIZE=16
- BATCH_SIZE=16
- export MODEL_TYPE=gpt2_medium
- MODEL_TYPE=gpt2_medium
- export TRAIN_STEP=10
- TRAIN_STEP=10
- '[' False = True ']'
- USE_SHARD_INIT=
- mkdir -p gemini_logs
- torchrun --standalone --nproc_per_node=1 ./train_gpt_demo.py --tp_degree=1 --model_type=gpt2_medium --batch_size=16 --placement=cpu --distplan=CAI_Gemini --train_step=10
- tee ./gemini_logs/gpt2_medium_CAI_Gemini_gpu_1_bs_16_tp_1_cpu.log
environmental variable OMP_NUM_THREADS is set to 80.
[03/04/23 03:03:48] INFO colossalai - colossalai - INFO: /home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[03/04/23 03:03:50] INFO colossalai - colossalai - INFO: /home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024, the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/initialize.py:116 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
INFO colossalai - colossalai - INFO: ./train_gpt_demo.py:211 main
INFO colossalai - colossalai - INFO: gpt2_medium, CAI_Gemini, batch size 16
=========================================================================================
No pre-built kernel is found, build and load the cpu_adam kernel during runtime now
=========================================================================================
Traceback (most recent call last):
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 135, in load
    op_module = self.import_op()
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 118, in import_op
    return importlib.import_module(self.prebuilt_import_path)
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'colossalai._C.cpu_adam'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./train_gpt_demo.py", line 353, in <module>
    main()
  File "./train_gpt_demo.py", line 255, in main
    optimizer = HybridAdam(model.parameters(), lr=1e-3)
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/nn/optimizer/hybrid_adam.py", line 82, in __init__
    cpu_optim = CPUAdamBuilder().load()
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 159, in load
    extra_include_paths=self.strip_empty_entries(self.include_dirs()),
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/cpu_adam.py", line 25, in include_dirs
    self.get_cuda_home_include()
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 71, in get_cuda_home_include
    raise RuntimeError("CUDA_HOME is None, please set CUDA_HOME to compile C++/CUDA kernels in ColossalAI.")
RuntimeError: CUDA_HOME is None, please set CUDA_HOME to compile C++/CUDA kernels in ColossalAI.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 59229) of binary: /home/panz/anaconda3/envs/gpt/bin/python
Traceback (most recent call last):
  File "/home/panz/anaconda3/envs/gpt/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./train_gpt_demo.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time       : 2023-03-04_03:03:56
  host       : zztf-gpu02
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 59229)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
The output of colossalai check -i is as follows:
Installation Report
------------ Environment ------------
Colossal-AI version: 0.2.5
PyTorch version: 1.12.0
CUDA version: N/A
CUDA version required by PyTorch: 11.3
Note:
- The table above checks the versions of the libraries/tools in the current environment
- If the CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it
------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: x
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A
Note:
- AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment variable CUDA_EXT=1 is set
- If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime
------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: N/A
System and Colossal-AI CUDA version match: N/A
Note:
- The table above checks the version compatibility of the libraries/tools in the current environment
- PyTorch version mismatch: whether the PyTorch version in the current environment is compatible with the PyTorch version used for AOT compilation
- System and PyTorch CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version required by PyTorch
- System and Colossal-AI CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version used for AOT compilation
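Regarding the CUDA_HOME note above: the CUDA version showing N/A usually means CUDA_HOME is simply unset, not necessarily that no toolkit is installed. A minimal way to check what is actually available (the /usr/local paths below are an assumption; adjust them to the machine):

# Look for an nvcc binary and any CUDA toolkit directories (assumed default locations)
which nvcc
ls -d /usr/local/cuda* 2>/dev/null
/usr/local/cuda/bin/nvcc --version   # prints the toolkit release if one is installed there

Either way of building the cpu_adam kernel, ahead of time with CUDA_EXT=1 at install time or at runtime as in the log above, needs a local CUDA toolkit that CUDA_HOME points to.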
Environment
The packages in my virtual environment are listed below, and the gcc version is 7.5.0.
Package Version
apex 0.1
bcrypt 4.0.1
certifi 2022.12.7
cffi 1.15.1
cfgv 3.3.1
charset-normalizer 3.0.1
click 8.1.3
colossalai 0.2.5
contexttimer 0.3.3
cryptography 39.0.2
distlib 0.3.6
fabric 3.0.0
filelock 3.9.0
huggingface-hub 0.12.1
identify 2.5.18
idna 3.4
invoke 2.0.0
markdown-it-py 2.2.0
mdurl 0.1.2
ninja 1.11.1
nodeenv 1.7.0
numpy 1.24.2
packaging 23.0
paramiko 3.0.0
Pillow 9.4.0
pip 22.3.1
platformdirs 3.0.0
pre-commit 3.1.1
psutil 5.9.4
pycparser 2.21
Pygments 2.14.0
PyNaCl 1.5.0
PyYAML 6.0
regex 2022.10.31
requests 2.28.2
rich 13.3.1
setuptools 65.6.3
tokenizers 0.13.2
torch 1.12.0+cu113
torchaudio 0.12.0+cu113
torchvision 0.13.0+cu113
tqdm 4.64.1
transformers 4.26.1
typing_extensions 4.5.0
urllib3 1.26.14
virtualenv 20.20.0
wheel 0.38.4
Can you try reinstalling PyTorch to 1.13?
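One way to do that reinstall, assuming the CUDA 11.7 wheels are the right target here (the exact command for a given setup is listed at https://pytorch.org/get-started/previous-versions/):

# Assumed package versions: torch 1.13.1 cu117 wheels plus the matching torchvision/torchaudio
pip uninstall -y torch torchvision torchaudio
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 \
    --extra-index-url https://download.pytorch.org/whl/cu117
python -c "import torch; print(torch.__version__, torch.version.cuda)"   # expect 1.13.1+cu117 and 11.7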
Still failed, and when I run colossalai check -i, the CUDA version is still N/A.
Configure the CUDA_HOME environment variable.
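A minimal sketch of that, assuming the toolkit is installed under /usr/local/cuda-11.7 (substitute whatever path which nvcc or the listing above reports):

# Point CUDA_HOME at the toolkit root (assumed path), not at the nvcc binary itself
export CUDA_HOME=/usr/local/cuda-11.7
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
colossalai check -i   # the System CUDA version row should now show a real version instead of N/A

Adding these exports to ~/.bashrc or to run_gemini.sh keeps them in effect for later runs.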
Configure the CUDA_HOME environment variable.
Does CUDA 12.1 need to be rolled back?
Configure the CUDA_HOME environment variable.
After setting it, colossalai check -i shows:
Installation Report
------------ Environment ------------
Colossal-AI version: 0.2.5
PyTorch version: 1.13.1
System CUDA version: 9.1
CUDA version required by PyTorch: 11.7
Note:
- The table above checks the versions of the libraries/tools in the current environment
- If the System CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it
- If the CUDA version required by PyTorch is N/A, you probably did not install a CUDA-compatible PyTorch. This value is given by torch.version.cuda and you can go to https://pytorch.org/get-started/locally/ to download the correct version.
------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: x
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A
Note:
- AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment variable CUDA_EXT=1 is set
- If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime
------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: x
System and Colossal-AI CUDA version match: N/A
Note:
- The table above checks the version compatibility of the libraries/tools in the current environment
- PyTorch version mismatch: whether the PyTorch version in the current environment is compatible with the PyTorch version used for AOT compilation
- System and PyTorch CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version required by PyTorch
- System and Colossal-AI CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version used for AOT compilation
A new error: colossalai._C.cpu_adam is still missing. Output of bash run_gemini.sh:
- export DISTPLAN=CAI_Gemini
- DISTPLAN=CAI_Gemini
- export GPUNUM=1
- GPUNUM=1
- export TPDEGREE=1
- TPDEGREE=1
- export PLACEMENT=cpu
- PLACEMENT=cpu
- export USE_SHARD_INIT=False
- USE_SHARD_INIT=False
- export BATCH_SIZE=16
- BATCH_SIZE=16
- export MODEL_TYPE=gpt2_medium
- MODEL_TYPE=gpt2_medium
- export TRAIN_STEP=10
- TRAIN_STEP=10
- '[' False = True ']'
- USE_SHARD_INIT=
- mkdir -p gemini_logs
- torchrun --standalone --nproc_per_node=1 ./train_gpt_demo.py --tp_degree=1 --model_type=gpt2_medium --batch_size=16 --placement=cpu --distplan=CAI_Gemini --train_step=10
- tee ./gemini_logs/gpt2_medium_CAI_Gemini_gpu_1_bs_16_tp_1_cpu.log
/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/library.py:130: UserWarning: Overriding a previously registered kernel for the same operator and the same dispatch key
operator: aten::index.Tensor(Tensor self, Tensor?[] indices) -> Tensor
registered at aten/src/ATen/RegisterSchema.cpp:6
dispatch key: Meta
previous kernel: registered at ../aten/src/ATen/functorch/BatchRulesScatterOps.cpp:1053
new kernel: registered at /dev/null:219 (Triggered internally at ../aten/src/ATen/core/dispatch/OperatorEntry.cpp:150.)
self.m.impl(name, dispatch_key, fn)
environmental variable OMP_NUM_THREADS is set to 80.
/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/cuda/init.py:497: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
[03/07/23 06:11:07] INFO colossalai - colossalai - INFO: /home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[03/07/23 06:11:09] INFO colossalai - colossalai - INFO: /home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024, the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/initialize.py:116 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
INFO colossalai - colossalai - INFO: ./train_gpt_demo.py:211 main
INFO colossalai - colossalai - INFO: gpt2_medium, CAI_Gemini, batch size 16
Traceback (most recent call last):
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 159, in load
    op_module = self.import_op()
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 110, in import_op
    return importlib.import_module(self.prebuilt_import_path)
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'colossalai._C.cpu_adam'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./train_gpt_demo.py", line 353, in <module>
    main()
  File "./train_gpt_demo.py", line 255, in main
    optimizer = HybridAdam(model.parameters(), lr=1e-3)
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/nn/optimizer/hybrid_adam.py", line 82, in __init__
    cpu_optim = CPUAdamBuilder().load()
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 165, in load
    self.check_runtime_build_environment()
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 139, in check_runtime_build_environment
    check_system_pytorch_cuda_match(CUDA_HOME)
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/utils.py", line 87, in check_system_pytorch_cuda_match
    raise Exception(
Exception: [extension] Failed to build PyTorch extension because the detected CUDA version (9.1) mismatches the version that was used to compile PyTorch (11.7). Please make sure you have set the CUDA_HOME correctly and installed the correct PyTorch in https://pytorch.org/get-started/locally/ .
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 49444) of binary: /home/panz/anaconda3/envs/gpt/bin/python
Traceback (most recent call last):
  File "/home/panz/anaconda3/envs/gpt/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/panz/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./train_gpt_demo.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time       : 2023-03-07_06:11:16
  host       : zztf-gpu02
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 49444)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
This problem seems to be solved now; CUDA needs to be upgraded to 11.7.
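A rough way to confirm the versions line up after the upgrade, assuming CUDA 11.7 ends up under /usr/local/cuda-11.7:

export CUDA_HOME=/usr/local/cuda-11.7                  # assumed install path for the new toolkit
$CUDA_HOME/bin/nvcc --version                          # should report release 11.7
python -c "import torch; print(torch.version.cuda)"    # should also print 11.7
colossalai check -i                                    # "System and PyTorch CUDA version match" should now pass
bash run_gemini.sh                                     # the cpu_adam kernel should now build at runtime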
Glad to hear it was resolved. Thanks.