[BUG]: failed to run ..ColossalAI/examples/language/gpt/gemini

Open zp2459 opened this issue 1 year ago • 17 comments

🐛 Describe the bug

I set up the environment by following README.md, but the example does not run. The error is as follows:

No pre-built kernel is found, build and load the cpu_adam kernel during runtime now

Traceback (most recent call last):
  File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 135, in load
    op_module = self.import_op()
  File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 118, in import_op
    return importlib.import_module(self.prebuilt_import_path)
  File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'colossalai._C.cpu_adam'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./train_bert_demo.py", line 332, in <module>
    main()
  File "./train_bert_demo.py", line 231, in main
    optimizer = HybridAdam(model.parameters(), lr=1e-3)
  File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/colossalai/nn/optimizer/hybrid_adam.py", line 82, in __init__
    cpu_optim = CPUAdamBuilder().load()
  File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 159, in load
    extra_include_paths=self.strip_empty_entries(self.include_dirs()),
  File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/cpu_adam.py", line 25, in include_dirs
    self.get_cuda_home_include()
  File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 71, in get_cuda_home_include
    raise RuntimeError("CUDA_HOME is None, please set CUDA_HOME to compile C++/CUDA kernels in ColossalAI.")
RuntimeError: CUDA_HOME is None, please set CUDA_HOME to compile C++/CUDA kernels in ColossalAI.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 37184) of binary: /home/panz/anaconda3/envs/chatgpt/bin/python
Traceback (most recent call last):
  File "/home/panz/anaconda3/envs/chatgpt/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./train_bert_demo.py FAILED

Failures: <NO_OTHER_FAILURES>
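
For anyone reading the log above: the RuntimeError says the kernel builder could not resolve a CUDA toolkit location, so the runtime build of cpu_adam stops before compiling anything. The following is a minimal sketch of how such a location is commonly resolved. The function name find_cuda_home and the lookup order (CUDA_HOME, then CUDA_PATH, then nvcc on PATH, then /usr/local/cuda) are assumptions for illustration, not code taken from ColossalAI or PyTorch.

```python
import os
import shutil


def find_cuda_home(env=None, which=shutil.which, exists=os.path.exists):
    """Guess the CUDA toolkit root (hypothetical helper, assumed lookup order)."""
    env = os.environ if env is None else env
    # 1. An explicitly exported variable takes priority.
    for var in ("CUDA_HOME", "CUDA_PATH"):
        if env.get(var):
            return env[var]
    # 2. Otherwise derive the root from the compiler path: <root>/bin/nvcc.
    nvcc = which("nvcc")
    if nvcc:
        return os.path.dirname(os.path.dirname(nvcc))
    # 3. Fall back to the conventional Linux install location, if present.
    default = "/usr/local/cuda"
    return default if exists(default) else None


if __name__ == "__main__":
    home = find_cuda_home()
    if home is None:
        print("No CUDA toolkit found; install one or export CUDA_HOME.")
    else:
        print(f"CUDA toolkit root: {home}")
```

If this prints that no toolkit was found, exporting CUDA_HOME to the directory that contains bin/nvcc before launching the training script is the fix the error message itself suggests.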

Environment

torch 1.13.1, transformers 4.26.1, colossalai 0.2.5

zp2459 avatar Mar 02 '23 09:03 zp2459

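When I run the chatGPT project, it stays stuck at "No pre-built kernel is found, build and load the cpu_adam kernel during runtime now". GPU memory usage shows the model has been loaded, but nothing progresses. Could this be because the machine is on an intranet with no internet access?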

pilipala818 avatar Mar 03 '23 02:03 pilipala818

Could you share the output of colossalai check -i?

FrankLeeeee avatar Mar 03 '23 02:03 FrankLeeeee

Installation Report

------------ Environment ------------
Colossal-AI version: 0.2.5
PyTorch version: 1.12.0
CUDA version: N/A
CUDA version required by PyTorch: 11.3

Note:

  1. The table above checks the versions of the libraries/tools in the current environment
  2. If the CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it

------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: x
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A

Note:

  1. AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment varialbe CUDA_EXT=1 is set
  2. If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime

------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: N/A
System and Colossal-AI CUDA version match: N/A

Note:

  1. The table above checks the version compatibility of the libraries/tools in the current environment
    • PyTorch version mistach: whether the PyTorch version in the current environment is compatible with the PyTorch version used for AOT compilation
    • System and PyTorch CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version required by PyTorch
    • System and Colossal-AI CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version used for AOT compilation

zp2459 avatar Mar 03 '23 02:03 zp2459

Colossal-AI version: 0.2.5
PyTorch version: 1.13.1
CUDA version: 10.1
CUDA version required by PyTorch: 11.7

Note:

  1. The table above checks the versions of the libraries/tools in the current environment
  2. If the CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it

------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: x
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A

Note:

  1. AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment varialbe CUDA_EXT=1 is set
  2. If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime

------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: ×
System and Colossal-AI CUDA version match: N/A

Note:

  1. The table above checks the version compatibility of the libraries/tools in the current environment
    • PyTorch version mistach: whether the PyTorch version in the current environment is compatible with the PyTorch version used for AOT compilation
    • System and PyTorch CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version required by PyTorch
    • System and Colossal-AI CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version used for AOT compilation
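
The × next to "System and PyTorch CUDA version match" in this report is consistent with the installed toolkit being CUDA 10.1 while this PyTorch build expects 11.7. Below is a minimal sketch of that kind of comparison. The helper names and the matching rule (major and minor must both be equal) are assumptions for illustration and are not taken from the Colossal-AI source.

```python
def parse_version(text):
    """Turn a string such as '11.7' into (11, 7); return None for 'N/A' or empty input."""
    if not text or text.strip().upper() == "N/A":
        return None
    parts = text.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def cuda_versions_match(system_cuda, torch_cuda):
    """Return None when either version is unknown, otherwise True or False.

    Assumed rule: the versions match only if major and minor are both equal.
    """
    system = parse_version(system_cuda)
    required = parse_version(torch_cuda)
    if system is None or required is None:
        return None
    return system == required


if __name__ == "__main__":
    # Values taken from the two reports in this thread.
    print(cuda_versions_match("N/A", "11.3"))
    print(cuda_versions_match("10.1", "11.7"))
```

Under this rule the first report yields an unknown result (no toolkit located) and the second yields a mismatch, which lines up with the N/A and × entries shown in the two reports.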

pilipala818 avatar Mar 03 '23 03:03 pilipala818

It works with CUDA 10.1, torch 1.10.0, and torchvision 0.11.1.

pilipala818 avatar Mar 03 '23 06:03 pilipala818

Did you get it working? Could I have your contact details? This is my first time working with this kind of project, and I have some questions for you.

zp2459 avatar Mar 03 '23 06:03 zp2459

Check your gcc version and try upgrading it.

joan126 avatar Mar 03 '23 13:03 joan126

My gcc version is 7.5.0.
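
For others who want to report the same detail, here is a minimal sketch of reading the gcc version programmatically and comparing it with a required minimum. The helper names are hypothetical, the version parsing is a heuristic on the first line of gcc --version, and the minimum is left as a parameter because this thread does not state which gcc version the kernels need.

```python
import re
import subprocess


def parse_gcc_version(output):
    """Extract (major, minor, patch) from the first line of `gcc --version` output."""
    first_line = output.splitlines()[0] if output else ""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", first_line)
    return tuple(int(group) for group in match.groups()) if match else None


def gcc_version(binary="gcc"):
    """Run the compiler and return its version tuple, or None if it cannot be run."""
    try:
        result = subprocess.run(
            [binary, "--version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_gcc_version(result.stdout)


def meets_minimum(version, minimum):
    """Compare version tuples; an unknown version never satisfies the minimum."""
    return version is not None and version >= minimum


if __name__ == "__main__":
    found = gcc_version()
    print("gcc version:", found)
    # (9, 0, 0) is a placeholder minimum chosen only for demonstration.
    print("meets placeholder minimum:", meets_minimum(found, (9, 0, 0)))
```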

zp2459 avatar Mar 03 '23 13:03 zp2459

How about installing colossalai from source?

joan126 avatar Mar 04 '23 08:03 joan126

Is there no follow-up on this?

Cloopen-ReLiNK avatar Mar 07 '23 10:03 Cloopen-ReLiNK

I ran into the same problem.

Cloopen-ReLiNK avatar Mar 07 '23 10:03 Cloopen-ReLiNK

Has this been solved?

wenzezhang avatar Mar 27 '23 08:03 wenzezhang

Hi, sorry for getting to this late.

Would this issue be of any help? https://github.com/hpcaitech/ColossalAI/issues/3496

JThh avatar Apr 13 '23 07:04 JThh

We have updated a lot. Please check the latest code. This issue was closed due to inactivity. Thanks.

binmakeswell avatar May 05 '23 04:05 binmakeswell