🐛 Describe the bug

I set up the environment following README.md, but the example does not run. The error is as follows:

No pre-built kernel is found, build and load the cpu_adam kernel during runtime now

Traceback (most recent call last):
  File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 135, in load
    op_module = self.import_op()
  File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 118, in import_op
    return importlib.import_module(self.prebuilt_import_path)
  File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'colossalai._C.cpu_adam'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./train_bert_demo.py", line 332, in <module>
    main()
  File "./train_bert_demo.py", line 231, in main
    optimizer = HybridAdam(model.parameters(), lr=1e-3)
  File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/colossalai/nn/optimizer/hybrid_adam.py", line 82, in __init__
    cpu_optim = CPUAdamBuilder().load()
  File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 159, in load
    extra_include_paths=self.strip_empty_entries(self.include_dirs()),
  File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/cpu_adam.py", line 25, in include_dirs
    self.get_cuda_home_include()
  File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 71, in get_cuda_home_include
    raise RuntimeError("CUDA_HOME is None, please set CUDA_HOME to compile C++/CUDA kernels in ColossalAI.")
RuntimeError: CUDA_HOME is None, please set CUDA_HOME to compile C++/CUDA kernels in ColossalAI.

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 37184) of binary: /home/panz/anaconda3/envs/chatgpt/bin/python
Traceback (most recent call last):
  File "/home/panz/anaconda3/envs/chatgpt/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./train_bert_demo.py FAILED ./train_bert_demo.py FAILED

Failures: <NO_OTHER_FAILURES>

Environment

torch 1.13.1, transformers 4.26.1, colossalai 0.2.5

Mar 02 '23 09:03 zp2459
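
The RuntimeError in the report is raised because no CUDA toolkit location could be determined. As a rough diagnostic, the sketch below looks in the places a toolkit is commonly found: the CUDA_HOME and CUDA_PATH environment variables, the directory two levels above an nvcc found on PATH, and /usr/local/cuda. The function name find_cuda_home and the search order are assumptions made for illustration; this is not ColossalAI's or PyTorch's actual lookup code.

```python
import os
import shutil
from typing import Optional


def find_cuda_home(default: str = "/usr/local/cuda") -> Optional[str]:
    """Best-effort guess at the CUDA toolkit root (hypothetical helper)."""
    # 1. An explicitly set environment variable wins.
    for var in ("CUDA_HOME", "CUDA_PATH"):
        value = os.environ.get(var)
        if value:
            return value
    # 2. Otherwise derive the root from nvcc, e.g. /opt/cuda/bin/nvcc -> /opt/cuda.
    nvcc = shutil.which("nvcc")
    if nvcc:
        return os.path.dirname(os.path.dirname(os.path.realpath(nvcc)))
    # 3. Fall back to the conventional Linux install location, if it exists.
    return default if os.path.isdir(default) else None


if __name__ == "__main__":
    home = find_cuda_home()
    if home is None:
        print("No CUDA toolkit found; install one or export CUDA_HOME.")
    else:
        print(f"Candidate CUDA_HOME: {home}")
```

If this prints a candidate path, exporting it as CUDA_HOME before launching the training script is the step the error message asks for. If it finds nothing, the machine likely has only the driver or the PyTorch-bundled runtime and no full toolkit with nvcc.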

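When I run the ChatGPT project, it stays stuck at "No pre-built kernel is found, build and load the cpu_adam kernel during runtime now". GPU memory usage shows the model has been loaded, but nothing progresses. Could this be because the machine is on an intranet with no internet access?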

Mar 03 '23 02:03 pilipala818

Could you share the output of colossalai check -i?

Mar 03 '23 02:03 FrankLeeeee

Installation Report

------------ Environment ------------
Colossal-AI version: 0.2.5
PyTorch version: 1.12.0
CUDA version: N/A
CUDA version required by PyTorch: 11.3

Note:

The table above checks the versions of the libraries/tools in the current environment
If the CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it

------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: x
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A

Note:

AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment varialbe CUDA_EXT=1 is set
If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime

------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: N/A
System and Colossal-AI CUDA version match: N/A

Note:

The table above checks the version compatibility of the libraries/tools in the current environment
- PyTorch version mistach: whether the PyTorch version in the current environment is compatible with the PyTorch version used for AOT compilation
- System and PyTorch CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version required by PyTorch
- System and Colossal-AI CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version used for AOT compilation

Mar 03 '23 02:03 zp2459
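
The note about AOT compilation in the report can be made concrete. The sketch below builds a pip invocation with CUDA_EXT=1 in the environment, which according to the report is what triggers building the CUDA kernels at install time. The helper name build_install_invocation, the --run flag, and the default of installing the colossalai package by name are assumptions; for a source checkout the target would be a local path instead.

```python
import os
import subprocess
import sys
from typing import Dict, List, Tuple


def build_install_invocation(target: str = "colossalai") -> Tuple[List[str], Dict[str, str]]:
    """Return the pip command and environment for an install with CUDA_EXT=1."""
    env = dict(os.environ)
    env["CUDA_EXT"] = "1"  # variable named in the installation report
    cmd = [sys.executable, "-m", "pip", "install", target]
    return cmd, env


if __name__ == "__main__":
    cmd, env = build_install_invocation()
    print("CUDA_EXT=1", " ".join(cmd))
    # Only run the install when explicitly asked to (hypothetical flag).
    if "--run" in sys.argv:
        subprocess.run(cmd, env=env, check=True)
```

An ahead-of-time build still needs a CUDA toolkit that the build can locate, so with "CUDA version: N/A" it would presumably fail for the same reason as the runtime build.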

Colossal-AI version: 0.2.5
PyTorch version: 1.13.1
CUDA version: 10.1
CUDA version required by PyTorch: 11.7

Note:

The table above checks the versions of the libraries/tools in the current environment
If the CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it

------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: x
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A

Note:

AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment varialbe CUDA_EXT=1 is set
If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime

------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: ×
System and Colossal-AI CUDA version match: N/A

Note:

The table above checks the version compatibility of the libraries/tools in the current environment
- PyTorch version mistach: whether the PyTorch version in the current environment is compatible with the PyTorch version used for AOT compilation
- System and PyTorch CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version required by PyTorch
- System and Colossal-AI CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version used for AOT compilation

Mar 03 '23 03:03 pilipala818
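
The × in this report reflects the gap between the system toolkit (CUDA 10.1) and the version PyTorch was built for (11.7). A minimal sketch of such a comparison follows. The function names and the rule that the major versions must agree are assumptions for illustration, not necessarily the rule colossalai check applies.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '11.7' into (11, 7)."""
    return tuple(int(part) for part in text.strip().split("."))


def cuda_major_matches(system: str, required: str) -> bool:
    """Assumed rule: versions are compatible when the major numbers agree."""
    return parse_version(system)[0] == parse_version(required)[0]


if __name__ == "__main__":
    # Values from the report above: system CUDA 10.1, PyTorch built for 11.7.
    print(cuda_major_matches("10.1", "11.7"))
```

Under this assumed rule the pair from the report does not match, which is consistent with the follow-up comment that a PyTorch build paired with CUDA 10.1 works.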

With CUDA 10.1, torch 1.10.0, and torchvision 0.11.1, it works.

Mar 03 '23 06:03 pilipala818

Did you solve it? Could I get your contact details? This is my first time working with this kind of project, and I have some questions for you.

Mar 03 '23 06:03 zp2459

Check your gcc version and try upgrading gcc.

Mar 03 '23 13:03 joan126

The gcc version is 7.5.0.

Mar 03 '23 13:03 zp2459
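
For anyone asked to report their compiler version in the same way, the sketch below runs gcc --version and extracts the first x.y.z number from its output. The helper name parse_gcc_version is hypothetical, and the thread does not state which gcc version is required, so no minimum is checked here.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_gcc_version(output: str) -> Optional[Tuple[int, ...]]:
    """Extract the first x.y.z version number from `gcc --version` output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


if __name__ == "__main__":
    if shutil.which("gcc") is None:
        print("gcc not found on PATH")
    else:
        result = subprocess.run(
            ["gcc", "--version"], capture_output=True, text=True, check=True
        )
        print(parse_gcc_version(result.stdout))
```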

How about installing colossalai from source?

Mar 04 '23 08:03 joan126

Has there been no follow-up on this?

Mar 07 '23 10:03 Cloopen-ReLiNK

I ran into the same problem.

Mar 07 '23 10:03 Cloopen-ReLiNK

Has this been solved?

Mar 27 '23 08:03 wenzezhang

Hi, sorry for getting to this late.

Would this issue https://github.com/hpcaitech/ColossalAI/issues/3496 be of any help?

Apr 13 '23 07:04 JThh

We have updated a lot. Please check the latest code. This issue was closed due to inactivity. Thanks.

May 05 '23 04:05 binmakeswell


[BUG]: failed to run ..ColossalAI/examples/language/gpt/gemini
