[BUG]: failed to run ..ColossalAI/examples/language/gpt/gemini
🐛 Describe the bug
I set up the environment according to the README.md, but the example fails to run. The error is as follows:
No pre-built kernel is found, build and load the cpu_adam kernel during runtime now
Traceback (most recent call last):
File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 135, in load
op_module = self.import_op()
File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 118, in import_op
return importlib.import_module(self.prebuilt_import_path)
File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./train_bert_demo.py", line 332, in <module>
main()
File "./train_bert_demo.py", line 231, in main
optimizer = HybridAdam(model.parameters(), lr=1e-3)
File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/colossalai/nn/optimizer/hybrid_adam.py", line 82, in __init__
cpu_optim = CPUAdamBuilder().load()
File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 159, in load
extra_include_paths=self.strip_empty_entries(self.include_dirs()),
File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/cpu_adam.py", line 25, in include_dirs
self.get_cuda_home_include()
File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 71, in get_cuda_home_include
raise RuntimeError("CUDA_HOME is None, please set CUDA_HOME to compile C++/CUDA kernels in ColossalAI.")
RuntimeError: CUDA_HOME is None, please set CUDA_HOME to compile C++/CUDA kernels in ColossalAI.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 37184) of binary: /home/panz/anaconda3/envs/chatgpt/bin/python
Traceback (most recent call last):
File "/home/panz/anaconda3/envs/chatgpt/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/panz/anaconda3/envs/chatgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./train_bert_demo.py FAILED
Failures: <NO_OTHER_FAILURES>
Environment
torch 1.13.1, transformers 4.26.1, colossalai 0.2.5
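The root cause of the traceback is that `CUDA_HOME` is unset, so the runtime build of the cpu_adam kernel cannot locate the CUDA toolkit. A minimal sketch of setting it before launching — `/usr/local/cuda` is only a common default, substitute your actual toolkit path:

```shell
# Point ColossalAI's kernel builder at the CUDA toolkit.
# /usr/local/cuda is an assumed install location; adjust for your system.
export CUDA_HOME=/usr/local/cuda
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"

# Sanity check before relaunching torchrun:
echo "CUDA_HOME=$CUDA_HOME"
```

With these exports in place, `nvcc` should be resolvable and the runtime compilation of cpu_adam can proceed.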
When running the ChatGPT example, it hangs at "No pre-built kernel is found, build and load the cpu_adam kernel during runtime now". GPU memory shows the model has been loaded, but there is no further progress. Could this be related to our intranet having no internet access?
Could you share the output of `colossalai check -i`?
Installation Report
------------ Environment ------------
Colossal-AI version: 0.2.5
PyTorch version: 1.12.0
CUDA version: N/A
CUDA version required by PyTorch: 11.3

Note:
- The table above checks the versions of the libraries/tools in the current environment
- If the CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it

------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: x
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A

Note:
- AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment variable CUDA_EXT=1 is set
- If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime

------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: N/A
System and Colossal-AI CUDA version match: N/A

Note:
- The table above checks the version compatibility of the libraries/tools in the current environment
- PyTorch version mismatch: whether the PyTorch version in the current environment is compatible with the PyTorch version used for AOT compilation
- System and PyTorch CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version required by PyTorch
- System and Colossal-AI CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version used for AOT compilation
------------ Environment ------------
Colossal-AI version: 0.2.5
PyTorch version: 1.13.1
CUDA version: 10.1
CUDA version required by PyTorch: 11.7

Note:
- The table above checks the versions of the libraries/tools in the current environment
- If the CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it

------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: x
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A

Note:
- AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment variable CUDA_EXT=1 is set
- If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime

------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: ×
System and Colossal-AI CUDA version match: N/A

Note:
- The table above checks the version compatibility of the libraries/tools in the current environment
- PyTorch version mismatch: whether the PyTorch version in the current environment is compatible with the PyTorch version used for AOT compilation
- System and PyTorch CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version required by PyTorch
- System and Colossal-AI CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version used for AOT compilation
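The report above flags a mismatch because the system has CUDA 10.1 while the installed PyTorch wheel was built for CUDA 11.7. A minimal sketch of the kind of check involved — `cuda_versions_match` is a hypothetical helper, not ColossalAI's actual implementation, and it uses the simplifying assumption that matching major versions is sufficient:

```python
def cuda_versions_match(system_cuda: str, torch_cuda: str) -> bool:
    """Return True if the system CUDA toolkit is plausibly compatible
    with the CUDA version a PyTorch wheel was built against.

    Simplified heuristic: CUDA minor releases within one major version
    are generally interoperable, so only compare major versions.
    """
    return system_cuda.split(".")[0] == torch_cuda.split(".")[0]


# The environment reported in this thread: 10.x vs 11.x -> mismatch.
print(cuda_versions_match("10.1", "11.7"))  # False
# A matching pair, e.g. system CUDA 11.3 with a cu117 wheel:
print(cuda_versions_match("11.3", "11.7"))  # True
```

This is why the fix is either upgrading the system CUDA toolkit or installing a PyTorch build that targets CUDA 10.x.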
cuda 10.1, torch 1.10.0, torchvision 0.11.1 — it works.
Have you solved it? Could I get your contact information? This is my first time working on this kind of project and I have some questions for you.
Check your gcc version and try upgrading it.
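A quick way to check which compiler the runtime kernel build will use (a sketch; package names and paths vary by distro):

```shell
# Print the active C/C++ compiler versions; the JIT build of the
# cpu_adam extension uses whatever toolchain is first on PATH.
if command -v gcc >/dev/null 2>&1; then
    gcc --version | head -n 1
    g++ --version | head -n 1
else
    echo "gcc not found on PATH"
fi
```

If the version is too old for the extension's C++ requirements, installing a newer toolchain (for example via your distro's package manager or conda-forge) and putting it first on PATH is the usual remedy.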
My gcc version is 7.5.0.
Try installing colossalai from source.
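For reference, a source install roughly follows the repository's documented flow; the `CUDA_EXT=1` flag (used by ColossalAI releases of this era) pre-builds the CUDA kernels at install time so nothing needs to be compiled at runtime. This assumes `CUDA_HOME` is already set:

```shell
# Clone and install ColossalAI from source, building kernels ahead of time.
git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI
CUDA_EXT=1 pip install .
```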
Is there any follow-up on this?
I encountered the same problem.
Has this been solved?
Hi, sorry for getting to this late.
Would this issue be of any help: https://github.com/hpcaitech/ColossalAI/issues/3496?
We have updated a lot since then; please check the latest code. This issue was closed due to inactivity. Thanks.