[BUG]: Executing examples/language/gpt/gemin on a single machine with multiple nodes gets stuck
🐛 Describe the bug
When running examples/language/gpt/gemin on a single machine with multiple nodes, execution hangs after this message: "No pre-built kernel is found, build and load the cpu_adam kernel during runtime now"
Environment
No response
colossalai check -i
Installation Report
------------ Environment ------------
Colossal-AI version: 0.2.5
PyTorch version: 1.11.0
CUDA version: 11.3
CUDA version required by PyTorch: 11.3
Note:
- The table above checks the versions of the libraries/tools in the current environment
- If the CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it
------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: x
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A
Note:
- AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment variable CUDA_EXT=1 is set
- If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime
------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: ✓
System and Colossal-AI CUDA version match: N/A
Note:
- The table above checks the version compatibility of the libraries/tools in the current environment
- PyTorch version match: whether the PyTorch version in the current environment is compatible with the PyTorch version used for AOT compilation
- System and PyTorch CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version required by PyTorch
- System and Colossal-AI CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version used for AOT compilation
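The report above shows "Found AOT CUDA Extension: x", which is consistent with the "No pre-built kernel is found" message: the kernels were not compiled at install time, so they are built at runtime. As a minimal sketch, the helper below reads such a report on stdin and prints whether it indicates AOT or runtime (JIT) kernels. The function name aot_status is hypothetical and not part of the colossalai CLI, and treating any marker other than "x" as "found" is an assumption based on the ✓ used elsewhere in the report.

```shell
# Hypothetical helper (not part of colossalai): classify a
# `colossalai check -i` report read from stdin.
aot_status() {
    # Keep only the line that carries the AOT extension status.
    # `|| true` keeps the function usable under `set -e` when nothing matches.
    line=$(grep 'Found AOT CUDA Extension:' || true)
    case "$line" in
        '')
            # The status line is missing from the input.
            echo "unknown" ;;
        *'Found AOT CUDA Extension: x'*)
            # "x" marks a missing extension: kernels are built at runtime.
            echo "jit" ;;
        *)
            # Assumption: any other marker (e.g. a check mark) means found.
            echo "aot" ;;
    esac
}
```

It could be used as `colossalai check -i | aot_status`. A result of "jit" after installing with CUDA_EXT=1 would suggest the extensions were not actually built during installation.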
It runs fine on a single machine with a single GPU.
========================================================================================= No pre-built kernel is found, build and load the cpu_adam kernel during runtime now
Install colossalai from source:
git clone https://github.com/hpcaitech/ColossalAI.git \
&& cd ./ColossalAI \
&& CUDA_EXT=1 pip install -v --no-cache-dir .
Installing from source can solve this problem.
I did compile and install from source.
CUDA_EXT=1 pip install -v --no-cache-dir .
Did you use this command?
I got the same problem
Hi, sorry for getting to this late.
Would issue #3496 be of any help?
We have updated a lot. Please check the latest code. This issue was closed due to inactivity. Thanks.