
[BUG]: Running examples/language/gpt/gemin hangs on a single machine with multiple nodes

Open Cloopen-ReLiNK opened this issue 2 years ago • 11 comments

🐛 Describe the bug

When running examples/language/gpt/gemin on a single machine with multiple nodes, execution hangs after this message: "No pre-built kernel is found, build and load the cpu_adam kernel during runtime now"

Environment

No response

Cloopen-ReLiNK avatar Mar 07 '23 09:03 Cloopen-ReLiNK

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


Title: [BUG]: Executing examples/language/gpt/gemin on a single machine with multiple nodes gets stuck

Issues-translate-bot avatar Mar 07 '23 09:03 Issues-translate-bot

colossalai check -i

Installation Report

------------ Environment ------------
Colossal-AI version: 0.2.5
PyTorch version: 1.11.0
CUDA version: 11.3
CUDA version required by PyTorch: 11.3

Note:

  1. The table above checks the versions of the libraries/tools in the current environment
  2. If the CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it

------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: x
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A

Note:

  1. AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment variable CUDA_EXT=1 is set
  2. If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime

------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: ✓
System and Colossal-AI CUDA version match: N/A

Note:

  1. The table above checks the version compatibility of the libraries/tools in the current environment
    • PyTorch version match: whether the PyTorch version in the current environment is compatible with the PyTorch version used for AOT compilation
    • System and PyTorch CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version required by PyTorch
    • System and Colossal-AI CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version used for AOT compilation

Cloopen-ReLiNK avatar Mar 07 '23 09:03 Cloopen-ReLiNK
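The report above already shows why the kernel is built at runtime: the "Found AOT CUDA Extension" field is "x". As a minimal sketch, the field lines of such a report can be parsed into a dictionary to check this programmatically. The helper names `parse_check_report` and `aot_extension_found` are hypothetical and not part of ColossalAI; the sample text is copied from the report in this thread, and treating "x" as "not found" is an assumption based on that output.

```python
def parse_check_report(text):
    """Collect 'key: value' pairs from a `colossalai check -i` style report."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section rules such as '------------ Environment ------------'.
        if not line or line.startswith("-"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def aot_extension_found(fields):
    """Assumption: the report prints 'x' when no AOT-compiled extension exists."""
    return fields.get("Found AOT CUDA Extension", "x") != "x"


# Field lines copied from the report posted in this thread.
REPORT = """\
------------ Environment ------------
Colossal-AI version: 0.2.5
PyTorch version: 1.11.0
CUDA version: 11.3
CUDA version required by PyTorch: 11.3
------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: x
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A
"""

fields = parse_check_report(REPORT)
if not aot_extension_found(fields):
    print("No AOT extension found; kernels such as cpu_adam will be built at runtime.")
```

For the report in this thread the check fails, which matches the "No pre-built kernel is found" message seen before the hang.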

It runs fine on a single machine with a single GPU.

Cloopen-ReLiNK avatar Mar 07 '23 09:03 Cloopen-ReLiNK

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


It can run on a single machine with a single card

Issues-translate-bot avatar Mar 07 '23 09:03 Issues-translate-bot

=========================================================================================
No pre-built kernel is found, build and load the cpu_adam kernel during runtime now

Cloopen-ReLiNK avatar Mar 07 '23 09:03 Cloopen-ReLiNK
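One possible cause worth ruling out when a runtime kernel build hangs at this message is a leftover lock file from an earlier, interrupted build. This is not confirmed as the cause in this thread. PyTorch's JIT extension builder places a file named `lock` in the build directory, and other processes wait while it exists. The sketch below lists old `lock` files under a cache directory. It assumes the PyTorch default location `~/.cache/torch_extensions` (or `TORCH_EXTENSIONS_DIR` if set); ColossalAI may use a different cache directory, so the path may need adjusting. The function name `find_stale_locks` and the 10-minute threshold are hypothetical choices.

```python
import os
import time
from pathlib import Path


def find_stale_locks(cache_dir, max_age_seconds=600.0, now=None):
    """Return files named 'lock' under cache_dir that are older than max_age_seconds."""
    root = Path(cache_dir).expanduser()
    if not root.is_dir():
        return []
    now = time.time() if now is None else now
    stale = []
    for path in root.rglob("lock"):
        # A lock that has not been touched for a long time likely belongs to a dead build.
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
            stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    # Assumption: PyTorch's default extension cache; adjust for your setup.
    cache_dir = os.environ.get("TORCH_EXTENSIONS_DIR", "~/.cache/torch_extensions")
    for lock in find_stale_locks(cache_dir):
        print(f"possible stale build lock: {lock}")
```

The script only reports candidates. Delete a lock file only after confirming that no build process is still running.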

Install colossalai:

git clone https://github.com/hpcaitech/ColossalAI.git \
  && cd ./ColossalAI \
  && CUDA_EXT=1 pip install -v --no-cache-dir .

Installing from source resolves this problem.

joan126 avatar Mar 07 '23 10:03 joan126

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


install colossalai

git clone https://github.com/hpcaitech/ColossalAI.git \
  && cd ./ColossalAI \
  && CUDA_EXT=1 pip install -v --no-cache-dir .

Installing from source can solve this problem

Issues-translate-bot avatar Mar 07 '23 10:03 Issues-translate-bot

I did build and install it from source.

Cloopen-ReLiNK avatar Mar 08 '23 03:03 Cloopen-ReLiNK

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


I did compile and install it from source

Issues-translate-bot avatar Mar 08 '23 03:03 Issues-translate-bot

CUDA_EXT=1 pip install -v --no-cache-dir .

CUDA_EXT=1 pip install -v --no-cache-dir .

Did you use this command?

joan126 avatar Mar 08 '23 03:03 joan126

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


CUDA_EXT=1 pip install -v --no-cache-dir .

CUDA_EXT=1 pip install -v --no-cache-dir .

Did you use this command?

Issues-translate-bot avatar Mar 08 '23 03:03 Issues-translate-bot

I got the same problem

alibabadoufu avatar Mar 30 '23 15:03 alibabadoufu

Hi, sorry for getting to this late.

Would issue #3496 be of any help?

JThh avatar Apr 13 '23 07:04 JThh

We have made many updates since this was reported. Please check the latest code. This issue was closed due to inactivity. Thanks.

binmakeswell avatar May 05 '23 04:05 binmakeswell