[BUG]: run ColossalAI-Examples/image/resnet failed

Open better629 opened this issue 2 years ago • 2 comments

🐛 Describe the bug

I ran the simplest example, ColossalAI-Examples/image/resnet. After launching it with colossalai run, the only output is "Error: failed to run torchrun", with no other messages. How can I find out what went wrong?

Environment

Colossal-AI version: 0.2.0

PyTorch Version: 1.10.0
PyTorch Version required by Colossal-AI: 1.10
PyTorch version match: ✓

System CUDA Version: 11.8
CUDA Version required by PyTorch: 11.1
CUDA Version required by Colossal-AI: 11.8
CUDA Version Match: x

CUDA Extension: ✓

The conda environment has CUDA toolkit 11.1 installed, but the host machine's CUDA toolkit is 11.8.
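
The version mismatch described above can be checked programmatically. The following is a minimal sketch and not part of Colossal-AI: the helper names `parse_nvcc_release`, `system_cuda_version`, and `versions_match` are hypothetical, and it assumes `nvcc --version` prints a line containing "release X.Y". Which nvcc is found depends on PATH, so inside an activated conda environment it may report the conda toolkit rather than the host one.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract the 'major.minor' release from `nvcc --version` text, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def system_cuda_version():
    """Return the release of the nvcc found on PATH, or None if nvcc is absent."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


def versions_match(system_version, torch_cuda_version):
    """Compare two 'major.minor' strings; a missing version never matches."""
    return system_version is not None and system_version == torch_cuda_version
```

With PyTorch installed, `versions_match(system_cuda_version(), torch.version.cuda)` compares the toolkit on PATH against the CUDA version PyTorch was built with. For the values in this report, `versions_match("11.8", "11.1")` is False.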

better629, Jan 07 '23 05:01

Hey, ColossalAI-Examples has been deprecated. You can refer to the examples in ColossalAI/examples. There is no resnet example there right now, but you can build one on your own. We look forward to your contribution.

feifeibear, Jan 07 '23 11:01

I get the same error with ColossalAI/examples/tutorial/hybrid_parallel.

Log:

Using /root/.cache/torch_extensions/py39_cu111 as PyTorch extensions root...

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (c++) is not compatible with the compiler Pytorch was
built with for this platform, which is g++ on linux. Please
use g++ to to compile your extension. Alternatively, you may
compile PyTorch from source using c++, and then you can also use
c++ to compile your extension.

See https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md for help
with compiling PyTorch from source.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

                              !! WARNING !!
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu111/moe/build.ninja...
Building extension module moe...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module moe...
Time to load moe op: 0.15326189994812012 seconds
Error: failed to run torchrun --nproc_per_node=2 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train.py --config config.py on 127.0.0.1

How can I get more detailed error messages?
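
One way to look for more detail, as a sketch: run the same torchrun command directly and capture its exit code, stdout, and stderr. The command is copied from the error line in the log above. `run_and_capture` and `TORCHRUN_CMD` are hypothetical names, and it is an assumption that bypassing the launcher will reveal a traceback from train.py.

```python
import shlex
import subprocess
import sys

# Copied verbatim from the "failed to run" line in the log above.
TORCHRUN_CMD = shlex.split(
    "torchrun --nproc_per_node=2 --nnodes=1 --node_rank=0 "
    "--rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 "
    "--rdzv_id=colossalai-default-job train.py --config config.py"
)


def run_and_capture(command):
    """Run a command and return (exit code, stdout, stderr) as text."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


def report(command):
    """Run a command and print everything it produced, stderr last."""
    code, out, err = run_and_capture(command)
    print(f"exit code: {code}")
    print(out)
    print(err, file=sys.stderr)
    return code
```

Calling `report(TORCHRUN_CMD)` from the directory that contains train.py and config.py prints whatever the worker processes wrote, which the launcher's one-line error does not show.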

better629, Jan 09 '23 03:01

We have made many updates since then. This issue was closed due to inactivity. Thanks. The current examples are at https://github.com/hpcaitech/ColossalAI/tree/main/examples

binmakeswell, Apr 14 '23 08:04