[BUG]: Running ColossalAI-Examples/image/resnet fails
🐛 Describe the bug
I ran the simplest example, ColossalAI-Examples/image/resnet. After launching it with colossalai run, the only output is "Error: failed to run torchrun", with no other messages. How can I find the underlying cause?
Environment
Colossal-AI version: 0.2.0
PyTorch Version: 1.10.0
PyTorch Version required by Colossal-AI: 1.10
PyTorch version match: ✓
System CUDA Version: 11.8
CUDA Version required by PyTorch: 11.1
CUDA Version required by Colossal-AI: 11.8
CUDA Version Match: x
CUDA Extension: ✓
The conda environment has CUDA toolkit 11.1 installed, but the host machine has CUDA toolkit 11.8.
Hey, ColossalAI-Examples has been deprecated. You can refer to the examples in ColossalAI/examples instead. Although there is no ResNet example right now, you can build one yourself. We look forward to your contribution.
I get the same error with ColossalAI/examples/tutorial/hybrid_parallel.
log
Using /root/.cache/torch_extensions/py39_cu111 as PyTorch extensions root...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (c++) is not compatible with the compiler Pytorch was
built with for this platform, which is g++ on linux. Please
use g++ to to compile your extension. Alternatively, you may
compile PyTorch from source using c++, and then you can also use
c++ to compile your extension.
See https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md for help
with compiling PyTorch from source.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! WARNING !!
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu111/moe/build.ninja...
Building extension module moe...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module moe...
Time to load moe op: 0.15326189994812012 seconds
Error: failed to run torchrun --nproc_per_node=2 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train.py --config config.py on 127.0.0.1
How can I get more detailed error messages?
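One thing that may help here is to bypass the launcher and run the exact torchrun command from the last log line directly, so that whatever the worker processes print to stderr is not hidden. Below is a minimal Python sketch of that idea, not an official Colossal-AI tool. The helper name run_and_capture is hypothetical, the command list is copied from the log above, and the two environment variables in the usage note (NCCL_DEBUG and TORCH_DISTRIBUTED_DEBUG) are optional extras that only add output if the failure is in the distributed setup.

```python
import os
import subprocess
import sys

# Command copied from the "failed to run torchrun" line in the log above.
TORCHRUN_CMD = [
    "torchrun", "--nproc_per_node=2", "--nnodes=1", "--node_rank=0",
    "--rdzv_backend=c10d", "--rdzv_endpoint=127.0.0.1:29500",
    "--rdzv_id=colossalai-default-job", "train.py", "--config", "config.py",
]


def run_and_capture(cmd, extra_env=None):
    """Run cmd, merge stderr into stdout, and return (exit_code, output).

    Hypothetical helper for illustration; it only wraps subprocess.run.
    """
    env = dict(os.environ)
    env.update(extra_env or {})  # optional debug variables for the child
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep tracebacks in the same stream
            text=True,
            env=env,
        )
    except FileNotFoundError as exc:
        # The executable itself is missing from PATH (e.g. wrong conda env).
        return 127, f"executable not found: {exc}"
    return proc.returncode, proc.stdout
```

From the example directory, calling run_and_capture(TORCHRUN_CMD, {"NCCL_DEBUG": "INFO", "TORCH_DISTRIBUTED_DEBUG": "DETAIL"}) and printing both returned values should show the Python traceback from train.py, if there is one. Running the same torchrun command by hand in a shell is equivalent; the wrapper only makes the exit code and the combined output easy to save and paste into the issue.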
A lot has changed since this was reported, so please try the current examples: https://github.com/hpcaitech/ColossalAI/tree/main/examples. This issue was closed due to inactivity. Thanks.