[BUG]: build and load the fused_optim error: /usr/bin/ld: cannot find -lcudart: No such file or directory

Open verigle opened this issue 2 years ago • 4 comments

🐛 Describe the bug

The error occurs when running applications/ChatGPT/examples/train_dummy.py:

=========================================================================================
No pre-built kernel is found, build and load the fused_optim kernel during runtime now
=========================================================================================
Detected CUDA files, patching ldflags
Emitting ninja build file /home/verigle/.cache/colossalai/torch_extensions/torch1.13_cu11.7/build.ninja...
Building extension module fused_optim...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] c++ colossal_C_frontend.o multi_tensor_sgd_kernel.cuda.o multi_tensor_scale_kernel.cuda.o multi_tensor_adam.cuda.o multi_tensor_l2norm_kernel.cuda.o multi_tensor_lamb.cuda.o -shared -L/home/verigle/miniconda3/envs/colossalai/lib/python3.8/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/home/verigle/miniconda3/envs/colossalai/lib64 -lcudart -o fused_optim.so
FAILED: fused_optim.so 
c++ colossal_C_frontend.o multi_tensor_sgd_kernel.cuda.o multi_tensor_scale_kernel.cuda.o multi_tensor_adam.cuda.o multi_tensor_l2norm_kernel.cuda.o multi_tensor_lamb.cuda.o -shared -L/home/verigle/miniconda3/envs/colossalai/lib/python3.8/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/home/verigle/miniconda3/envs/colossalai/lib64 -lcudart -o fused_optim.so
/usr/bin/ld: cannot find -lcudart: No such file or directory

Environment

conda env: python = 3.8, cuda = 11.7, pytorch = 1.13
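
The failing link command passes -L directories and -l names explicitly, so one way to narrow down the problem is to check which requested libraries are actually present in those directories. The following is a minimal sketch, not part of ColossalAI: the helper name unresolved_libs is hypothetical, and it only inspects the explicit -L directories, whereas the real linker also searches system default directories and those listed in LIBRARY_PATH.

```python
import shlex
from pathlib import Path


def unresolved_libs(link_command):
    """Return -l names with no lib<name>.so or lib<name>.a in any explicit -L directory.

    Hypothetical diagnostic helper: it ignores the linker's default search
    paths, so a name reported here may still resolve on a real system.
    """
    tokens = shlex.split(link_command)
    search_dirs = [Path(t[2:]) for t in tokens if t.startswith("-L") and len(t) > 2]
    lib_names = [t[2:] for t in tokens if t.startswith("-l") and len(t) > 2]

    missing = []
    for name in lib_names:
        candidates = (
            d / f"lib{name}{ext}" for d in search_dirs for ext in (".so", ".a")
        )
        if not any(c.exists() for c in candidates):
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Placeholder command; paste the real line from the ninja output instead.
    command = "c++ colossal_C_frontend.o -shared -L/path/to/env/lib64 -lcudart -o fused_optim.so"
    print(unresolved_libs(command))
```

If cudart shows up in the result for the command from the log, the conda environment's lib64 directory probably does not contain libcudart.so, and the linker has to be pointed at the CUDA toolkit's lib64 directory by other means.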

verigle avatar Feb 25 '23 16:02 verigle

With export LD_LIBRARY_PATH=/path/to/your/cuda/lib64:${LD_LIBRARY_PATH}, the build still can't find -lcublas, -lcudart, or -lcurand.

But with export LIBRARY_PATH=/path/to/your/cuda/lib64:${LIBRARY_PATH}, it worked for me.

I strongly suggest highlighting in the documentation that the LIBRARY_PATH environment variable should be set, rather than LD_LIBRARY_PATH.
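
The difference matters because GCC consults LIBRARY_PATH when linking, while LD_LIBRARY_PATH is read by the dynamic loader when a program starts, so only the former helps with a "cannot find -l..." error. As a rough sketch for checking a setup before triggering the runtime build (the function name find_link_time_library is made up for illustration, and empty LIBRARY_PATH entries are simply skipped here):

```python
import os
from pathlib import Path


def find_link_time_library(name, env=None):
    """Return the LIBRARY_PATH directories that contain lib<name>.so or lib<name>.a.

    Hypothetical helper: `env` defaults to os.environ and can be replaced
    with a plain dict for testing.
    """
    env = os.environ if env is None else env
    hits = []
    for entry in env.get("LIBRARY_PATH", "").split(os.pathsep):
        if not entry:
            continue  # this sketch skips empty entries
        directory = Path(entry)
        if any((directory / f"lib{name}{ext}").exists() for ext in (".so", ".a")):
            hits.append(str(directory))
    return hits


if __name__ == "__main__":
    # The three libraries named in this comment.
    for lib in ("cudart", "cublas", "curand"):
        dirs = find_link_time_library(lib)
        print(f"-l{lib}:", dirs if dirs else "not found via LIBRARY_PATH")
```

An empty result for cudart suggests that LIBRARY_PATH does not yet include the CUDA lib64 directory.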

verigle avatar Feb 26 '23 06:02 verigle

Hi @verigle, thanks for your contribution! @FrankLeeeee, can we fix it later? The pre-built kernel seems to cause trouble for many users.

binmakeswell avatar Mar 03 '23 08:03 binmakeswell

Good suggestion, I will add such checks next week. Meanwhile, I am working on improving the kernel build in #2886.

FrankLeeeee avatar Mar 03 '23 09:03 FrankLeeeee

sudo ln -s /usr/local/cuda/lib64/libcudart.so /usr/lib/libcudart.so

JingxinLee avatar Mar 07 '23 08:03 JingxinLee

We have updated a lot. This issue was closed due to inactivity. Thanks.

binmakeswell avatar Apr 26 '23 10:04 binmakeswell