ColossalAI
[BUG]: No module named 'colossalai._C.cpu_adam'
🐛 Describe the bug
When I run the command

```shell
torchrun --standalone --nproc_per_node=2 train_prompts.py prompts.csv --strategy colossalai_gemini
```

in the `ColossalAI/applications/ChatGPT/examples` directory, an error occurs:

```
ModuleNotFoundError: No module named 'colossalai._C.cpu_adam'
```
Environment

```
  File "train_prompts.py", line 115, in <module>
    main(args)
  File "train_prompts.py", line 50, in main
    actor_optim = HybridAdam(actor.parameters(), lr=5e-6)
  File "/path/anaconda3/envs/colossalai/lib/python3.7/site-packages/colossalai/nn/optimizer/hybrid_adam.py", line 82, in __init__
    cpu_optim = CPUAdamBuilder().load()
  File "/path/anaconda3/envs/colossalai/lib/python3.7/site-packages/colossalai/kernel/op_builder/builder.py", line 164, in load
    verbose=verbose)
  File "/path/anaconda3/envs/colossalai/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1296, in load
    keep_intermediates=keep_intermediates)
  File "/path/anaconda3/envs/colossalai/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1534, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/path/anaconda3/envs/colossalai/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1936, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "<frozen importlib._bootstrap>", line 583, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1043, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /path/.cache/colossalai/torch_extensions/torch1.13_cu11.7/cpu_adam.so: cannot open shared object file: No such file or directory
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 128965 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 128966) of binary: /path/anaconda3/envs/colossalai/bin/python3.7
```
You have to install the Colossal CUDA extension:

```shell
CUDA_EXT=1 pip install colossalai
```

Also check out this issue and install apex beforehand.
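One way to check whether the prebuilt kernel actually ended up inside the installed package (a sketch; it assumes `python` on the PATH is the interpreter of the environment where colossalai was installed):

```shell
# If this import fails, the CUDA extension was not prebuilt, and colossalai
# falls back to JIT compilation on first use, which needs a full CUDA
# toolkit on the machine (including headers such as cublas_v2.h).
python -c "import colossalai._C.cpu_adam" 2>/dev/null \
  && echo "prebuilt cpu_adam kernel found" \
  || echo "prebuilt cpu_adam kernel missing"
```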
> You have to install the Colossal CUDA extension.
>
> `CUDA_EXT=1 pip install colossalai`
I tried this, but it still shows the same error: `ModuleNotFoundError: No module named 'colossalai._C.cpu_adam'`. It seems to be a problem with the colossalai package installation. But I can't open https://release.colossalai.org now.
Hi, can you provide your colossalai version?
> Hi, can you provide your colossalai version?
colossalai==0.2.5
I see. The error seems to be implied by the line `ImportError: /path/.cache/colossalai/torch_extensions/torch1.13_cu11.7/cpu_adam.so: cannot open shared object file: No such file or directory`. There are a few things to check:

- Which operating system are you using?
- It seems we cannot resolve the path `~` to be your home directory. Can you provide the output of `os.path.expanduser('~')`?

The error suggests that Python resolves the `~` directory to `/path`. Does this path actually exist? Normally `~` will be resolved to `/home/<user-name>` on a Linux machine.
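For anyone hitting the same ImportError, a quick sketch of this check: the JIT-built kernels are cached under `~/.cache/colossalai`, so `~` must resolve to a real, writable directory.

```python
import os

# cpu_adam.so is written under ~/.cache/colossalai/torch_extensions/.
# If '~' does not expand to an existing, writable home directory, the
# shared object can never be created there, producing the
# "cannot open shared object file" ImportError seen above.
home = os.path.expanduser("~")
print("home resolves to:", home)
print("exists:", os.path.isdir(home))
print("writable:", os.access(home, os.W_OK))
```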
At first, I downloaded colossalai with `pip install colossalai==0.1.12+torch1.12cu11.3 -f https://release.colossalai.org`, and no error was displayed. I want to try this installation method again, but the link no longer opens.
Same issue with cpu_adam, and `pip install colossalai==0.1.12+torch1.12cu11.3 -f https://release.colossalai.org` is a 404 now.
> Same issue with cpu_adam, and `pip install colossalai==0.1.12+torch1.12cu11.3 -f https://release.colossalai.org` is a 404 now.

@sourasis you can try installing with `pip install colossalai` directly.
> At first, I downloaded colossalai with `pip install colossalai==0.1.12+torch1.12cu11.3 -f https://release.colossalai.org`, and no error was displayed. I want to try this installation method again, but the link no longer opens.

@lht947590837 this release has been removed from our own private pip source; we have moved to PyPI, so you can use `pip install colossalai` instead of `pip install colossalai==0.1.12+torch1.12cu11.3 -f https://release.colossalai.org`.
> > Same issue with cpu_adam, and `pip install colossalai==0.1.12+torch1.12cu11.3 -f https://release.colossalai.org` is a 404 now.
>
> @sourasis you can try installing with `pip install colossalai` directly.

@FrankLeeeee Hi, yes I did that and it worked, thanks. But the cpu_adam error persists when I try to run Gemini in the GPT example. Please suggest a solution for this.
> > @sourasis you can try installing with `pip install colossalai` directly.
>
> @FrankLeeeee Hi, yes I did that and it worked, thanks. But the cpu_adam error persists when I try to run Gemini in the GPT example. Please suggest a solution for this.

Hi @sourasis, several users have met the same issue. May I ask which operating system you are using? I suspect this is a Windows issue.
I met the same error with colossalai 0.2.5, installed with `pip install colossalai` directly. The Pytorch_DDP distplan is OK; with CAI_Gemini I met that error.
Hi @joan126, noted. I am looking into this issue. Can you provide more information like PyTorch version, CUDA version, operating system and version, and Python version?
> Hi @sourasis, several users have met the same issue. May I ask which operating system you are using? I suspect this is a Windows issue.
@FrankLeeeee I am on Ubuntu 22.04. Here is the list of issues I can see in the trace:

```
In file included from /home/sourasis/anaconda3/envs/venv_name/lib/python3.10/site-packages/colossalai/kernel/cuda_native/csrc/cpu_adam.cpp:22:
/home/sourasis/anaconda3/envs/venv_name/lib/python3.10/site-packages/colossalai/kernel/cuda_native/csrc/cpu_adam.h:24:10: fatal error: cublas_v2.h: No such file or directory
   24 | #include <cublas_v2.h>
compilation terminated.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/home/sourasis/anaconda3/envs/venv_name/lib/python3.10/site-packages/colossalai/kernel/op_builder/builder.py", line 135, in load
    op_module = self.import_op()
  File "/home/sourasis/anaconda3/envs/venv_name/lib/python3.10/site-packages/colossalai/kernel/op_builder/builder.py", line 118, in import_op
    return importlib.import_module(self.prebuilt_import_path)
  File "/home/sourasis/anaconda3/envs/venv_name/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "
============================================================
./train_gpt_demo.py FAILED
```

pytorch 1.12.0, cudatoolkit 11.3.1, colossalai 0.2.5
> Hi @joan126, noted. I am looking into this issue. Can you provide more information like PyTorch version, CUDA version, operating system and version, and Python version?

pytorch: 1.12.1, cuda: 11.3, os: Linux version 5.4.217-1-1.e17.elrepo.x86_64, python: 3.7.12
@sourasis your problem is related to your CUDA environment, as it cannot find cublas_v2.h. Can you check if CUDA is available on your system and CUDA_HOME is set? Normally, you can find CUDA in /usr/local.
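A quick diagnostic for this check, as a sketch (the search locations are assumptions; adjust them for your layout):

```python
import glob
import os

def find_cublas_header():
    """Look for cublas_v2.h under CUDA_HOME and the usual /usr/local installs."""
    candidates = []
    cuda_home = os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH")
    if cuda_home:
        candidates.append(os.path.join(cuda_home, "include", "cublas_v2.h"))
    # Typical system-wide CUDA toolkit locations.
    candidates += glob.glob("/usr/local/cuda*/include/cublas_v2.h")
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None  # header missing: the JIT build of cpu_adam will fail

print(find_cublas_header())
```

If this prints `None`, the compiler has no way to see the cuBLAS headers and the `fatal error: cublas_v2.h` above is expected.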
Upgrading gcc from 4.8.5 to 9.5.0 fixed this issue, but I met another issue: `AttributeError: module 'math' has no attribute 'prod'`.
> Upgrading gcc from 4.8.5 to 9.5.0 fixed this issue, but I met another issue: `AttributeError: module 'math' has no attribute 'prod'`.

Great. The math issue is due to the Python version; try to use Python 3.8 instead with `conda create -n <env-name> python=3.8`.
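`math.prod` was only added in Python 3.8, which is why a Python 3.7 interpreter raises that AttributeError. A portable equivalent (a sketch, not the library's fix) for older interpreters:

```python
import operator
from functools import reduce

def prod(iterable, start=1):
    """Equivalent of math.prod (added in Python 3.8) for older interpreters."""
    return reduce(operator.mul, iterable, start)

print(prod([2, 3, 4]))  # → 24
```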
great...
> > Upgrading gcc from 4.8.5 to 9.5.0 fixed this issue, but I met another issue: `AttributeError: module 'math' has no attribute 'prod'`.
>
> Great. The math issue is due to the Python version; try to use Python 3.8 instead with `conda create -n <env-name> python=3.8`.

@joan126 #2837 has fixed the math issue. Thanks.
> Upgrading gcc from 4.8.5 to 9.5.0 fixed this issue, but I met another issue: `AttributeError: module 'math' has no attribute 'prod'`.

My gcc version is 9.4.0 and I meet the same problem. Do I have to update gcc to 9.5.0?
> > Upgrading gcc from 4.8.5 to 9.5.0 fixed this issue, but I met another issue: `AttributeError: module 'math' has no attribute 'prod'`.
>
> Great. The math issue is due to the Python version; try to use Python 3.8 instead with `conda create -n <env-name> python=3.8`.

Could you please explain this a bit more? Is there a big difference?
> > Great. The math issue is due to the Python version; try to use Python 3.8 instead with `conda create -n <env-name> python=3.8`.
>
> Could you please explain this a bit more? Is there a big difference?

You can check it out in #2837 @secsilm
> > Could you please explain this a bit more? Is there a big difference?
>
> You can check it out in #2837 @secsilm

Thanks for your quick reply. OK, I got it.
When running examples/gpt/titans/run.sh, I met a new error, "no module named colossalai._C.layernorm", with GCC 9.5.0. Could you help look into it? @FrankLeeeee
> @sourasis your problem is related to your CUDA environment, as it cannot find cublas_v2.h. Can you check if CUDA is available on your system and CUDA_HOME is set? Normally, you can find CUDA in /usr/local.
@FrankLeeeee The issue was that when I tried to install NVIDIA driver 525, it removed CUDA. To counter that, I used conda to install CUDA, but this installation didn't bring cuBLAS along with it. My CUDA_HOME is set to this conda installation of CUDA.

I have now installed nvidia-cublas-cu11 separately in the conda environment, and I can see that the cublas_v2.h file is there, but in a separate directory from CUDA_HOME. If I could add this directory to the `-isystem` parameter in build.ninja (/.cache/colossalai/torch_extensions/torch1.12_cu11.3/build.ninja), it would work. But I can't find the file which creates build.ninja at runtime.

Can you please tell me which file creates build.ninja at runtime, so that I can plug the cublas include directory into it as well?
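For what it's worth, the traceback earlier in the thread shows that the JIT build is driven by `torch.utils.cpp_extension.load`, called from colossalai's `kernel/op_builder/builder.py`, so build.ninja is regenerated on each build rather than being a file you can patch by hand. Instead of editing it, the compiler can be pointed at the extra header directory via the `CPATH` environment variable, which gcc treats as additional include directories. A sketch, where the header path is an assumption based on where the nvidia-cublas-cu11 wheel usually unpacks; adjust the Python version and prefix to your environment:

```shell
# Hypothetical location of the headers shipped by the nvidia-cublas-cu11
# wheel; verify with: find "$CONDA_PREFIX" -name cublas_v2.h
CUBLAS_INCLUDE="$CONDA_PREFIX/lib/python3.10/site-packages/nvidia/cublas/include"
# gcc searches CPATH entries like extra -I directories, so the JIT build
# of cpu_adam can find cublas_v2.h without touching build.ninja.
export CPATH="$CUBLAS_INCLUDE${CPATH:+:$CPATH}"
echo "$CPATH"
```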