
[BUG]: No module named 'colossalai._C.cpu_adam'

Open Yiran-Zhu opened this issue 2 years ago • 33 comments

πŸ› Describe the bug

When I run the command torchrun --standalone --nproc_per_node=2 train_prompts.py prompts.csv --strategy colossalai_gemini in the ColossalAI/applications/ChatGPT/examples directory, an error occurs: ModuleNotFoundError: No module named 'colossalai._C.cpu_adam'

Environment

File "train_prompts.py", line 115, in <module> main(args) File "train_prompts.py", line 50, in main actor_optim = HybridAdam(actor.parameters(), lr=5e-6) File "/path/anaconda3/envs/colossalai/lib/python3.7/site-packages/colossalai/nn/optimizer/hybrid_adam.py", line 82, in __init__ cpu_optim = CPUAdamBuilder().load() File "/path/anaconda3/envs/colossalai/lib/python3.7/site-packages/colossalai/kernel/op_builder/builder.py", line 164, in load verbose=verbose) File "/path/anaconda3/envs/colossalai/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1296, in load keep_intermediates=keep_intermediates) File "/path/anaconda3/envs/colossalai/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1534, in _jit_compile return _import_module_from_library(name, build_directory, is_python_module) File "/path/anaconda3/envs/colossalai/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1936, in _import_module_from_library module = importlib.util.module_from_spec(spec) File "<frozen importlib._bootstrap>", line 583, in module_from_spec File "<frozen importlib._bootstrap_external>", line 1043, in create_module File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed ImportError: /path/.cache/colossalai/torch_extensions/torch1.13_cu11.7/cpu_adam.so: cannot open shared object file: No such file or directory WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 128965 closing signal SIGTERM ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 128966) of binary: /path/anaconda3/envs/colossalai/bin/python3.7

Yiran-Zhu avatar Feb 16 '23 03:02 Yiran-Zhu

You have to install the Colossal CUDA extension: CUDA_EXT=1 pip install colossalai
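
After reinstalling, a quick way to check whether the prebuilt kernel is actually there (a minimal sketch; if the import fails, colossalai falls back to JIT-compiling the kernel the first time HybridAdam is used):

    # Succeeds only if the prebuilt cpu_adam extension was built during installation.
    import importlib

    try:
        importlib.import_module("colossalai._C.cpu_adam")
        print("prebuilt cpu_adam kernel found")
    except ImportError:
        print("prebuilt cpu_adam kernel missing; it will be JIT-compiled at runtime")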

JThh avatar Feb 16 '23 03:02 JThh

Also check out this issue and install apex beforehand.

JThh avatar Feb 16 '23 03:02 JThh

You have to install the Colossal CUDA extension: CUDA_EXT=1 pip install colossalai

I tried this, but it still shows the same error: "ModuleNotFoundError: No module named 'colossalai._C.cpu_adam'". It is probably a problem with how the colossalai package was installed. But I can't open https://release.colossalai.org now.

lht947590837 avatar Feb 21 '23 02:02 lht947590837

Hi, can you provide your colossalai version?

FrankLeeeee avatar Feb 21 '23 02:02 FrankLeeeee

colossalai==0.2.5

lht947590837 avatar Feb 21 '23 02:02 lht947590837

I see. The error seems to come from the line ImportError: /path/.cache/colossalai/torch_extensions/torch1.13_cu11.7/cpu_adam.so: cannot open shared object file: No such file or directory. There are a few things to check:

  1. Which operating system are you using?
  2. It seems we cannot resolve the path ~ to your home directory. Can you provide the output of os.path.expanduser('~')? (See the snippet below.)
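
A minimal check, run in the same Python environment:

    # Should print your home directory, e.g. /home/<user-name> on Linux.
    import os
    print(os.path.expanduser('~'))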

FrankLeeeee avatar Feb 21 '23 02:02 FrankLeeeee

The error suggests that Python resolves the ~ directory to /path; does this path actually exist? Normally ~ resolves to /home/<user-name> on a Linux machine.

FrankLeeeee avatar Feb 21 '23 02:02 FrankLeeeee

The first time, I installed colossalai with "pip install colossalai==0.1.12+torch1.12cu11.3 -f https://release.colossalai.org", and no error was displayed. I want to try this installation method again, but the link no longer opens.

lht947590837 avatar Feb 21 '23 02:02 lht947590837

Same issue with cpu_adam, and "pip install colossalai==0.1.12+torch1.12cu11.3 -f https://release.colossalai.org" is a 404 now.

sourasis avatar Feb 22 '23 19:02 sourasis

Same issue with cpu_adam, and "pip install colossalai==0.1.12+torch1.12cu11.3 -f https://release.colossalai.org" is a 404 now.

@sourasis you can try installing with pip install colossalai directly.

FrankLeeeee avatar Feb 23 '23 02:02 FrankLeeeee

The first time, I installed colossalai with "pip install colossalai==0.1.12+torch1.12cu11.3 -f https://release.colossalai.org", and no error was displayed. I want to try this installation method again, but the link no longer opens.

@lht947590837 this release has been removed from our own private pip source; we have moved to PyPI, so you can use pip install colossalai instead of pip install colossalai==0.1.12+torch1.12cu11.3 -f https://release.colossalai.org.

FrankLeeeee avatar Feb 23 '23 02:02 FrankLeeeee

Same issue with cpu_adam, and "pip install colossalai==0.1.12+torch1.12cu11.3 -f https://release.colossalai.org" is a 404 now.

@sourasis you can try installing with pip install colossalai directly.

@FrankLeeeee Hi, yes I did that. That worked, thanks. But the cpu_adam error persists when I try to run Gemini in the GPT example. Please suggest a solution for this.

sourasis avatar Feb 23 '23 04:02 sourasis

Same issue with cpu_adam, and "pip install colossalai==0.1.12+torch1.12cu11.3 -f https://release.colossalai.org" is a 404 now.

@sourasis you can try installing with pip install colossalai directly.

@FrankLeeeee Hi, yes I did that. That worked, thanks. But the cpu_adam error persists when I try to run Gemini in the GPT example. Please suggest a solution for this.

Hi @sourasis, several users have met the same issue. May I ask which operating system you are using? I suspect that this is a Windows issue.

FrankLeeeee avatar Feb 23 '23 05:02 FrankLeeeee

Met the same error. colossalai 0.2.5, installed with pip install colossalai directly. For the Pytorch_DDP distplan it's OK; for CAR_Gemini, I met that error.

joan126 avatar Feb 23 '23 05:02 joan126

Hi @joan126, noted, I am looking into this issue. Can you provide more information such as your PyTorch version, CUDA version, operating system and version, and Python version?
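
For example, something like this quick sketch can collect the relevant versions in one go:

    # Environment report; run inside the environment used for training.
    import platform
    import torch
    import colossalai

    print("python:", platform.python_version())
    print("pytorch:", torch.__version__)
    print("cuda (seen by torch):", torch.version.cuda)
    print("os:", platform.platform())
    print("colossalai:", colossalai.__version__)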

FrankLeeeee avatar Feb 23 '23 05:02 FrankLeeeee

Same issue with cpu_adam, and "pip install colossalai==0.1.12+torch1.12cu11.3 -f https://release.colossalai.org" is a 404 now.

@sourasis you can try installing with pip install colossalai directly.

@FrankLeeeee Hi, yes I did that. That worked, thanks. But the cpu_adam error persists when I try to run Gemini in the GPT example. Please suggest a solution for this.

Hi @sourasis, several users have met the same issue. May I ask which operating system you are using? I suspect that this is a Windows issue.

@FrankLeeeee I am on Ubuntu 22.04. Here is the list of issues I can see in the trace:

In file included from /home/sourasis/anaconda3/envs/venv_name/lib/python3.10/site-packages/colossalai/kernel/cuda_native/csrc/cpu_adam.cpp:22:
/home/sourasis/anaconda3/envs/venv_name/lib/python3.10/site-packages/colossalai/kernel/cuda_native/csrc/cpu_adam.h:24:10: fatal error: cublas_v2.h: No such file or directory
   24 | #include <cublas_v2.h>
compilation terminated.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/home/sourasis/anaconda3/envs/venv_name/lib/python3.10/site-packages/colossalai/kernel/op_builder/builder.py", line 135, in load
    op_module = self.import_op()
  File "/home/sourasis/anaconda3/envs/venv_name/lib/python3.10/site-packages/colossalai/kernel/op_builder/builder.py", line 118, in import_op
    return importlib.import_module(self.prebuilt_import_path)
  File "/home/sourasis/anaconda3/envs/venv_name/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'colossalai._C.cpu_adam'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sourasis/anaconda3/envs/venv_name/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1808, in _run_ninja_build
    subprocess.run(
  File "/home/sourasis/anaconda3/envs/venv_name/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/sourasis/machinelearning/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 353, in <module>
    main()
  File "/home/sourasis/machinelearning/colossal/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 255, in main
    optimizer = HybridAdam(model.parameters(), lr=1e-3)
  File "/home/sourasis/anaconda3/envs/venv_name/lib/python3.10/site-packages/colossalai/nn/optimizer/hybrid_adam.py", line 82, in __init__
    cpu_optim = CPUAdamBuilder().load()
  File "/home/sourasis/anaconda3/envs/venv_name/lib/python3.10/site-packages/colossalai/kernel/op_builder/builder.py", line 157, in load
    op_module = load(name=self.name,
  File "/home/sourasis/anaconda3/envs/venv_name/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1202, in load
    return _jit_compile(
  File "/home/sourasis/anaconda3/envs/venv_name/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1425, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/sourasis/anaconda3/envs/venv_name/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1537, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/home/sourasis/anaconda3/envs/venv_name/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1824, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'cpu_adam'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3610) of binary: /home/sourasis/anaconda3/envs/venv_name/bin/python
Traceback (most recent call last):
  File "/home/sourasis/anaconda3/envs/venv_name/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.0', 'console_scripts', 'torchrun')())
  File "/home/sourasis/anaconda3/envs/venv_name/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/sourasis/anaconda3/envs/venv_name/lib/python3.10/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/sourasis/anaconda3/envs/venv_name/lib/python3.10/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/sourasis/anaconda3/envs/venv_name/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/sourasis/anaconda3/envs/venv_name/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

============================================================
./train_gpt_demo.py FAILED

pytorch 1.12.0, cudatoolkit 11.3.1, colossalai 0.2.5

sourasis avatar Feb 23 '23 05:02 sourasis

Hi @joan126, noted, I am looking into this issue. Can you provide more information such as your PyTorch version, CUDA version, operating system and version, and Python version?

pytorch: 1.12.1, cuda: 11.3, os: Linux version 5.4.217-1-1.e17.elrepo.x86_64, python: 3.7.12

joan126 avatar Feb 23 '23 05:02 joan126

@sourasis your problem is related to your CUDA environment, as it cannot find cublas_v2.h. Can you check if CUDA is available on your system and CUDA_HOME is set? Normally, you can find CUDA in /usr/local.
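
For example, a quick check (a sketch; /usr/local/cuda is just the usual default install location):

    # Check whether CUDA_HOME is set and whether cublas_v2.h is visible under it.
    import os

    cuda_home = os.environ.get("CUDA_HOME") or "/usr/local/cuda"
    header = os.path.join(cuda_home, "include", "cublas_v2.h")
    print("CUDA_HOME:", os.environ.get("CUDA_HOME"))
    print("looking for:", header)
    print("cublas_v2.h present:", os.path.exists(header))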

FrankLeeeee avatar Feb 23 '23 05:02 FrankLeeeee

Upgrading the gcc version from 4.8.5 to 9.5.0 fixed this issue, but I met another issue: "AttributeError: module 'math' has no attribute 'prod'".

joan126 avatar Feb 23 '23 06:02 joan126

Upgrading the gcc version from 4.8.5 to 9.5.0 fixed this issue, but I met another issue: "AttributeError: module 'math' has no attribute 'prod'".

Great. The math issue is due to the Python version; try Python 3.8 instead with conda create -n <env-name> python=3.8.
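
For reference, math.prod only exists in Python 3.8 and later, which is why Python 3.7 raises that AttributeError:

    # Works on Python >= 3.8; on 3.7, math has no attribute 'prod'.
    import math
    print(math.prod([2, 3, 4]))  # 24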

FrankLeeeee avatar Feb 23 '23 06:02 FrankLeeeee

great...

joan126 avatar Feb 23 '23 06:02 joan126

Upgrading the gcc version from 4.8.5 to 9.5.0 fixed this issue, but I met another issue: "AttributeError: module 'math' has no attribute 'prod'".

Great. The math issue is due to the Python version; try Python 3.8 instead with conda create -n <env-name> python=3.8.

@joan126 #2837 has fixed the math issue. Thanks.

binmakeswell avatar Feb 23 '23 16:02 binmakeswell

Upgrading the gcc version from 4.8.5 to 9.5.0 fixed this issue, but I met another issue: "AttributeError: module 'math' has no attribute 'prod'".

My gcc version is 9.4.0 and I meet the same problem. Do I have to update gcc to 9.5.0?

yangjianxin1 avatar Feb 23 '23 17:02 yangjianxin1

Upgrading the gcc version from 4.8.5 to 9.5.0 fixed this issue, but I met another issue: "AttributeError: module 'math' has no attribute 'prod'".

Great. The math issue is due to the Python version; try Python 3.8 instead with conda create -n <env-name> python=3.8.

Could you please explain this a bit more? Is there a big difference?

secsilm avatar Feb 24 '23 01:02 secsilm

Upgrading the gcc version from 4.8.5 to 9.5.0 fixed this issue, but I met another issue: "AttributeError: module 'math' has no attribute 'prod'".

Great. The math issue is due to the Python version; try Python 3.8 instead with conda create -n <env-name> python=3.8.

Could you please explain this a bit more? Is there a big difference?

You can check out #2837 @secsilm

FrankLeeeee avatar Feb 24 '23 01:02 FrankLeeeee

Upgrading the gcc version from 4.8.5 to 9.5.0 fixed this issue, but I met another issue: "AttributeError: module 'math' has no attribute 'prod'".

Great. The math issue is due to the Python version; try Python 3.8 instead with conda create -n <env-name> python=3.8.

Could you please explain this a bit more? Is there a big difference?

You can check out #2837 @secsilm

Thanks for your quick reply. OK I got it.

secsilm avatar Feb 24 '23 01:02 secsilm

When I run examples/gpt/titans/run.sh, I meet a new error, "no module named colossalai._C.layernorm", with GCC version 9.5.0. Could you help take a look? @FrankLeeeee

joan126 avatar Feb 24 '23 08:02 joan126

@sourasis your problem is related to your CUDA environment, as it cannot find cublas_v2.h. Can you check if CUDA is available on your system and CUDA_HOME is set? Normally, you can find CUDA in /usr/local.

@FrankLeeeee The issue was that when I tried to install NVIDIA driver 525, it removed CUDA. To counter that, I used conda to install CUDA. But this installation didn't bring cuBLAS along with it. My CUDA_HOME is set to this conda installation of CUDA.

I have now installed nvidia-cublas-cu11 separately in the conda environment, and I can see that the cublas_v2.h file is there, but inside a directory separate from CUDA_HOME. If I could add this directory path to the -isystem parameters in "build.ninja" (/.cache/colossalai/torch_extensions/torch1.12_cu11.3/build.ninja), then it should work. But I can't find the file which creates build.ninja at runtime.

Can you please tell me which file creates build.ninja at runtime so that I can plug the cublas/include directory into it as well?
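
For context, this is the kind of workaround I am hoping for (an untested sketch; the include path below is only a guess at where nvidia-cublas-cu11 puts its headers): point the compiler at the extra headers via CPATH, which gcc treats like additional -I directories, before the JIT build runs.

    # Untested sketch: make cublas_v2.h visible to the JIT build without editing build.ninja.
    import os
    import site

    # Hypothetical header location for the pip-installed nvidia-cublas-cu11 package;
    # adjust to wherever cublas_v2.h actually lives in your environment.
    cublas_include = os.path.join(site.getsitepackages()[0], "nvidia", "cublas", "include")
    os.environ["CPATH"] = cublas_include + os.pathsep + os.environ.get("CPATH", "")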

sourasis avatar Feb 24 '23 10:02 sourasis