ColossalAI
[BUG]: ModuleNotFoundError: No module named 'colossalai._C.cpu_adam'
Describe the bug
- Run sh examples/train_sft.sh
- The error output is as follows:
[04/19/23 15:25:30] INFO colossalai - colossalai - INFO: /home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[04/19/23 15:25:31] INFO colossalai - colossalai - INFO: /home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/colossalai/context/parallel_context.py:558 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 42, python random: 42, ParallelMode.DATA: 42, ParallelMode.TENSOR: 42,the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/colossalai/initialize.py:115 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
Loading checkpoint shards: 100%|██████████| 33/33 [00:15<00:00, 2.18it/s]
Traceback (most recent call last):
File "/home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 161, in load
op_module = self.import_op()
File "/home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 110, in import_op
return importlib.import_module(self.prebuilt_import_path)
File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
subprocess.run(
File "/usr/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "examples/train_sft.py", line 189, in <module>
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 439899) of binary: /home/jovyan/work/projects/Example/ColossalAI/venv/bin/python
Traceback (most recent call last):
File "/home/jovyan/work/projects/Example/ColossalAI/venv/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/jovyan/work/projects/Example/ColossalAI/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
examples/train_sft.py FAILED
Failures: <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2023-04-19_15:26:57
host : b583500e367e
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 439899)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
I have already looked at several related issues, and none of them solved my problem. I also built successfully with CUDA_EXT=1 pip install . but I still get the error above. What could be the cause?
Environment
gcc version 9.3.0
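Since the traceback above ends in a failed ninja build, it can help to report the whole build toolchain rather than gcc alone. The following is a minimal sketch, not part of ColossalAI: the helper names tool_version and toolchain_report are hypothetical, and the list of tools (gcc, g++, nvcc, ninja) is an assumption about what the extension build needs.

```python
import shutil
import subprocess


def tool_version(tool, flag="--version"):
    """Return the first line of `<tool> --version`, or None if the tool is not on PATH.

    Hypothetical helper for illustration only.
    """
    path = shutil.which(tool)
    if path is None:
        return None
    result = subprocess.run([path, flag], capture_output=True, text=True)
    # Some tools print their version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


def toolchain_report(tools=("gcc", "g++", "nvcc", "ninja")):
    """Map each tool name to its version line (None when it is missing)."""
    return {tool: tool_version(tool) for tool in tools}


if __name__ == "__main__":
    for tool, version in toolchain_report().items():
        print(f"{tool}: {version if version is not None else 'not found on PATH'}")
```

Pasting this output into the Environment section makes it easier to tell a missing compiler or ninja binary apart from an install conflict.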
Did you run pip install colossalai before running pip install . in the repository folder? Having both installs may have caused a conflict. You might want to run pip uninstall colossalai first.
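One way to check for the conflict described above is to see which copy of the package Python actually imports and whether the prebuilt extension from the issue title is visible. This is a rough sketch using only the standard library; module_available and module_origin are hypothetical helper names, and colossalai._C.cpu_adam is taken from the error message, not from any documented API.

```python
import importlib.util


def module_available(name):
    """Return True if `name` can be located without raising on a missing parent."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec raises ModuleNotFoundError when a parent package is missing.
        return False


def module_origin(name):
    """Return the file a module would be loaded from, or None if it is not found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # Shows whether the import resolves to site-packages or to the source tree.
    print("colossalai loaded from:", module_origin("colossalai"))
    print("prebuilt cpu_adam found:", module_available("colossalai._C.cpu_adam"))
```

If the reported origin is not the install you just built, or the prebuilt module is not found, uninstalling every copy and reinstalling once is a reasonable next step.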
I have the same error while running the language/OPT example using run_gemini.sh with colossalai version 0.3.0. Is there any fix for this?
Same issue when running the gpt2 example in the repo.
I've met the same issue using "CUDA_EXT=1 pip install",