ChatGLM-6B

[Help] Multi-GPU training keeps failing with cache/torch_extensions/py38_cu113/utils/utils.so: cannot open shared object file: No such file or directory

Open shishijier opened this issue 2 years ago • 4 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

Loading extension module utils...
Traceback (most recent call last):
  File "main.py", line 431, in <module>
    main()
  File "main.py", line 370, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/disk1/shisj/project/ChatGLM-6B/ptuning/trainer.py", line 1635, in train
    return inner_training_loop(
  File "/disk1/shisj/project/ChatGLM-6B/ptuning/trainer.py", line 1704, in _inner_training_loop
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
  File "/disk1/shisj/anaconda3/envs/glm/lib/python3.8/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
    deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/disk1/shisj/anaconda3/envs/glm/lib/python3.8/site-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/disk1/shisj/anaconda3/envs/glm/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/disk1/shisj/anaconda3/envs/glm/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1167, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/disk1/shisj/anaconda3/envs/glm/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1398, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/disk1/shisj/anaconda3/envs/glm/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 154, in __init__
    util_ops = UtilsBuilder().load()
  File "/disk1/shisj/anaconda3/envs/glm/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
    return self.jit_load(verbose)
  File "/disk1/shisj/anaconda3/envs/glm/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
    op_module = load(name=self.name,
  File "/disk1/shisj/anaconda3/envs/glm/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1202, in load
    return _jit_compile(
  File "/disk1/shisj/anaconda3/envs/glm/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1450, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/disk1/shisj/anaconda3/envs/glm/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1844, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "", line 556, in module_from_spec
  File "", line 1101, in create_module
  File "", line 219, in _call_with_frames_removed
ImportError: /disk1/shisj/cache/torch_extensions/py38_cu113/utils/utils.so: cannot open shared object file: No such file or directory

During multi-GPU training, the run fails because the utils.so file cannot be found.

Expected Behavior

No response

Steps To Reproduce

Environment

- OS: CentOS 7.9.2009
- Python: 3.8
- Transformers: 4.27.1
- PyTorch: 1.12.1
- CUDA: 11.3
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`): True

Anything else?

No response

shishijier, Apr 22 '23 16:04

Make sure you have installed ninja. You can install it with `conda install ninja`.

Chiang97912, Apr 24 '23 04:04
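As a quick way to act on the suggestion above, the following is a minimal sketch that checks whether a `ninja` executable is visible to the same Python environment that runs training. The helper name `ninja_version` is hypothetical and not part of PyTorch or DeepSpeed; it only wraps `ninja --version` with the standard library.

```python
import shutil
import subprocess
from typing import Optional


def ninja_version() -> Optional[str]:
    """Return the output of `ninja --version`, or None if ninja is not usable.

    Hypothetical helper for illustration; it is not a PyTorch or DeepSpeed API.
    """
    exe = shutil.which("ninja")  # looks ninja up on the current PATH
    if exe is None:
        return None
    try:
        result = subprocess.run(
            [exe, "--version"],
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        # The binary exists but could not be executed or returned an error.
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    version = ninja_version()
    if version is None:
        print("ninja not found on PATH; try `conda install ninja`")
    else:
        print(f"ninja {version}")
```

Run it with the interpreter of the training environment (here, the `glm` conda environment from the traceback), since a ninja installed elsewhere may not be on that environment's PATH.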

I ran into the same problem. Have you solved it yet?

roki1031, Jun 06 '23 08:06

> Make sure you have installed ninja. You can install it with `conda install ninja`.

I ran `ninja --version` and the result is 1.11.1.git.kitware.jobserver-1

roki1031, Jun 06 '23 09:06

Go to the /disk1/shisj/cache/torch_extensions/py38_cu113/utils/ directory, then compile utils.so manually with ninja.

cycoe, Jun 21 '23 09:06
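The manual rebuild described above could be scripted roughly as follows. This is a sketch under the assumption that the JIT step already wrote a `build.ninja` file into the build directory; if that file is missing, there is nothing for ninja to build and the function reports failure. The name `rebuild_extension` is hypothetical.

```python
import subprocess
from pathlib import Path


def rebuild_extension(build_dir: str) -> bool:
    """Run ninja inside build_dir; return True only if the build succeeds.

    Hypothetical helper for illustration. Assumes build_dir is the directory
    named in the ImportError, e.g. .../torch_extensions/py38_cu113/utils/.
    """
    directory = Path(build_dir)
    if not (directory / "build.ninja").is_file():
        # No build script here, so ninja has nothing to work from.
        return False
    try:
        # -v prints the full compiler commands, which helps when the build fails.
        completed = subprocess.run(["ninja", "-v"], cwd=directory)
    except OSError:
        # ninja itself is missing or not executable.
        return False
    return completed.returncode == 0
```

A `False` result with an empty directory points back at the build never having started, rather than at a compile error.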

> Go to the /disk1/shisj/cache/torch_extensions/py38_cu113/utils/ directory, then compile utils.so manually with ninja.

There is nothing in this folder. PS: I reinstalled ninja and it worked! I still don't know why.

eziohzy, Jun 30 '23 04:06
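The thread does not establish why the folder was empty. One plausible explanation, stated here as an assumption, is that an earlier build attempt failed before producing utils.so and left an empty or partial build directory behind. Under that assumption, a sketch like the one below removes the stale directory so that the next training run triggers a fresh JIT build. `clear_extension_cache` is a hypothetical helper, and `cache_root` should be the `torch_extensions` directory shown in your own error message.

```python
import shutil
from pathlib import Path
from typing import List


def clear_extension_cache(cache_root: str, name: str = "utils") -> List[Path]:
    """Delete the build directory of extension `name` under cache_root.

    Hypothetical helper for illustration. Checks both `<cache_root>/<name>`
    and `<cache_root>/<tag>/<name>` (for example, tag = py38_cu113) and
    returns the list of directories that were removed.
    """
    root = Path(cache_root)
    removed: List[Path] = []
    if not root.is_dir():
        return removed
    # Collect candidates first so deletion does not interfere with globbing.
    candidates = [root / name, *sorted(root.glob(f"*/{name}"))]
    for candidate in candidates:
        if candidate.is_dir():
            shutil.rmtree(candidate)
            removed.append(candidate)
    return removed
```

Only the directory for the named extension is removed; other cached extensions under the same root are left alone. Stop all training processes before running it.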