ChatGLM-6B [BUG/Help] ImportError: /root/.cache/torch_extensions/py310_cu117/utils/utils.so: cannot open shared object file: No such file or directory

Is there an existing issue for this?

[X] I have searched the existing issues

Current Behavior

Running ds_train_finetune.sh always fails with the following error:

  File "/home/algo/mzh/ChatGLM-6B-main-0615/ptuning/main_copy.py", line 430, in main()
  File "/home/algo/mzh/ChatGLM-6B-main-0615/ptuning/main_copy.py", line 369, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/algo/mzh/ChatGLM-6B-main-0615/ptuning/trainer.py", line 1635, in train
    return inner_training_loop(
  File "/home/algo/mzh/ChatGLM-6B-main-0615/ptuning/trainer.py", line 1704, in _inner_training_loop
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
  File "/home/algo/mzh/ChatGLM-6B-main-0615/myenv/lib/python3.10/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
    deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/home/algo/mzh/ChatGLM-6B-main-0615/myenv/lib/python3.10/site-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/algo/mzh/ChatGLM-6B-main-0615/myenv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 309, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/algo/mzh/ChatGLM-6B-main-0615/myenv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1185, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/home/algo/mzh/ChatGLM-6B-main-0615/myenv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1420, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/home/algo/mzh/ChatGLM-6B-main-0615/myenv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 154, in __init__
    util_ops = UtilsBuilder().load()
  File "/home/algo/mzh/ChatGLM-6B-main-0615/myenv/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 454, in load
    return self.jit_load(verbose)
  File "/home/algo/mzh/ChatGLM-6B-main-0615/myenv/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 497, in jit_load
    op_module = load(name=self.name,
  File "/home/algo/mzh/ChatGLM-6B-main-0615/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/home/algo/mzh/ChatGLM-6B-main-0615/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1535, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/home/algo/mzh/ChatGLM-6B-main-0615/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1929, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "", line 571, in module_from_spec
  File "", line 1176, in create_module
  File "", line 241, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py310_cu117/utils/utils.so: cannot open shared object file: No such file or directory

I checked, and utils/utils.so indeed does not exist. I have been stuck on this for a long time and really do not know how to solve it.

Expected Behavior

No response

Steps To Reproduce

PRE_SEQ_LEN=128
LR=1e-4

MASTER_PORT=$(shuf -n 1 -i 10000-65535)

deepspeed --include localhost:2,3 --master_port $MASTER_PORT main.py \
    --deepspeed deepspeed.json \
    --do_train \
    --train_file ../../data/AdvertiseGen/debug/train.json \
    --test_file ../../data/AdvertiseGen/debug/dev.json \
    --prompt_column content \
    --response_column summary \
    --overwrite_cache \
    --model_name_or_path ../../THUDM/chatglm-6b \
    --output_dir ./ds/output2/adgen-chatglm-6b-ft-$LR \
    --overwrite_output_dir \
    --max_source_length 64 \
    --max_target_length 64 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --predict_with_generate \
    --max_steps 5000 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate $LR \
    --fp16 \
    --pre_seq_len $PRE_SEQ_LEN
I only changed the file paths and left everything else untouched. Running the script produces the error above.

Environment

- OS:ubuntu22.04
- Python:3.10.4
- Transformers:4.27.1
- PyTorch:2.0.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :True

Anything else?

https://github.com/THUDM/ChatGLM-6B/issues/1154 https://github.com/THUDM/ChatGLM-6B/issues/761 These two issues describe the same problem as mine, but neither seems to have been resolved. Does anyone know what is causing this?

Jun 19 '23 16:06 niuhuluzhihao

Go to the directory /root/.cache/torch_extensions/py310_cu117/utils/. There is a build.ninja file there. Run the ninja command in that directory to build utils.so manually.

Jun 21 '23 09:06 cycoe

@cycoe But there is no 'utils.so' file in this directory. I think DeepSpeed was not built successfully. What do you think?

Jun 22 '23 02:06 niuhuluzhihao

There is no utils.so in this directory because an error occurred while building the library. You can build it yourself with the ninja command, and the build output will also show you why it failed.
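A minimal sketch of the manual rebuild suggested above, in case it helps others who land here. The function name `rebuild_ext` is made up for illustration, and the default path is simply the one from the traceback in this issue; yours may differ. It assumes ninja is on PATH.

```shell
# Sketch: rebuild a JIT-compiled torch extension by hand so that the real
# build error becomes visible. rebuild_ext is a hypothetical helper name.

rebuild_ext() {
    # Default directory is the one reported in the ImportError of this issue.
    ext_dir="${1:-/root/.cache/torch_extensions/py310_cu117/utils}"

    # Without build.ninja there is nothing to rebuild in this directory.
    if [ ! -f "$ext_dir/build.ninja" ]; then
        echo "no build.ninja in $ext_dir" >&2
        return 1
    fi

    # -v makes ninja print every compiler command it runs, so a missing
    # compiler such as g++ shows up directly in the output.
    if ! (cd "$ext_dir" && ninja -v); then
        echo "ninja build failed; see the output above" >&2
        return 2
    fi

    # Succeed only if the shared object the import needs now exists.
    [ -f "$ext_dir/utils.so" ]
}

# Usage:
#   rebuild_ext              # path from the traceback
#   rebuild_ext /some/dir    # hypothetical alternative cache location
```

If the function returns 2, the compiler error printed by ninja is the thing to fix before launching training again.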

Jun 22 '23 03:06 cycoe

@cycoe OK. I found that DeepSpeed did not build successfully because g++ was not installed. Once DeepSpeed built successfully, the problem was solved. Thank you very much.

But I ran into other problems. I am a beginner and not very clear on how multi-GPU training with DeepSpeed works. When I run my program, I have two questions. First, why is the running time on multiple GPUs close to, or even longer than, the running time on a single GPU? Second, why is the GPU memory usage with multiple GPUs higher than with a single GPU? Can you help me with these questions?

Jun 25 '23 08:06 niuhuluzhihao

I installed g++ and that solved the problem.

Jul 21 '23 06:07 niuhuluzhihao

Installing g++ works: yum install -y gcc-c++
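For anyone who wants to check this before starting a long run, here is a rough sketch that tests whether the build tools are on PATH. The helper names `have_tool` and `check_build_tools` are made up for illustration. The yum line is the command from this comment; the apt-get line is my assumption for the Ubuntu 22.04 system listed under Environment.

```shell
# Sketch: check that the tools needed for the JIT build are installed.
# have_tool and check_build_tools are hypothetical helper names.

have_tool() {
    # command -v succeeds only if the name resolves to something runnable.
    command -v "$1" >/dev/null 2>&1
}

check_build_tools() {
    missing=""
    for tool in g++ ninja; do
        have_tool "$tool" || missing="$missing $tool"
    done

    if [ -n "$missing" ]; then
        echo "missing build tools:$missing" >&2
        echo "Ubuntu (assumed): apt-get install -y g++" >&2
        echo "CentOS/RHEL:      yum install -y gcc-c++" >&2
        return 1
    fi
    echo "g++ and ninja found"
}

# Usage:
#   check_build_tools && sh ds_train_finetune.sh
```

Checking up front is only a convenience. The failed build still has to be rerun after the compiler is installed.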

Nov 17 '23 02:11 believe563