[BUG/Help] ImportError: /root/.cache/torch_extensions/py310_cu117/utils/utils.so: cannot open shared object file: No such file or directory
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
Running ds_train_finetune.sh always fails with this error:
File "/home/algo/mzh/ChatGLM-6B-main-0615/ptuning/main_copy.py", line 430, in
Expected Behavior
No response
Steps To Reproduce
PRE_SEQ_LEN=128
LR=1e-4
MASTER_PORT=$(shuf -n 1 -i 10000-65535)
deepspeed --include localhost:2,3 \
    --master_port $MASTER_PORT main.py \
    --deepspeed deepspeed.json \
    --do_train \
    --train_file ../../data/AdvertiseGen/debug/train.json \
    --test_file ../../data/AdvertiseGen/debug/dev.json \
    --prompt_column content \
    --response_column summary \
    --overwrite_cache \
    --model_name_or_path ../../THUDM/chatglm-6b \
    --output_dir ./ds/output2/adgen-chatglm-6b-ft-$LR \
    --overwrite_output_dir \
    --max_source_length 64 \
    --max_target_length 64 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --predict_with_generate \
    --max_steps 5000 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate $LR \
    --fp16 \
    --pre_seq_len $PRE_SEQ_LEN
I only changed the file paths and left everything else untouched; running the script produces the error above.
Environment
- OS:ubuntu22.04
- Python:3.10.4
- Transformers:4.27.1
- PyTorch:2.0.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :True
Anything else?
https://github.com/THUDM/ChatGLM-6B/issues/1154 https://github.com/THUDM/ChatGLM-6B/issues/761 These two issues describe the same problem as mine, but neither appears to have been resolved. Does anyone know what causes this?
Go to the directory /root/.cache/torch_extensions/py310_cu117/utils/; there is a build.ninja file there.
Run the `ninja` command in that directory to build utils.so manually.
@cycoe But there is no utils.so file in this directory. I think DeepSpeed was not built successfully. What do you think?
There is no utils.so in this directory because an error occurred while building the library. You can build it yourself with the `ninja` command, which will also show you why the build failed.
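For anyone who prefers to script the manual rebuild suggested above, here is a minimal sketch. It assumes the extension directory from the error message in this issue and that `ninja` is on PATH; the function name `rebuild_extension` is made up for illustration and is not part of DeepSpeed or PyTorch.

```python
import shutil
import subprocess
from pathlib import Path


def rebuild_extension(ext_dir: Path) -> int:
    """Run ninja inside ext_dir and return its exit code.

    Returns -1 when there is nothing to run (no build.ninja, or no ninja binary).
    """
    if not (ext_dir / "build.ninja").is_file():
        print(f"no build.ninja in {ext_dir}; the extension was never configured")
        return -1
    ninja = shutil.which("ninja")
    if ninja is None:
        print("ninja not found on PATH")
        return -1
    # -v makes ninja echo each compiler command, so a failing step
    # (for example a missing g++) is visible in the output.
    result = subprocess.run([ninja, "-v"], cwd=ext_dir)
    return result.returncode
```

Calling `rebuild_extension(Path("/root/.cache/torch_extensions/py310_cu117/utils"))` should print the compiler output; a non-zero return value means the build failed and the messages above it point to the reason.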
@cycoe OK. I found that DeepSpeed did not build successfully because g++ was not installed. Once DeepSpeed built successfully, the problem was solved. Thank you very much.
But I ran into other problems. I am a beginner and not very clear on how multi-GPU training with DeepSpeed works. When I run my program, there are two problems. First, why is the running time on multiple GPUs close to, or even longer than, that on a single GPU? Second, why is the GPU memory usage on multiple GPUs higher than on a single GPU? Can you help me with these questions?
I installed g++, and that solved this problem.
Installing g++ works: yum install -y gcc-c++
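Since the root cause in this thread was a missing compiler, a quick pre-flight check before launching training may save a debugging round. The sketch below only looks for executables on PATH; the helper name `missing_build_tools` and the default tool list are assumptions for illustration, not something DeepSpeed provides. On Debian/Ubuntu systems the package is installed with `apt-get install -y g++` rather than `yum`.

```python
import shutil


def missing_build_tools(tools=("g++", "ninja")):
    """Return the tool names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = missing_build_tools()
if missing:
    print("missing build tools:", ", ".join(missing))
else:
    print("g++ and ninja are both available")
```

If anything is reported missing, install it and delete the stale extension directory under `~/.cache/torch_extensions` so the extension is rebuilt on the next run.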