grantchenhuarong comments

Results: 34 comments by grantchenhuarong

Multi-GPU training error...

I ran into a similar situation. Is there a known fix? ![437b1bb3e05a1546172f6cb258a6376](https://github.com/yangjianxin1/Firefly/assets/44857880/89051d99-467b-488b-add3-6eca8e00295d)

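Multi-GPU training error...

According to what I have read online, `nohup` alone is not reliable for background jobs: if the terminal exits abnormally, a SIGTERM signal is sent to the process, and in the end everything is killed. Two suggestions: first, do not simply close the terminal window, log out with `exit`; second, try the commands below.

    $ nohup bash train.sh > train.log 2>&1 &
    $ disown

This way the command keeps running even if the connection drops.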
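The same idea can be sketched in Python for cases where training is launched from a script rather than a shell. This is a minimal sketch assuming a POSIX system: `launch_detached` is a hypothetical helper name, and it approximates `nohup ... &` plus `disown` by starting the child in its own session with output redirected to a log file.

```python
import subprocess


def launch_detached(cmd, log_path):
    """Start cmd in its own session, with stdout and stderr appended to log_path."""
    log = open(log_path, "ab")
    try:
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,      # no dependency on the terminal's stdin
            stdout=log,
            stderr=subprocess.STDOUT,      # same effect as 2>&1
            start_new_session=True,        # POSIX: child calls setsid()
        )
    finally:
        log.close()  # the child keeps its own copy of the file descriptor
    return proc
```

Calling `launch_detached(["bash", "train.sh"], "train.log")` mirrors the shell command above; the returned object's `pid` can be noted down for later monitoring.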

Launching finetune_deepspeed fails with [ERROR] [launch.py:324:sigkill_handler]

`ds_report` core dumps. ![image](https://github.com/Facico/Chinese-Vicuna/assets/44857880/ee1447fe-d8cd-4abb-a8e8-01b3e37dfec3)

Launching finetune_deepspeed fails with [ERROR] [launch.py:324:sigkill_handler]

Checked the `dmesg` output, following https://github.com/microsoft/DeepSpeed/issues/2632 ![image](https://github.com/Facico/Chinese-Vicuna/assets/44857880/4c7e0861-bc7f-46e6-b155-a6f072936aee)

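Launching finetune_deepspeed fails with [ERROR] [launch.py:324:sigkill_handler]

Sigh. The model does not fit on the 4090, and system RAM cannot hold it either, so the OS stepped in at that point... ![image](https://github.com/Facico/Chinese-Vicuna/assets/44857880/2c3c5164-0475-4ef5-a4cf-a121827587eb)

OK, switching to the single-machine version avoids this problem:

    python finetune.py --data_path "./sample/merge_sample.json" \
        --output_path "lora-Vicuna" \
        --model_path "/data/ftp/models/llama/7b" \
        --eval_steps 200 \
        --save_steps 200 \
        --test_size 1

However, with only 11 GB of GPU memory, I keep hitting GPU OOM even after reducing the batch-size parameters. Has anyone actually gotten finetune running on a 2080 Ti?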
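To confirm that the kernel OOM killer terminated a process, the `dmesg` text can be scanned for its "Killed process" lines. The following is a minimal sketch: `find_oom_kills` is a hypothetical helper, and the pattern assumes the common Linux message form `Killed process <pid> (<name>)`, which may differ between kernel versions.

```python
import re

# Assumed kernel message shape, for example:
#   Out of memory: Killed process 12345 (python) total-vm:...
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(dmesg_text):
    """Return a list of (pid, process_name) for each OOM-killer victim found."""
    return [(int(pid), name) for pid, name in OOM_KILL.findall(dmesg_text)]
```

The input would typically be the captured output of `dmesg`, which on some systems requires root privileges to read.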

finetune reports an error when run on a 2080 Ti

I confirmed that training itself runs normally, but an exception is raised when the model is saved...

    model.save_pretrained(OUTPUT_DIR)
    torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; 10.75 GiB total capacity; 9.56 GiB already allocated; 41.50 MiB free; 9.92 GiB reserved in total...
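The figures in that error message can be read as simple arithmetic. The sketch below only restates the numbers quoted in the message; the variable names are illustrative, and the interpretation of "reserved minus allocated" as memory cached by PyTorch's allocator but not currently in use is the usual reading rather than something stated in this thread.

```python
MIB_PER_GIB = 1024

# Values copied from the error message above.
total_gib = 10.75
allocated_gib = 9.56
reserved_gib = 9.92
free_mib = 41.50
requested_mib = 44.00

# Reserved by PyTorch's caching allocator but not held by live tensors.
cached_unused_mib = round((reserved_gib - allocated_gib) * MIB_PER_GIB, 2)

# Device memory not reserved by this process's PyTorch allocator
# (CUDA context, other processes, and the small free remainder).
not_reserved_mib = round((total_gib - reserved_gib) * MIB_PER_GIB, 2)

# The allocation fails because the request exceeds what the device has free.
request_fits = requested_mib <= free_mib

print(cached_unused_mib, not_reserved_mib, request_fits)
```

So the card was within a few MiB of full: the 44 MiB request exceeded the 41.50 MiB still free, even though several hundred MiB were reserved but unallocated.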

finetune reports an error when run on a 2080 Ti

I downgraded transformers from 4.28.1 back to 4.28.0.dev, with a similar result. Training uses 8.2 GB in total; memory only blows up when the model is saved, probably while the weights are being cloned. The output is as follows.

    File "finetune.py", line 278, in <module>
      model.save_pretrained(OUTPUT_DIR)
    File "/home/ai/.conda/envs/chinesevicuna/lib/python3.8/site-packages/peft/peft_model.py", line 103, in save_pretrained
      output_state_dict = get_peft_model_state_dict(self, kwargs.get("state_dict", None))
    File "/home/ai/.conda/envs/chinesevicuna/lib/python3.8/site-packages/peft/utils/save_and_load.py", line 31, in get_peft_model_state_dict
      state_dict = model.state_dict()
    File "finetune.py",...

finetune reports an error when run on a 2080 Ti

Could the cause be an automatic precision conversion that happens when the model is saved?

finetune reports an error when run on a 2080 Ti

Thanks, that was indeed it. The problem is solved.

    (chinesevicuna) ai@ai-2080ti:~/src/Chinese-Vicuna$ pip list|grep bitsandbytes
    bitsandbytes 0.38.1
    (chinesevicuna) ai@ai-2080ti:~/src/Chinese-Vicuna$ pip install bitsandbytes==0.37.2
    Collecting bitsandbytes==0.37.2
      Using cached bitsandbytes-0.37.2-py3-none-any.whl (84.2 MB)
    Installing collected packages: bitsandbytes
      Attempting uninstall: bitsandbytes
        Found existing installation:...
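A quick check of the installed bitsandbytes version can be scripted before starting a long run. This is a minimal sketch: `installed_version` and `matches_known_good` are hypothetical helpers, and the only facts taken from this thread are that 0.38.1 showed the save-time OOM and 0.37.2 resolved it; nothing is assumed about other versions.

```python
from importlib import metadata

KNOWN_GOOD = "0.37.2"  # the version that resolved the issue in this thread


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_known_good(installed, known_good=KNOWN_GOOD):
    """True only when the installed version is exactly the known-good one."""
    return installed == known_good
```

For example, `matches_known_good(installed_version("bitsandbytes"))` returns `False` in the 0.38.1 environment shown above and `True` after the downgrade.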