萧停云
> I'm using release version 3.0.0.
> > 3.0.0 may require CUDA 11; you could check out version 2.2.1 and try that.
> >
> > I tried CUDA 11 and PyTorch 1.11.0 with `python lightseq/examples/inference/python/export/huggingface/hf_bert_export.py` (tag 3.0.1), and the problem still...
When using `run_finetune_with_lora.sh`, a single-GPU run gets as far as the model training stage but then fails with an error. With two GPUs, it hangs at the data processing stage. Below is the single-GPU log (A6000, 48 GB VRAM):
[2023-04-09 06:06:04,824] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-09 06:06:04,839] [INFO] [runner.py:550:main] cmd = /data/anaconda3/envs/ljy_lmflow/bin/python3.9 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1...
When using `run_finetune.sh`, single-GPU training fails with an error during training, and with two GPUs it hangs at the model loading stage. Below is the single-GPU log:
[2023-04-09 06:37:33,738] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-09 06:37:33,754] [INFO] [runner.py:550:main] cmd = /data/anaconda3/envs/ljy_lmflow/bin/python3.9 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1...
> Thanks for your interest in LMFlow! Could you please check `log/finetune/train.err` to see the detailed error message? Also, it would be nice if you could provide the hardware settings...
1. I also ran into the same problem. Your solution works for opt-1.3B, but when training gpt-3.5B, it gets stuck in the loop for a long time. The larger the model, the...
@tjruwase I can get it to run using the 00INDEX method. However, if I don't modify the source code, then as long as offload is turned on, whether it is zero-2 or...
Also, at prediction time, do we need to prepend `bos_token` to the original input?