萧停云

Results: 8 comments by 萧停云

> > 3.0.0 may require CUDA 11; you could check out the 2.2.1 version and try.
>
> I tried CUDA 11 and PyTorch 1.11.0 and ran `python lightseq/examples/inference/python/export/huggingface/hf_bert_export.py` (tag 3.0.1), but the problem still...
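For anyone hitting the same thing, a minimal sanity check of the runtime (illustrative only, not part of LightSeq) to confirm the CUDA 11 / PyTorch 1.11 combination the thread suggests:

```python
import torch

# Versions reported by the PyTorch build itself.
print("PyTorch:", torch.__version__)              # expecting 1.11.0 here
print("CUDA (torch build):", torch.version.cuda)  # expecting an 11.x value
print("CUDA device available:", torch.cuda.is_available())
```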

When using run_finetune_with_lora.sh, a single GPU gets as far as the model-training stage but then errors out, while two GPUs hang at the data-processing stage. Below is the single-GPU log (A6000, 48 GB VRAM):

[2023-04-09 06:06:04,824] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-09 06:06:04,839] [INFO] [runner.py:550:main] cmd = /data/anaconda3/envs/ljy_lmflow/bin/python3.9 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1...
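For the two-GPU hang, a standalone communication test can help separate an NCCL problem from LMFlow itself. This is a generic sketch (not LMFlow code) that assumes two visible GPUs; save it under any name, e.g. `nccl_check.py`, and launch it with `torchrun --nproc_per_node=2 nccl_check.py`:

```python
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # One all_reduce across both GPUs: if this hangs too, the problem is
    # GPU-to-GPU communication (NCCL / P2P), not the training script.
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce ok, value = {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If this hangs as well, launching with `NCCL_DEBUG=INFO` (and, on some multi-GPU boards, `NCCL_P2P_DISABLE=1`) usually points at the failing transport.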

When using run_finetune.sh, a single GPU errors out during training, and two GPUs hang at the model-loading stage. Below is the single-GPU log:

[2023-04-09 06:37:33,738] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-09 06:37:33,754] [INFO] [runner.py:550:main] cmd = /data/anaconda3/envs/ljy_lmflow/bin/python3.9 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1...
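One detail worth noting in the log above: the `--world_info` argument printed by the DeepSpeed launcher is just base64-encoded JSON, so it can be decoded to confirm which GPUs the run actually sees (a small illustrative snippet, not LMFlow code):

```python
import base64
import json

# The DeepSpeed launcher logs its worker layout as base64-encoded JSON
# in the --world_info argument; decoding it shows which GPUs it will use.
world_info = "eyJsb2NhbGhvc3QiOiBbMF19"  # copied from the log above
print(json.loads(base64.b64decode(world_info)))
# -> {'localhost': [0]}, i.e. this launch only sees GPU 0
```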

> Thanks for your interest in LMFlow! Could you please check `log/finetune/train.err` to see the detailed error message? Also, it would be nice if you could provide the hardware settings...
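In case it helps others following the same advice, the file in question can be inspected with something like the sketch below (the path assumes the default layout mentioned in the reply):

```python
from pathlib import Path

# Print the tail of the error log that the finetune script writes;
# the last lines usually contain the actual Python traceback.
err_log = Path("log/finetune/train.err")
if err_log.exists():
    print("\n".join(err_log.read_text(errors="replace").splitlines()[-50:]))
else:
    print(f"{err_log} not found; check where the script writes its logs.")
```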

1. I also encountered the same problem. Your solution works for opt-1.3B, but when training gpt-3.5B it gets stuck in the loop for a long time. The larger the model, the...

@tjruwase I can run through with the method of 00INDEX. However, if I don't modify the source code, then as long as offload is turned on, whether it is ZeRO-2 or...
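For reference, "offload is turned on" here refers to the `offload_optimizer` block of the DeepSpeed ZeRO config. The snippet below only illustrates that shape as a Python dict (the keys come from the DeepSpeed ZeRO documentation, the values are made up; the real settings live in the project's ds_config file):

```python
# Illustrative ZeRO-2 config with CPU optimizer offload enabled.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 2,              # the same offload block applies to stage 3
        "offload_optimizer": {
            "device": "cpu",     # this is the "offload is turned on" switch
            "pin_memory": True,
        },
    },
}
```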

Also, at inference time, does a bos_token need to be prepended to the original input?
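On the bos_token question, a generic way to check whether the tokenizer already prepends it is to compare the two encodings directly (illustrative sketch using GPT-2 as a stand-in; substitute the model actually being served):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in model

text = "Hello, world."
ids_default = tokenizer(text, add_special_tokens=True).input_ids
ids_with_bos = tokenizer(tokenizer.bos_token + text, add_special_tokens=False).input_ids

# Comparing the first tokens shows whether bos_token is already added;
# GPT-2, for instance, does not prepend it automatically.
print(ids_default[:3], ids_with_bos[:3], tokenizer.bos_token_id)
```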