
With 4M training examples, the model hangs for a long time with no response after the trainer.train log line

Open muziyongshixin opened this issue 3 years ago • 3 comments

When fine-tuning llama-65b with LoRA on 4M instruction examples, training never starts, but after reducing the data to 0.5M it trains fine. What could be the reason? I am training on 8 A100s; I checked that host memory is not exhausted and each GPU still has about 20 GB of free memory.

The log is as follows:

2023-04-12 21:01:30 - original_train.py[line:175] - INFO: num_gpus = 8, training_nums = 4112446, t_total = 249238, warmup_steps = 7477
start train...
2023-04-12 21:01:32 - original_train.py[line:175] - INFO: num_gpus = 8, training_nums = 4112446, t_total = 249238, warmup_steps = 7477
Using the WANDB_DISABLED environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none). [repeated 8 times, once per process]
trainer.train [repeated 8 times]

muziyongshixin  Apr 12 '23 13:04

In my experience you just need to keep waiting (press Enter a few times to confirm the terminal is still alive). I trained with 3.5M examples, and data loading plus mapping took about an hour and a half before training actually started.

lemuria-wchen  Apr 12 '23 13:04

In my experience you just need to keep waiting (press Enter a few times to confirm the terminal is still alive). I trained with 3.5M examples, and data loading plus mapping took about an hour and a half before training actually started.

We will optimize the data loading and tokenization process later to speed it up as much as possible.
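A minimal sketch of the idea behind that speedup (not the BELLE code): tokenize examples across worker processes instead of a single process, which is the same effect as passing num_proc to datasets.map in the Hugging Face datasets library. The tokenize function here is a stand-in, not a real tokenizer.

```python
# Sketch: parallel pre-tokenization of a large instruction dataset,
# so trainer.train() does not block on a single-process mapping pass.
from multiprocessing import Pool

def tokenize(example: str) -> list:
    # Stand-in for real tokenization (e.g. a SentencePiece tokenizer).
    return example.lower().split()

def parallel_map(examples: list, workers: int = 4) -> list:
    # Fan the mapping work out over `workers` processes; chunksize
    # amortizes inter-process communication for large datasets.
    with Pool(processes=workers) as pool:
        return pool.map(tokenize, examples, chunksize=256)

if __name__ == "__main__":
    data = ["Instruction: translate this sentence"] * 10_000
    tokenized = parallel_map(data)
    print(len(tokenized))  # 10000
```

With the real library, the equivalent knob is datasets.map(tokenize_fn, num_proc=N), which also caches the mapped result on disk so the hour-plus mapping pass only happens once.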

xianghuisun  Apr 12 '23 14:04

Could you share the original llama-65b and llama-13b models with me? I found that using BELLE's Chinese-optimized models requires the original llama weights; without them, the checkpoints cannot be converted into the required format.

BelleGroup/BELLE-LLaMA-13B-2M-enc

Conversion script:

mkdir /path/to_finetuned_model
for f in "/path/to_encrypted"/*; do
  if [ -f "$f" ]; then
    python3 decrypt.py "$f" "/path/to_original_llama_13B/consolidated.00.pth" "/path/to_finetuned_model/"
  fi
done

This requires consolidated.00.pth, which I do not have. How did you obtain it? Could you share a copy? I need the 1.3B, 7B, 13B, and 65B versions.

TestNLP  Apr 13 '23 12:04