With 4M examples, the model hangs for a long time with no output after logging trainer.train
When fine-tuning llama-65b with LoRA on 4M instruction examples, training never starts, but after cutting the data down to 0.5M it trains fine. What could be the cause? I am training on 8 A100s; I checked that system RAM is not exhausted, and each GPU still has about 20 GB of free memory.
The log is as follows:
2023-04-12 21:01:30 - original_train.py[line:175] - INFO: num_gpus = 8, training_nums = 4112446, t_total = 249238, warmup_steps = 7477
start train...
2023-04-12 21:01:32 - original_train.py[line:175] - INFO: num_gpus = 8, training_nums = 4112446, t_total = 249238, warmup_steps = 7477
Using the WANDB_DISABLED environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
(the line above is printed 8 times, once per GPU process)
trainer.train
(printed 8 times, once per GPU process)
In my experience you just need to keep waiting (press Enter a few times)... I trained on 3.5M examples, and data loading plus mapping took an hour and a half before training actually started...
We will optimize the data loading and tokenization process later to speed it up as much as possible.
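Until that optimization lands, one common workaround is to tokenize the dataset once and cache the result on disk, so that subsequent runs skip the slow mapping step entirely and `trainer.train` starts almost immediately. Below is a minimal sketch of the idea using only the Python standard library; `tokenize_fn`, `CACHE_PATH`, and the whitespace "tokenizer" are placeholders for illustration, not part of BELLE's actual training code.

```python
import os
import pickle

CACHE_PATH = "tokenized_cache.pkl"  # hypothetical cache file name

def tokenize_fn(example):
    # Placeholder for a real tokenizer: split on whitespace.
    return {"input_ids": example["text"].split()}

def load_tokenized(examples):
    """Tokenize once, then reuse the on-disk cache on later runs."""
    if os.path.exists(CACHE_PATH):
        # Fast path: skip the expensive mapping step entirely.
        with open(CACHE_PATH, "rb") as f:
            return pickle.load(f)
    tokenized = [tokenize_fn(e) for e in examples]
    with open(CACHE_PATH, "wb") as f:
        pickle.dump(tokenized, f)
    return tokenized

if __name__ == "__main__":
    data = [{"text": "hello world"}, {"text": "foo bar baz"}]
    first = load_tokenized(data)   # slow path: tokenizes and writes the cache
    second = load_tokenized(data)  # fast path: loads from the cache
    print(first == second)  # True
```

With Hugging Face `datasets`, the same effect comes from the library's built-in `map` caching (and `num_proc` for parallel tokenization), so a second run with identical preprocessing reuses the cached Arrow files instead of re-mapping 4M examples.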
Could you share the original llama-65b and llama-13b models with me? I found that to use BELLE's Chinese-optimized models, the original llama weights are required; otherwise they cannot be converted into the needed format.
BelleGroup/BELLE-LLaMA-13B-2M-enc
Conversion script:
mkdir -p /path/to_finetuned_model
for f in "/path/to_encrypted"/*; do
  if [ -f "$f" ]; then
    python3 decrypt.py "$f" "/path/to_original_llama_13B/consolidated.00.pth" "/path/to_finetuned_model/"
  fi
done
This requires consolidated.00.pth, which I don't have. How did you get it? Could you share a copy with me? I need the 1.3B, 7B, 13B, and 65B versions.