
With 4M training examples, the model hangs for a long time with no response after the trainer.train log line

Open muziyongshixin opened this issue 3 years ago • 3 comments

When fine-tuning llama-65b with LoRA on 4M instruction examples, training never starts, but after reducing the data to 0.5M it trains fine. What could be the reason? I am training on 8 A100s; I checked that host memory is not exhausted and each GPU still has about 20 GB of free memory.

The log is as follows:

2023-04-12 21:01:30 - original_train.py[line:175] - INFO: num_gpus = 8, training_nums = 4112446, t_total = 249238, warmup_steps = 7477
start train...
2023-04-12 21:01:32 - original_train.py[line:175] - INFO: num_gpus = 8, training_nums = 4112446, t_total = 249238, warmup_steps = 7477
Using the WANDB_DISABLED environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none). [repeated 8 times, once per process]
trainer.train [repeated 8 times]

muziyongshixin  Apr 12 '23 13:04

In my experience you just need to keep waiting (press Enter a few times to confirm the terminal is still alive). I trained with 3.5M examples, and data loading plus mapping took about an hour and a half before training actually started.

lemuria-wchen  Apr 12 '23 13:04

In my experience you just need to keep waiting (press Enter a few times to confirm the terminal is still alive). I trained with 3.5M examples, and data loading plus mapping took about an hour and a half before training actually started.

We will optimize the data loading and tokenization process later to speed it up as much as possible.
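A minimal sketch of the idea behind that speedup (not the BELLE code): tokenize examples across worker processes instead of a single process, which is the same effect as passing num_proc to datasets.map in the Hugging Face datasets library. The tokenize function here is a stand-in, not a real tokenizer.

```python
# Sketch: parallel pre-tokenization of a large instruction dataset,
# so trainer.train() does not block on a single-process mapping pass.
from multiprocessing import Pool

def tokenize(example: str) -> list:
    # Stand-in for real tokenization (e.g. a SentencePiece tokenizer).
    return example.lower().split()

def parallel_map(examples: list, workers: int = 4) -> list:
    # Fan the mapping work out over `workers` processes; chunksize
    # amortizes inter-process communication for large datasets.
    with Pool(processes=workers) as pool:
        return pool.map(tokenize, examples, chunksize=256)

if __name__ == "__main__":
    data = ["Instruction: translate this sentence"] * 10_000
    tokenized = parallel_map(data)
    print(len(tokenized))  # 10000
```

With the real library, the equivalent knob is datasets.map(tokenize_fn, num_proc=N), which also caches the mapped result on disk so the hour-plus mapping pass only happens once.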

xianghuisun  Apr 12 '23 14:04

Could you share the original llama-65b and llama-13b models with me? I found that using BELLE's Chinese-optimized models requires the original llama weights; without them, the checkpoints cannot be converted into the required format.

BelleGroup/BELLE-LLaMA-13B-2M-enc

Conversion script:

mkdir /path/to_finetuned_model
for f in "/path/to_encrypted"/*; do
  if [ -f "$f" ]; then
    python3 decrypt.py "$f" "/path/to_original_llama_13B/consolidated.00.pth" "/path/to_finetuned_model/"
  fi
done

This requires consolidated.00.pth, which I do not have. How did you obtain it? Could you share a copy? I need the 1.3B, 7B, 13B, and 65B versions.

TestNLP  Apr 13 '23 12:04