ColossalAI: Multi-node, multi-GPU training often times out. How can training automatically resume from a saved checkpoint after a timeout?

Open jiejie1993 opened this issue 1 year ago • 2 comments

Discussed in https://github.com/hpcaitech/ColossalAI/discussions/5027

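Originally posted by jiejie1993 on November 8, 2023: During multi-node, multi-GPU training, an NCCL timeout occurs. torch provides --max-restarts to restart training, but how can the latest saved model be loaded automatically? Using --load-checkpoint requires the saved model to be present on every node, yet during training the model is saved only on the master node. Copying it to all nodes by hand means the restart cannot be automatic. Is there a way to automatically restart interrupted training and resume from the latest saved model?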

Nov 08 '23 12:11 jiejie1993

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿

Title: Multi-machine and multi-card training is prone to timeout. If it times out, how to automatically resume training from the saved model?

Nov 08 '23 12:11 Issues-translate-bot

any update?

Nov 13 '23 03:11 xs1997zju
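
One possible approach, sketched here under assumptions rather than taken from the ColossalAI maintainers: when torchrun restarts the workers via --max-restarts, it re-runs the training script from the top, so the script itself can look up the newest checkpoint at startup instead of relying on a fixed --load-checkpoint path. This only works if every node can read the checkpoint directory, for example through a shared filesystem, which is an assumption not stated in the issue. The `step_<N>` directory naming, the `COMPLETE` marker file, and the function name `find_latest_checkpoint` are all hypothetical and chosen for illustration.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical naming scheme: each checkpoint lives in a directory "step_<N>".
CKPT_PATTERN = re.compile(r"^step_(\d+)$")


def find_latest_checkpoint(ckpt_root: str) -> Optional[Path]:
    """Return the checkpoint directory with the highest step, or None.

    Assumes the saving code writes an empty file named "COMPLETE"
    (a hypothetical marker) only after the checkpoint is fully written,
    so a checkpoint interrupted by a crash is skipped.
    """
    root = Path(ckpt_root)
    if not root.is_dir():
        return None

    best_step = -1
    best_path: Optional[Path] = None
    for child in root.iterdir():
        match = CKPT_PATTERN.match(child.name)
        if match is None or not child.is_dir():
            continue
        if not (child / "COMPLETE").exists():
            continue  # partially written checkpoint, ignore it
        step = int(match.group(1))
        if step > best_step:
            best_step, best_path = step, child
    return best_path
```

At the start of the training script, every rank would call `find_latest_checkpoint` on the shared directory and load from the returned path if it is not `None`, otherwise start from scratch. The actual load call depends on the training script and is deliberately left out here.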
