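Multi-node, multi-GPU training is prone to timeouts. After a timeout, how can training automatically resume from an already-saved model?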

Open jiejie1993 opened this issue 1 year ago • 2 comments

Discussed in https://github.com/hpcaitech/ColossalAI/discussions/5027

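Originally posted by jiejie1993 November 8, 2023

During multi-node, multi-GPU training, an NCCL timeout occurs. torch has --max-restarts to restart the training, but how can the latest saved model be loaded automatically? Using --load-checkpoint requires every node to have the saved model, but during training the model is only saved on the master node. Copying it to all nodes by hand means the restart cannot be automatic. Is there any way to automatically restart an interrupted training run and resume from the most recently saved model?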

jiejie1993 avatar Nov 08 '23 12:11 jiejie1993

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


Title: Multi-machine and multi-card training is prone to timeout. If it times out, how to automatically resume training from the saved model?

Issues-translate-bot avatar Nov 08 '23 12:11 Issues-translate-bot

any update?

xs1997zju avatar Nov 13 '23 03:11 xs1997zju
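
One piece of the question, picking the newest saved checkpoint when a restart begins, can be sketched without any framework-specific calls. The snippet below is only an illustration under stated assumptions: the checkpoint directory is assumed to sit on storage that every node can read (for example a shared filesystem), which is not the setup described in the original post, and the `step_<N>` naming scheme and the function name `find_latest_checkpoint` are hypothetical, not part of ColossalAI or torch.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical naming scheme: each checkpoint is saved as "step_<N>".
_STEP_RE = re.compile(r"^step_(\d+)$")


def find_latest_checkpoint(ckpt_dir: str) -> Optional[Path]:
    """Return the checkpoint entry with the highest step number, or None.

    Assumes ckpt_dir is visible from every node (e.g. shared storage).
    """
    root = Path(ckpt_dir)
    if not root.is_dir():
        return None

    best_step = -1
    best_path: Optional[Path] = None
    for entry in root.iterdir():
        match = _STEP_RE.match(entry.name)
        if match is None:
            continue  # ignore anything that is not a checkpoint entry
        step = int(match.group(1))  # compare numerically, not as strings
        if step > best_step:
            best_step, best_path = step, entry
    return best_path
```

A training script could call this once at startup and, if it returns a path, pass that path to whatever checkpoint-loading routine the script already uses. Since a restart triggered by --max-restarts reruns the script from the top, the same lookup would then run on every restart. If checkpoints exist only on the master node, as in the post, the lookup returns nothing on the other nodes, so the shared-storage assumption is essential to this sketch.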