ColossalAI
                                
Multi-node, multi-GPU training often times out. After a timeout, how can training automatically resume from a saved model?
Discussed in https://github.com/hpcaitech/ColossalAI/discussions/5027
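Originally posted by jiejie1993 November 8, 2023. During multi-node, multi-GPU training, an NCCL timeout occurs. torch provides --max-restarts to restart training, but how can the most recently saved model be loaded automatically? Using --load-checkpoint requires the saved model to be present on every node, yet training only saves the model on the master node. Copying it to all nodes by hand means the restart can no longer be automatic. Is there a way to automatically restart an interrupted training run and resume from the latest saved model?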
Bot detected that the issue body's language is not English and translated it automatically.
Title: Multi-node, multi-GPU training is prone to timeouts. If it times out, how can training automatically resume from the saved model?
any update?
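One possible approach to the question above, offered as a minimal sketch rather than a ColossalAI feature: if checkpoints are written to a directory that every node can read (for example a shared filesystem, which is an assumption here), each restarted worker can look up the newest checkpoint itself at startup. The `step_<N>` directory naming, the `DONE` marker file, and the function name `find_latest_checkpoint` are all hypothetical and would need to match whatever the training script actually writes.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical naming scheme: each checkpoint is a directory called "step_<N>".
_STEP_PATTERN = re.compile(r"^step_(\d+)$")


def find_latest_checkpoint(ckpt_root: str) -> Optional[Path]:
    """Return the finished checkpoint directory with the highest step, or None."""
    root = Path(ckpt_root)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for entry in root.iterdir():
        match = _STEP_PATTERN.match(entry.name)
        # Skip anything that is not a finished checkpoint directory.
        # The "DONE" marker file is an assumption: the saving rank would
        # create it only after the checkpoint has been fully written, so a
        # run that crashed mid-save is never picked up on restart.
        if not (match and entry.is_dir() and (entry / "DONE").exists()):
            continue
        step = int(match.group(1))
        if step > best_step:
            best_step, best_path = step, entry
    return best_path
```

At the start of the training script, the returned path (if not `None`) could be handed to the script's own checkpoint-loading logic, so a restart triggered by --max-restarts resumes without manual copying. If no shared directory is available, this lookup alone does not solve the problem, since the other nodes still cannot see the master node's files.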