dlrover
What is the relationship between DLRover and Megatron? Can I integrate DLRover with Megatron to get fault tolerance and monitoring? How can DLRover recover from a GPU going offline when tensor parallelism (TP) and pipeline parallelism (PP) need to be reorganized?
Thank you for using DLRover. I've translated your title into English; please submit issues in English in the future.
Have a good day.
Yes, you can. Fault tolerance fundamentally relies on redundancy. When a failure occurs during training, DLRover automatically restarts the affected process or relaunches its container and tries to keep training going.
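To make the restart idea concrete, here is a minimal sketch in plain Python. It does not use any DLRover or Megatron API: `train`, `supervise`, `SimulatedWorkerFailure`, and the JSON checkpoint file are all hypothetical names made up for illustration. It only shows the general pattern of a supervisor that relaunches a failed worker, which then resumes from the last saved checkpoint.

```python
import json
import os
import tempfile


class SimulatedWorkerFailure(RuntimeError):
    """Hypothetical stand-in for a crashed process or a GPU going offline."""


def train(ckpt_path, total_steps, fail_at=None):
    # Resume from the last saved step if a checkpoint exists.
    step = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            step = json.load(f)["step"]
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise SimulatedWorkerFailure(f"worker lost at step {step}")
        step += 1
        # Save a checkpoint every 2 steps (interval chosen arbitrarily).
        if step % 2 == 0:
            with open(ckpt_path, "w") as f:
                json.dump({"step": step}, f)
    return step


def supervise(ckpt_path, total_steps, max_restarts=3, fail_at=None):
    # Relaunch the worker after a failure, up to max_restarts times.
    restarts = 0
    while True:
        try:
            return train(ckpt_path, total_steps, fail_at), restarts
        except SimulatedWorkerFailure:
            restarts += 1
            if restarts > max_restarts:
                raise
            fail_at = None  # the simulated fault happens only once


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "ckpt.json")
        print(supervise(path, total_steps=10, fail_at=5))
```

Work done after the last checkpoint is lost and repeated on restart, so the checkpoint interval bounds how much progress a failure can cost.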
If you have a checkpoint of your model, you can start a new training run from it with reorganized parameters (for example, a different TP/PP layout).
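As a rough illustration of what reorganizing a checkpoint for a different TP size involves, the sketch below merges column-wise shards of one weight matrix and splits them again for a new TP degree. This is an assumption-laden toy, not Megatron's checkpoint format or a DLRover feature: `merge_shards`, `split_columns`, and `reshard` are hypothetical helpers, weights are plain nested lists, and real models also contain row-parallel and fused layers that need their own handling.

```python
def merge_shards(shards):
    """Concatenate column shards back into one full weight matrix."""
    rows = len(shards[0])
    return [sum((shard[r] for shard in shards), []) for r in range(rows)]


def split_columns(weight, tp_size):
    """Split a weight matrix column-wise into tp_size equal shards."""
    cols = len(weight[0])
    if cols % tp_size != 0:
        raise ValueError(f"{cols} columns cannot be split evenly across {tp_size} ranks")
    width = cols // tp_size
    return [
        [row[rank * width:(rank + 1) * width] for row in weight]
        for rank in range(tp_size)
    ]


def reshard(shards, new_tp_size):
    """Re-partition existing shards for a different tensor-parallel size."""
    return split_columns(merge_shards(shards), new_tp_size)


if __name__ == "__main__":
    full = [list(range(0, 8)), list(range(8, 16))]  # a 2 x 8 weight
    old_shards = split_columns(full, 4)             # saved with TP=4
    new_shards = reshard(old_shards, 2)             # resume with TP=2
    print(new_shards)
```

Merging to the full matrix first keeps the logic simple; for large models a real tool would stream shards instead of materializing every full weight in memory.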