
What is the relationship between DLRover and Megatron? Can I integrate DLRover with Megatron to get fault tolerance and monitoring? How can DLRover recover from a GPU going offline when TP and PP need to be reorganized?

Open dotsonliu opened this issue 1 year ago • 1 comments

dotsonliu avatar Aug 19 '24 07:08 dotsonliu

Thank you for using DLRover. I've translated your title into English. Please file issues in English in the future.

Have a good day

majieyue avatar Sep 19 '24 03:09 majieyue

Yes, you can. The core of fault tolerance is redundancy. When a failure occurs during training, DLRover automatically restarts or relaunches the affected process or container and tries to keep training going.
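To make the restart-on-failure idea concrete, here is a minimal sketch of a supervising loop. It is not DLRover's implementation or API. `run_with_restarts`, `launch`, and `max_restarts` are hypothetical names, and `launch` stands in for whatever starts the training process and returns its exit code.

```python
from typing import Callable


def run_with_restarts(launch: Callable[[], int], max_restarts: int = 3) -> int:
    """Call launch() until it exits with code 0 or the restart budget runs out.

    Returns the number of restarts that were needed.
    Hypothetical helper for illustration only, not part of DLRover.
    """
    restarts = 0
    while True:
        exit_code = launch()
        if exit_code == 0:
            # Training finished cleanly, so stop supervising.
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(
                f"training still failing after {max_restarts} restarts "
                f"(last exit code {exit_code})"
            )
        # Failure: relaunch. A real job would resume from its latest checkpoint.
        restarts += 1
```

In practice `launch` could wrap something like `subprocess.run([...]).returncode`, with the training script itself responsible for loading the most recent checkpoint on startup.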

If you have a checkpoint of your model, you can start a new training run from it with reorganized parameters (for example, new TP and PP sizes).
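As a rough illustration of what "reorganized" can mean after losing GPUs, the sketch below lists the (TP, PP, DP) layouts that fit a given GPU count. It assumes the usual Megatron-style constraints: TP × PP × DP equals the number of GPUs, the attention head count is divisible by TP, and the layer count is divisible by PP. `reorganized_layouts` and its parameters are hypothetical names, not a DLRover or Megatron API, and the checkpoint would still have to be resharded to match whichever layout is chosen.

```python
from typing import List, Tuple


def reorganized_layouts(
    num_gpus: int, num_layers: int, num_heads: int, max_tp: int = 8
) -> List[Tuple[int, int, int]]:
    """List (tp, pp, dp) layouts that use exactly num_gpus GPUs.

    Assumed constraints (hypothetical helper, for illustration only):
      - num_heads is divisible by tp
      - num_layers is divisible by pp
      - tp * pp * dp == num_gpus
    """
    layouts = []
    for tp in range(1, max_tp + 1):
        if num_heads % tp != 0:
            continue
        for pp in range(1, num_layers + 1):
            if num_layers % pp != 0:
                continue
            if num_gpus % (tp * pp) == 0:
                # Data-parallel size is whatever is left over.
                layouts.append((tp, pp, num_gpus // (tp * pp)))
    return layouts
```

For example, a job running TP=2, PP=4, DP=2 on 16 GPUs cannot keep that layout if 4 GPUs go offline, because 12 is not divisible by 8. Calling `reorganized_layouts(12, 32, 32)` shows which layouts remain possible, such as TP=4, PP=1, DP=3.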

BalaBalaYi avatar Nov 18 '24 11:11 BalaBalaYi