
[BUG] How to checkpoint a specific microbatch in pipeline parallelism?

Open robotsp opened this issue 3 months ago • 2 comments

Your question

I see that schedules.py contains a microbatch-level activation checkpointing implementation of https://arxiv.org/pdf/2205.05198.pdf, but I do not know how to enable it via the arguments num_microbatches_with_partial_activation_checkpoints and checkpoint_activations_microbatch.
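For context, here is my (possibly wrong) understanding of the per-microbatch gating, written as a self-contained toy sketch rather than the actual Megatron-LM code; the helper name and variables below are my own:

```python
# Toy sketch of my understanding (my own names, not the actual Megatron-LM code):
# only the first num_microbatches_with_partial_activation_checkpoints in-flight
# microbatches keep the cheaper partial (selective) checkpointing; the rest fully
# checkpoint all layers' activations.

def should_fully_checkpoint(microbatch_id: int,
                            max_outstanding_backprops: int,
                            num_partial_ckpt_microbatches: int) -> bool:
    """Return True if this microbatch should checkpoint all layers' activations."""
    return (microbatch_id % max_outstanding_backprops) >= num_partial_ckpt_microbatches


if __name__ == "__main__":
    # Example: 8 microbatches in flight, the first 2 use partial checkpointing only.
    for mb in range(8):
        flag = should_fully_checkpoint(mb, max_outstanding_backprops=8,
                                       num_partial_ckpt_microbatches=2)
        print(f"microbatch {mb}: checkpoint_activations_microbatch={flag}")
```

If that is roughly right, setting num_microbatches_with_partial_activation_checkpoints should make the schedule compute checkpoint_activations_microbatch per microbatch and pass it down, which is what I am trying to enable.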

robotsp · Apr 07 '24 14:04

I tried setting num_microbatches_with_partial_activation_checkpoints to a microbatch count, but got this error: TypeError: forward_step() takes 2 positional arguments but 3 were given

Does this mean microbatch-level checkpointing is not implemented yet? @sublee @jaredcasper @aaronp24 @dweekly
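If I read schedules.py correctly, the error happens because, once the option is set, the schedule calls my forward_step_func with checkpoint_activations_microbatch as a third positional argument, while forward_step in pretrain_gpt.py only takes (data_iterator, model). A self-contained toy reproduction of what I think is going on (not the actual Megatron-LM code):

```python
# Toy reproduction (not the actual Megatron-LM code) of how the schedule seems to
# call the user-supplied forward step once the partial-checkpointing option is set.

def schedule_calls(forward_step_func, data_iterator, model,
                   checkpoint_activations_microbatch):
    if checkpoint_activations_microbatch is None:
        # Default path: two positional arguments, as pretrain_gpt.py expects.
        return forward_step_func(data_iterator, model)
    # Partial-checkpointing path: a third positional argument is passed.
    return forward_step_func(data_iterator, model, checkpoint_activations_microbatch)


# Defined like pretrain_gpt.py's forward_step, this raises
# "TypeError: forward_step() takes 2 positional arguments but 3 were given".
def forward_step(data_iterator, model):
    return "output_tensor", "loss_func"


# Accepting the extra argument (defaulting to None) works on both paths; presumably
# the flag should then be forwarded to the model, but I have not confirmed the
# exact kwarg name for that.
def forward_step_fixed(data_iterator, model, checkpoint_activations_microbatch=None):
    return "output_tensor", "loss_func"


if __name__ == "__main__":
    schedule_calls(forward_step_fixed, iter([]), object(), True)  # works
    try:
        schedule_calls(forward_step, iter([]), object(), True)    # reproduces the error
    except TypeError as exc:
        print(exc)
```

So my guess is that the feature exists but requires the training script's forward_step to accept the extra argument; can someone confirm whether that is the intended usage?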

robotsp · Apr 08 '24 03:04

@deepakn94 Hi Deepak, do you know how to solve this problem?

robotsp · Apr 15 '24 05:04