Megatron-LM
[BUG] How to checkpoint the specific microbatch in pipeline parallelism?
Your question
I saw that schedules.py contains a microbatch-level activation checkpointing implementation of https://arxiv.org/pdf/2205.05198.pdf, but I do not know how to enable it via the arguments `num_microbatches_with_partial_activation_checkpoints` and `checkpoint_activations_microbatch`.
I tried setting a microbatch number, but got this error: `TypeError: forward_step() takes 2 positional arguments but 3 were given`.
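A guess at the cause, from reading the dispatch logic in schedules.py: when `num_microbatches_with_partial_activation_checkpoints` is set, the schedule passes a third argument (`checkpoint_activations_microbatch`) into the user-supplied forward step, so a two-argument forward step raises exactly this `TypeError`. A minimal, self-contained sketch of that pattern (the names `forward_step_two_args`, `forward_step_three_args`, and `run_forward` are illustrative, not actual Megatron-LM code):

```python
# Illustrative sketch only -- not Megatron-LM source. It mimics how a
# pipeline schedule might call the user's forward step with an optional
# per-microbatch checkpointing flag.

def forward_step_two_args(data_iterator, model):
    """Two-argument signature: fails once a third argument is passed."""
    return "output"

def forward_step_three_args(data_iterator, model,
                            checkpoint_activations_microbatch=None):
    """Also accepts the per-microbatch checkpointing flag.

    checkpoint_activations_microbatch indicates whether this particular
    microbatch should recompute (checkpoint) its activations.
    """
    return "output", checkpoint_activations_microbatch

def run_forward(forward_step_func, checkpoint_activations_microbatch=None):
    """Simplified schedule-side dispatch: pass the extra flag only when
    partial activation checkpointing is enabled."""
    if checkpoint_activations_microbatch is None:
        return forward_step_func(None, None)
    return forward_step_func(None, None, checkpoint_activations_microbatch)
```

Under this assumption, the fix would be to extend your own forward step function to accept the extra `checkpoint_activations_microbatch` parameter.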
Does this mean microbatch-level checkpointing has not been implemented yet? @sublee @jaredcasper @aaronp24 @dweekly
@deepakn94 Hi Deepak, do you know how to solve this problem?